My personal LLM evals
Part of the series "One month of LLMs"
A lot has changed in the last six months in AI. To get caught up on the capabilities of the latest models (as well as some older, cheaper ones), I wrote my own evals for 6 real-world tasks:
- `codereview`: Read a raw Git diff and look for problems.
- `hnrec`: Recommend articles from a print-out of the Hacker News front page based on a set of topics of interest.
- `loghunt`: Find suspicious lines in a web server's logs.
- `mdparse`: Write a Python function to parse a Markdown string into a structured document.
- `proofread`: Proofread a draft of a blog post for typos and grammatical errors.
- `sqlquery`: Write a moderately complex SQL query.
These evals were run against 3 models each from Google (Gemini 2.5 Pro, 2.5 Flash, and 2.5 Flash-Lite), OpenAI (GPT 4o, 4.1 Mini, and 4.1 Nano), and Anthropic (Claude Opus 4, Sonnet 4, and Haiku 3.5).
Key findings
- No one model was clearly superior across the board. Gemini 2.5 Pro dominated the `codereview` eval, but every other eval had 2 or 3 models that were competitive at the top.
- Response quality is not always consistent. Given the exact same prompt, Gemini Flash-Lite produced a perfect solution to `mdparse`, and a solution that failed half the tests, within a span of 5 minutes.
- The most expensive models aren't always the best. Sonnet matches or surpasses Opus on 4 of my 6 benchmarks; Gemini Flash-Lite's peak performance on `mdparse` was better than Gemini Pro's – and 4x faster.
- The best model differs by task. Sonnet topped the `loghunt` benchmark but was second-to-last in `mdparse`; GPT 4o had the best average performance on `mdparse` but was middling at `proofread`.
codereview
https://github.com/iafisher/llm-evals/tree/master/codereview
`codereview` has the LLM look at a real ~1,000-line diff that I took from one of my projects and edited to create problems. (I first tried to get ChatGPT to produce a problematic diff on its own, without success.)
- I removed the `f` prefix from a Python f-string (see the sketch after this list).
- A copy-paste error left the name of a logger not matching the name of its module. (This was actually in the original diff, and I discovered it by running this eval.)
- I inverted an integer comparison so it went the wrong way.
- I gave a test function a misleading name.
- I created a copy-paste error in a sequence of long test assertions by changing `lines[2]` in the third line to `lines[1]`.
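To make the f-string and comparison bugs concrete, they were of roughly this shape (illustrative code, not the actual lines from the diff):

```python
# Illustrative only - not the actual code from the diff. The f-string bug and
# the inverted comparison looked roughly like this.

# Planted bug: dropping the f prefix, so the placeholder is never interpolated.
name = "codereview"
greeting = "running eval: {name}"   # should be f"running eval: {name}"

# Planted bug: an integer comparison flipped the wrong way.
def has_enough_results(results: list) -> bool:
    return len(results) < 3   # the surrounding code expects >= 3
```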
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
google-gemini-2.5-pro | 6 / 6 | 5 / 6 | 5.33 | 24.4s | 0.0% |
google-gemini-2.5-flash | 4 / 6 | 3 / 6 | 3.67 | 7.8s | 0.0% |
openai-gpt-4o | 4 / 6 | 3 / 6 | 3.33 | 5.0s | 0.0% |
anthropic-claude-sonnet-4-20250514 | 4 / 6 | 0 / 6 | 2.67 | 9.6s | 33.3% |
google-gemini-2.5-flash-lite-preview-06-17 | 3 / 6 | 2 / 6 | 2.33 | 4.2s | 0.0% |
openai-gpt-4.1-mini | 2 / 6 | 2 / 6 | 2.00 | 2.3s | 0.0% |
anthropic-claude-opus-4-20250514 | 2 / 6 | 1 / 6 | 1.67 | 6.6s | 0.0% |
openai-gpt-4.1-nano | 1 / 6 | 1 / 6 | 1.00 | 1.1s | 0.0% |
anthropic-claude-3-5-haiku-20241022 | 1 / 6 | 0 / 6 | 0.67 | 2.6s | 33.3% |
hnrec
https://github.com/iafisher/llm-evals/tree/master/hnrec
I scraped the front page of Hacker News (titles, URLs, and top comments) to create the `hnrec` eval for recommending content. This one is a bit subjective – I picked a few stories that I felt it should recommend based on the criteria, a larger pool that it wouldn't be penalized for recommending, and a few that specifically failed the criteria and should not be recommended.
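In other words, the scoring works off three buckets of stories. Here is a sketch of that idea with invented titles – the eval's real story lists and exact scoring rules are in the repo:

```python
# Sketch of the three-bucket idea behind hnrec's scoring, with made-up story
# titles. The actual lists and scoring script live in the repo.
MUST_RECOMMEND = {
    "Show HN: A tiny Datalog engine in 500 lines",
    "How Postgres plans a query",
}
MUST_NOT_RECOMMEND = {
    "Ask HN: Best standing desk in 2025?",
}
# Everything else on the page is neutral: recommending it neither helps nor hurts.

def score(recommended: set[str]) -> int:
    hits = len(MUST_RECOMMEND & recommended)
    misses = len(MUST_NOT_RECOMMEND & recommended)
    return hits - misses
```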
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
openai-gpt-4.1-mini | 6 / 7 | 5 / 7 | 5.33 | 0.9s | 0.0% |
google-gemini-2.5-flash-lite-preview-06-17 | 6 / 7 | 5 / 7 | 5.33 | 5.5s | 0.0% |
anthropic-claude-opus-4-20250514 | 6 / 7 | 5 / 7 | 5.33 | 11.7s | 0.0% |
openai-gpt-4o | 5 / 7 | 5 / 7 | 5.00 | 1.8s | 0.0% |
google-gemini-2.5-flash | 5 / 7 | 5 / 7 | 5.00 | 6.7s | 0.0% |
anthropic-claude-sonnet-4-20250514 | 5 / 7 | 4 / 7 | 4.67 | 7.0s | 0.0% |
google-gemini-2.5-pro | 5 / 7 | 4 / 7 | 4.33 | 25.8s | 0.0% |
openai-gpt-4.1-nano | 4 / 7 | 3 / 7 | 3.33 | 0.6s | 0.0% |
anthropic-claude-3-5-haiku-20241022 | 0 / 7 | 0 / 7 | 0.00 | 5.1s | 100.0% |
loghunt
https://github.com/iafisher/llm-evals/tree/master/loghunt
In `loghunt`, the LLM looks through 1,000 lines of webserver logs to find evidence of suspicious activity.
ChatGPT wrote the Python script that generated the log file.
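That script isn't reproduced here, but the general shape is easy to imagine: mostly routine access-log lines with a few suspicious requests planted among them. A hypothetical sketch (the paths, IPs, and counts are made up, not the real generator's):

```python
import random
from datetime import datetime, timedelta

# Hypothetical sketch of a log generator in the spirit of loghunt, not the
# script ChatGPT actually wrote: mostly routine access-log lines, with a few
# suspicious requests (path traversal, login probing) planted at random.
NORMAL_PATHS = ["/", "/about", "/blog/llm-evals", "/static/style.css"]
SUSPICIOUS_PATHS = ["/admin/../../etc/passwd", "/wp-login.php", "/search?q=1%27%20OR%20%271%27=%271"]

def make_line(ts: datetime, path: str, status: int) -> str:
    ip = f"192.0.2.{random.randint(1, 254)}"
    return f'{ip} - - [{ts:%d/%b/%Y:%H:%M:%S +0000}] "GET {path} HTTP/1.1" {status} {random.randint(200, 5000)}'

def generate(n: int = 1000, n_suspicious: int = 5) -> list[str]:
    start = datetime(2025, 6, 1)
    lines = [make_line(start + timedelta(seconds=i), random.choice(NORMAL_PATHS), 200) for i in range(n)]
    for i in random.sample(range(n), k=n_suspicious):
        lines[i] = make_line(start + timedelta(seconds=i), random.choice(SUSPICIOUS_PATHS), random.choice([200, 403, 404]))
    return lines

if __name__ == "__main__":
    print("\n".join(generate()))
```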
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
anthropic-claude-sonnet-4-20250514 | 7 / 7 | 7 / 7 | 7.00 | 11.7s | 0.0% |
anthropic-claude-opus-4-20250514 | 7 / 7 | 7 / 7 | 7.00 | 18.6s | 0.0% |
openai-gpt-4o | 7 / 7 | 7 / 7 | 7.00 | 19.2s | 0.0% |
google-gemini-2.5-pro | 7 / 7 | 7 / 7 | 7.00 | 24.4s | 0.0% |
openai-gpt-4.1-mini | 7 / 7 | 5 / 7 | 6.00 | 16.4s | 0.0% |
google-gemini-2.5-flash | 7 / 7 | 0 / 7 | 4.67 | 16.0s | 33.3% |
anthropic-claude-3-5-haiku-20241022 | 5 / 7 | 0 / 7 | 1.67 | 15.2s | 66.7% |
google-gemini-2.5-flash-lite-preview-06-17 | 1 / 7 | 1 / 7 | 1.00 | 3.3s | 0.0% |
openai-gpt-4.1-nano | 1 / 7 | 0 / 7 | 0.67 | 1.9s | 33.3% |
mdparse
https://github.com/iafisher/llm-evals/tree/master/mdparse
`mdparse` is a coding eval – write a moderately complex Python function to parse Markdown. I included a lot of specifics about how to deal with edge cases (whitespace preservation, errant title headers, etc.) to give the LLM a fair chance to pass the full test suite. The test suite itself was not given to the model.
ChatGPT wrote most of the test cases.
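To give a sense of what those test cases check – the names and types here are my own guesses for illustration, not the eval's actual interface – a test plus a toy parser that passes it might look like:

```python
from dataclasses import dataclass, field

# Hypothetical types and a sample test in the flavor of mdparse. The real prompt
# in the repo pins down the exact interface and edge-case rules (whitespace
# preservation, stray title headers, and so on); this is only the general shape.
@dataclass
class Section:
    title: str
    body: list[str] = field(default_factory=list)

@dataclass
class Document:
    title: str | None
    sections: list[Section]

def parse_markdown(text: str) -> Document:
    """Toy parser: nowhere near the full spec, just enough for the sample test."""
    title, sections = None, []
    for line in text.splitlines():
        if line.startswith("## "):
            sections.append(Section(title=line[3:]))
        elif line.startswith("# ") and title is None:
            title = line[2:]
        elif sections and line.strip():
            sections[-1].body.append(line)
    return Document(title=title, sections=sections)

def test_basic_structure():
    doc = parse_markdown("# My post\n\n## Intro\nfirst line\n  indented line\n")
    assert doc.title == "My post"
    assert doc.sections[0].title == "Intro"
    assert doc.sections[0].body == ["first line", "  indented line"]  # leading whitespace preserved

test_basic_structure()
```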
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
google-gemini-2.5-flash-lite-preview-06-17 | 13 / 13 | 7 / 13 | 10.67 | 7.1s | 0.0% |
openai-gpt-4o | 12 / 13 | 10 / 13 | 11.00 | 8.2s | 0.0% |
anthropic-claude-opus-4-20250514 | 12 / 13 | 9 / 13 | 10.00 | 19.6s | 0.0% |
google-gemini-2.5-pro | 12 / 13 | 1 / 13 | 8.33 | 30.7s | 0.0% |
openai-gpt-4.1-nano | 12 / 13 | 0 / 13 | 8.00 | 4.7s | 0.0% |
google-gemini-2.5-flash | 12 / 13 | 0 / 13 | 5.33 | 15.5s | 0.0% |
openai-gpt-4.1-mini | 11 / 13 | 9 / 13 | 10.33 | 21.8s | 0.0% |
anthropic-claude-sonnet-4-20250514 | 11 / 13 | 8 / 13 | 9.33 | 17.6s | 0.0% |
anthropic-claude-3-5-haiku-20241022 | 8 / 13 | 4 / 13 | 5.67 | 13.6s | 0.0% |
proofread
https://github.com/iafisher/llm-evals/tree/master/proofread
`proofread` tests the model's facility with English prose: the LLM reads a draft of one of my blog posts in which I introduced a few intentional errors.
- I misspelled 'implementation' as 'implementaton'.
- I changed 'though it pushed the limits' to 'though it push the limits'.
- I left an 'EDIT' marker in the text.
- I removed an opening parenthesis.
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
anthropic-claude-sonnet-4-20250514 | 5 / 5 | 5 / 5 | 5.00 | 5.4s | 0.0% |
google-gemini-2.5-pro | 5 / 5 | 5 / 5 | 5.00 | 15.3s | 0.0% |
google-gemini-2.5-flash | 5 / 5 | 3 / 5 | 4.00 | 1.4s | 0.0% |
google-gemini-2.5-flash-lite-preview-06-17 | 5 / 5 | 2 / 5 | 4.00 | 3.5s | 0.0% |
openai-gpt-4o | 4 / 5 | 3 / 5 | 3.67 | 1.9s | 0.0% |
anthropic-claude-opus-4-20250514 | 3 / 5 | 3 / 5 | 3.00 | 11.7s | 0.0% |
openai-gpt-4.1-nano | 2 / 5 | 2 / 5 | 2.00 | 0.7s | 0.0% |
openai-gpt-4.1-mini | 2 / 5 | 0 / 5 | 1.33 | 0.9s | 33.3% |
anthropic-claude-3-5-haiku-20241022 | 0 / 5 | 0 / 5 | 0.00 | 2.4s | 100.0% |
sqlquery
https://github.com/iafisher/llm-evals/tree/master/sqlquery
`sqlquery` gives the LLM a SQL schema and asks it to write a query from a detailed prose description. The eval runs the query against an in-memory SQLite database populated with test data.
ChatGPT helped me with the prompt, the solution, and the test data, so perhaps it's not a surprise that most of the models crushed this benchmark.
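The harness side of this is small. Here's a minimal sketch of the idea – the schema, data, and expected rows below are placeholders, not the eval's actual ones:

```python
import sqlite3

# Minimal sketch of checking a candidate query against an in-memory SQLite
# database. The schema, test data, and expected output here are placeholders;
# the eval's real schema, data, and reference solution are in the repo.
def score_query(candidate_sql: str) -> bool:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
        INSERT INTO orders VALUES (1, 'ada', 20), (2, 'grace', 5), (3, 'ada', 42);
    """)
    expected = [("ada", 62), ("grace", 5)]
    try:
        rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # the query didn't even run
    return sorted(rows) == sorted(expected)

print(score_query("SELECT customer, SUM(total) FROM orders GROUP BY customer"))
```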
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
google-gemini-2.5-flash-lite-preview-06-17 | 6 / 6 | 6 / 6 | 6.00 | 3.7s | 0.0% |
openai-gpt-4o | 6 / 6 | 6 / 6 | 6.00 | 6.6s | 0.0% |
openai-gpt-4.1-mini | 6 / 6 | 6 / 6 | 6.00 | 7.3s | 0.0% |
anthropic-claude-sonnet-4-20250514 | 6 / 6 | 6 / 6 | 6.00 | 8.0s | 0.0% |
google-gemini-2.5-flash | 6 / 6 | 6 / 6 | 6.00 | 8.6s | 0.0% |
anthropic-claude-opus-4-20250514 | 6 / 6 | 6 / 6 | 6.00 | 13.1s | 0.0% |
google-gemini-2.5-pro | 6 / 6 | 6 / 6 | 6.00 | 22.0s | 0.0% |
openai-gpt-4.1-nano | 3 / 6 | 3 / 6 | 3.00 | 1.5s | 0.0% |
anthropic-claude-3-5-haiku-20241022 | 3 / 6 | 3 / 6 | 3.00 | 5.9s | 0.0% |
Technical details
Each eval was run 3 times for each model. I set temperature to 0.7 and max tokens to 4,000 for all models. For Gemini, I set the thinking budget to 2,000 tokens; with the default 'dynamic' thinking budget, it would frequently use up all its tokens thinking and stop in the middle of the final output.
'Malformed' responses could not be parsed to extract the actual solution (e.g., last line was not JSON, or JSON was invalid). The system prompts always give clear instructions on how to format the output, but even the best models sometimes stray. I tried to write the scoring scripts to be liberal in what they accepted.
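As a sketch of what 'liberal' means here (this is not the scoring scripts' exact logic): scan from the bottom up, take the first line that parses as JSON, and shrug off Markdown fences and stray prose around it.

```python
import json

# Sketch of a lenient extractor, not the scoring scripts' exact logic: return
# the last line of the response that parses as JSON, tolerating Markdown code
# fences and trailing chatter.
def extract_json(response: str):
    for line in reversed(response.strip().splitlines()):
        line = line.strip().strip("`")  # drop stray code-fence backticks
        if not line:
            continue
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            continue
    return None  # malformed: no parseable JSON line found

print(extract_json('Here you go:\n```json\n{"problems": ["missing f prefix"]}\n```'))
```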
Only `mdparse` and `sqlquery` involved actually executing LLM-written code, but I ran all evals with `sandbox-exec` (macOS) to be safe.
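A wrapper along these lines gives the general idea; the sandbox profile below only denies network access and is a simplified illustration, not the exact setup behind these runs.

```python
import subprocess

# Illustration of wrapping a solution in macOS sandbox-exec; the profile here
# only denies network access and stands in for a more restrictive real profile.
PROFILE = "(version 1) (allow default) (deny network*)"

def run_sandboxed(script_path: str) -> subprocess.CompletedProcess:
    return subprocess.run(
        ["sandbox-exec", "-p", PROFILE, "python3", script_path],
        capture_output=True, text=True, timeout=60,
    )
```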