My personal LLM evals
Part of the series "One month of LLMs"
A lot has changed in the last six months in AI. To get caught up on the capabilities of the latest models (as well as some older, cheaper ones), I wrote my own evals for 6 real-world tasks:
- `codereview`: Read a raw Git diff and look for problems.
- `hnrec`: Recommend articles from a print-out of the Hacker News front page based on a set of topics of interest.
- `loghunt`: Find suspicious lines in a web server's logs.
- `mdparse`: Write a Python function to parse a Markdown string into a structured document.
- `proofread`: Proofread a draft of a blog post for typos and grammatical errors.
- `sqlquery`: Write a moderately complex SQL query.
These evals were run against 3 models each from Google (Gemini 2.5 Pro, 2.5 Flash, and 2.5 Flash-Lite), OpenAI (GPT 4o, 4.1 Mini, and 4.1 Nano), and Anthropic (Claude Opus 4, Sonnet 4, and Haiku 3.5).
Key findings
- No one model was clearly superior across the board. Gemini 2.5 Pro dominated the `codereview` eval, but every other eval had 2 or 3 models that were competitive at the top.
- Response quality is not always consistent. Given the exact same prompt, Gemini Flash-Lite produced a perfect solution to `mdparse`, and a solution that failed half the tests, within a span of 5 minutes.
- The most expensive models aren't always the best. Sonnet matches or surpasses Opus on 4 of my 6 benchmarks; Gemini Flash-Lite's peak performance on `mdparse` was better than Gemini Pro's – and 4x faster.
- The best model differs by task. Sonnet topped the `loghunt` benchmark but was second-to-last in `mdparse`; GPT 4o had the best average performance on `mdparse` but was middling at `proofread`.
codereview
https://github.com/iafisher/llm-evals/tree/master/codereview
`codereview` has the LLM look at a real ~1,000-line diff that I took from one of my projects and edited to create problems. (I first tried to get ChatGPT to produce a problematic diff on its own, without success.)
- I removed the `f` prefix from a Python f-string (see the sketch after this list).
- A copy-paste error left the name of a logger not matching the name of its module. (This was actually in the original diff, and I discovered it by running this eval.)
- I inverted an integer comparison so it went the wrong way.
- I gave a test function a misleading name.
- I created a copy-paste error in a sequence of long test assertions by changing `lines[2]` in the third line to `lines[1]`.
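To make the f-string and comparison bugs concrete, they were of roughly this shape (illustrative code, not the actual lines from the diff):

```python
# Illustrative only - not the actual code from the diff. The f-string bug and
# the inverted comparison looked roughly like this.

# Planted bug: dropping the f prefix, so the placeholder is never interpolated.
name = "codereview"
greeting = "running eval: {name}"   # should be f"running eval: {name}"

# Planted bug: an integer comparison flipped the wrong way.
def has_enough_results(results: list) -> bool:
    return len(results) < 3   # the surrounding code expects >= 3
```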
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
google-gemini-2.5-pro | 6 / 6 | 5 / 6 | 5.33 | 24.4s | 0.0% |
google-gemini-2.5-flash | 4 / 6 | 3 / 6 | 3.67 | 7.8s | 0.0% |
openai-gpt-4o | 4 / 6 | 3 / 6 | 3.33 | 5.0s | 0.0% |
anthropic-claude-sonnet-4-20250514 | 4 / 6 | 0 / 6 | 2.67 | 9.6s | 33.3% |
google-gemini-2.5-flash-lite-preview-06-17 | 3 / 6 | 2 / 6 | 2.33 | 4.2s | 0.0% |
openai-gpt-4.1-mini | 2 / 6 | 2 / 6 | 2.00 | 2.3s | 0.0% |
anthropic-claude-opus-4-20250514 | 2 / 6 | 1 / 6 | 1.67 | 6.6s | 0.0% |
openai-gpt-4.1-nano | 1 / 6 | 1 / 6 | 1.00 | 1.1s | 0.0% |
anthropic-claude-3-5-haiku-20241022 | 1 / 6 | 0 / 6 | 0.67 | 2.6s | 33.3% |
hnrec
https://github.com/iafisher/llm-evals/tree/master/hnrec
I scraped the front page of Hacker News (titles, URLs, and top comments) to create the `hnrec` eval for recommending content. This one is a bit subjective – I picked a few stories that I felt it should recommend based on the criteria, a larger pool that it wouldn't be penalized for recommending, and a few that specifically failed the criteria and should not be recommended.
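In other words, the scoring works off three buckets of stories. Here is a sketch of that idea with invented titles – the eval's real story lists and exact scoring rules are in the repo:

```python
# Sketch of the three-bucket idea behind hnrec's scoring, with made-up story
# titles. The actual lists and scoring script live in the repo.
MUST_RECOMMEND = {
    "Show HN: A tiny Datalog engine in 500 lines",
    "How Postgres plans a query",
}
MUST_NOT_RECOMMEND = {
    "Ask HN: Best standing desk in 2025?",
}
# Everything else on the page is neutral: recommending it neither helps nor hurts.

def score(recommended: set[str]) -> int:
    hits = len(MUST_RECOMMEND & recommended)
    misses = len(MUST_NOT_RECOMMEND & recommended)
    return hits - misses
```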
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
openai-gpt-4.1-mini | 6 / 7 | 5 / 7 | 5.33 | 0.9s | 0.0% |
google-gemini-2.5-flash-lite-preview-06-17 | 6 / 7 | 5 / 7 | 5.33 | 5.5s | 0.0% |
anthropic-claude-opus-4-20250514 | 6 / 7 | 5 / 7 | 5.33 | 11.7s | 0.0% |
openai-gpt-4o | 5 / 7 | 5 / 7 | 5.00 | 1.8s | 0.0% |
google-gemini-2.5-flash | 5 / 7 | 5 / 7 | 5.00 | 6.7s | 0.0% |
anthropic-claude-sonnet-4-20250514 | 5 / 7 | 4 / 7 | 4.67 | 7.0s | 0.0% |
google-gemini-2.5-pro | 5 / 7 | 4 / 7 | 4.33 | 25.8s | 0.0% |
openai-gpt-4.1-nano | 4 / 7 | 3 / 7 | 3.33 | 0.6s | 0.0% |
anthropic-claude-3-5-haiku-20241022 | 0 / 7 | 0 / 7 | 0.00 | 5.1s | 100.0% |
loghunt
https://github.com/iafisher/llm-evals/tree/master/loghunt
In `loghunt`, the LLM looks through 1,000 lines of webserver logs to find evidence of suspicious activity.
ChatGPT wrote the Python script that generated the log file.
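That script isn't reproduced here, but the general shape is easy to imagine: mostly routine access-log lines with a few suspicious requests planted among them. A hypothetical sketch (the paths, IPs, and counts are made up, not the real generator's):

```python
import random
from datetime import datetime, timedelta

# Hypothetical sketch of a log generator in the spirit of loghunt, not the
# script ChatGPT actually wrote: mostly routine access-log lines, with a few
# suspicious requests (path traversal, login probing) planted at random.
NORMAL_PATHS = ["/", "/about", "/blog/llm-evals", "/static/style.css"]
SUSPICIOUS_PATHS = ["/admin/../../etc/passwd", "/wp-login.php", "/search?q=1%27%20OR%20%271%27=%271"]

def make_line(ts: datetime, path: str, status: int) -> str:
    ip = f"192.0.2.{random.randint(1, 254)}"
    return f'{ip} - - [{ts:%d/%b/%Y:%H:%M:%S +0000}] "GET {path} HTTP/1.1" {status} {random.randint(200, 5000)}'

def generate(n: int = 1000, n_suspicious: int = 5) -> list[str]:
    start = datetime(2025, 6, 1)
    lines = [make_line(start + timedelta(seconds=i), random.choice(NORMAL_PATHS), 200) for i in range(n)]
    for i in random.sample(range(n), k=n_suspicious):
        lines[i] = make_line(start + timedelta(seconds=i), random.choice(SUSPICIOUS_PATHS), random.choice([200, 403, 404]))
    return lines

if __name__ == "__main__":
    print("\n".join(generate()))
```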
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
anthropic-claude-sonnet-4-20250514 | 7 / 7 | 7 / 7 | 7.00 | 11.7s | 0.0% |
anthropic-claude-opus-4-20250514 | 7 / 7 | 7 / 7 | 7.00 | 18.6s | 0.0% |
openai-gpt-4o | 7 / 7 | 7 / 7 | 7.00 | 19.2s | 0.0% |
google-gemini-2.5-pro | 7 / 7 | 7 / 7 | 7.00 | 24.4s | 0.0% |
openai-gpt-4.1-mini | 7 / 7 | 5 / 7 | 6.00 | 16.4s | 0.0% |
google-gemini-2.5-flash | 7 / 7 | 0 / 7 | 4.67 | 16.0s | 33.3% |
anthropic-claude-3-5-haiku-20241022 | 5 / 7 | 0 / 7 | 1.67 | 15.2s | 66.7% |
google-gemini-2.5-flash-lite-preview-06-17 | 1 / 7 | 1 / 7 | 1.00 | 3.3s | 0.0% |
openai-gpt-4.1-nano | 1 / 7 | 0 / 7 | 0.67 | 1.9s | 33.3% |
mdparse
https://github.com/iafisher/llm-evals/tree/master/mdparse
`mdparse` is a coding eval – write a moderately complex Python function to parse Markdown. I included a lot of specifics about how to deal with edge cases (whitespace preservation, errant title headers, etc.) to give the LLM a fair chance to pass the full test suite. The test suite itself was not given to the model.
ChatGPT wrote most of the test cases.
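To give a sense of what those test cases check – the names and types here are my own guesses for illustration, not the eval's actual interface – a test plus a toy parser that passes it might look like:

```python
from dataclasses import dataclass, field

# Hypothetical types and a sample test in the flavor of mdparse. The real prompt
# in the repo pins down the exact interface and edge-case rules (whitespace
# preservation, stray title headers, and so on); this is only the general shape.
@dataclass
class Section:
    title: str
    body: list[str] = field(default_factory=list)

@dataclass
class Document:
    title: str | None
    sections: list[Section]

def parse_markdown(text: str) -> Document:
    """Toy parser: nowhere near the full spec, just enough for the sample test."""
    title, sections = None, []
    for line in text.splitlines():
        if line.startswith("## "):
            sections.append(Section(title=line[3:]))
        elif line.startswith("# ") and title is None:
            title = line[2:]
        elif sections and line.strip():
            sections[-1].body.append(line)
    return Document(title=title, sections=sections)

def test_basic_structure():
    doc = parse_markdown("# My post\n\n## Intro\nfirst line\n  indented line\n")
    assert doc.title == "My post"
    assert doc.sections[0].title == "Intro"
    assert doc.sections[0].body == ["first line", "  indented line"]  # leading whitespace preserved

test_basic_structure()
```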
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
google-gemini-2.5-flash-lite-preview-06-17 | 13 / 13 | 7 / 13 | 10.67 | 7.1s | 0.0% |
openai-gpt-4o | 12 / 13 | 10 / 13 | 11.00 | 8.2s | 0.0% |
anthropic-claude-opus-4-20250514 | 12 / 13 | 9 / 13 | 10.00 | 19.6s | 0.0% |
google-gemini-2.5-pro | 12 / 13 | 1 / 13 | 8.33 | 30.7s | 0.0% |
openai-gpt-4.1-nano | 12 / 13 | 0 / 13 | 8.00 | 4.7s | 0.0% |
google-gemini-2.5-flash | 12 / 13 | 0 / 13 | 5.33 | 15.5s | 0.0% |
openai-gpt-4.1-mini | 11 / 13 | 9 / 13 | 10.33 | 21.8s | 0.0% |
anthropic-claude-sonnet-4-20250514 | 11 / 13 | 8 / 13 | 9.33 | 17.6s | 0.0% |
anthropic-claude-3-5-haiku-20241022 | 8 / 13 | 4 / 13 | 5.67 | 13.6s | 0.0% |
proofread
https://github.com/iafisher/llm-evals/tree/master/proofread
`proofread` tests the model's facility with English prose: the LLM reads a draft of one of my blog posts in which I introduced a few intentional errors.
- I misspelled 'implementation' as 'implementaton'.
- I changed 'though it pushed the limits' to 'though it push the limits'.
- I left an 'EDIT' marker in the text.
- I removed an opening parenthesis.
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
anthropic-claude-sonnet-4-20250514 | 5 / 5 | 5 / 5 | 5.00 | 5.4s | 0.0% |
google-gemini-2.5-pro | 5 / 5 | 5 / 5 | 5.00 | 15.3s | 0.0% |
google-gemini-2.5-flash | 5 / 5 | 3 / 5 | 4.00 | 1.4s | 0.0% |
google-gemini-2.5-flash-lite-preview-06-17 | 5 / 5 | 2 / 5 | 4.00 | 3.5s | 0.0% |
openai-gpt-4o | 4 / 5 | 3 / 5 | 3.67 | 1.9s | 0.0% |
anthropic-claude-opus-4-20250514 | 3 / 5 | 3 / 5 | 3.00 | 11.7s | 0.0% |
openai-gpt-4.1-nano | 2 / 5 | 2 / 5 | 2.00 | 0.7s | 0.0% |
openai-gpt-4.1-mini | 2 / 5 | 0 / 5 | 1.33 | 0.9s | 33.3% |
anthropic-claude-3-5-haiku-20241022 | 0 / 5 | 0 / 5 | 0.00 | 2.4s | 100.0% |
sqlquery
https://github.com/iafisher/llm-evals/tree/master/sqlquery
`sqlquery` gives the LLM a SQL schema and asks it to write a query from a detailed prose description. The eval runs the query against an in-memory SQLite database populated with test data.
ChatGPT helped me with the prompt, the solution, and the test data, so perhaps it's not a surprise that most of the models crushed this benchmark.
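The harness side of this is small. Here's a minimal sketch of the idea – the schema, data, and expected rows below are placeholders, not the eval's actual ones:

```python
import sqlite3

# Minimal sketch of checking a candidate query against an in-memory SQLite
# database. The schema, test data, and expected output here are placeholders;
# the eval's real schema, data, and reference solution are in the repo.
def score_query(candidate_sql: str) -> bool:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
        INSERT INTO orders VALUES (1, 'ada', 20), (2, 'grace', 5), (3, 'ada', 42);
    """)
    expected = [("ada", 62), ("grace", 5)]
    try:
        rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # the query didn't even run
    return sorted(rows) == sorted(expected)

print(score_query("SELECT customer, SUM(total) FROM orders GROUP BY customer"))
```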
Model | Best | Worst | Mean | Mean duration | Malformed |
---|---|---|---|---|---|
google-gemini-2.5-flash-lite-preview-06-17 | 6 / 6 | 6 / 6 | 6.00 | 3.7s | 0.0% |
openai-gpt-4o | 6 / 6 | 6 / 6 | 6.00 | 6.6s | 0.0% |
openai-gpt-4.1-mini | 6 / 6 | 6 / 6 | 6.00 | 7.3s | 0.0% |
anthropic-claude-sonnet-4-20250514 | 6 / 6 | 6 / 6 | 6.00 | 8.0s | 0.0% |
google-gemini-2.5-flash | 6 / 6 | 6 / 6 | 6.00 | 8.6s | 0.0% |
anthropic-claude-opus-4-20250514 | 6 / 6 | 6 / 6 | 6.00 | 13.1s | 0.0% |
google-gemini-2.5-pro | 6 / 6 | 6 / 6 | 6.00 | 22.0s | 0.0% |
openai-gpt-4.1-nano | 3 / 6 | 3 / 6 | 3.00 | 1.5s | 0.0% |
anthropic-claude-3-5-haiku-20241022 | 3 / 6 | 3 / 6 | 3.00 | 5.9s | 0.0% |
Technical details
Each eval was run 3 times for each model. I set temperature to 0.7 and max tokens to 4,000 for all models. For Gemini, I set the thinking budget to 2,000 tokens; with the default 'dynamic' thinking budget, it would frequently use up all its tokens thinking and stop in the middle of the final output.
'Malformed' responses could not be parsed to extract the actual solution (e.g., last line was not JSON, or JSON was invalid). The system prompts always give clear instructions on how to format the output, but even the best models sometimes stray. I tried to write the scoring scripts to be liberal in what they accepted.
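As a sketch of what 'liberal' means here (this is not the scoring scripts' exact logic): scan from the bottom up, take the first line that parses as JSON, and shrug off Markdown fences and stray prose around it.

```python
import json

# Sketch of a lenient extractor, not the scoring scripts' exact logic: return
# the last line of the response that parses as JSON, tolerating Markdown code
# fences and trailing chatter.
def extract_json(response: str):
    for line in reversed(response.strip().splitlines()):
        line = line.strip().strip("`")  # drop stray code-fence backticks
        if not line:
            continue
        try:
            return json.loads(line)
        except json.JSONDecodeError:
            continue
    return None  # malformed: no parseable JSON line found

print(extract_json('Here you go:\n```json\n{"problems": ["missing f prefix"]}\n```'))
```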
Only `mdparse` and `sqlquery` involved actually executing LLM-written code, but I ran all evals with `sandbox-exec` (macOS) to be safe.
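A wrapper along these lines gives the general idea; the sandbox profile below only denies network access and is a simplified illustration, not the exact setup behind these runs.

```python
import subprocess

# Illustration of wrapping a solution in macOS sandbox-exec; the profile here
# only denies network access and stands in for a more restrictive real profile.
PROFILE = "(version 1) (allow default) (deny network*)"

def run_sandboxed(script_path: str) -> subprocess.CompletedProcess:
    return subprocess.run(
        ["sandbox-exec", "-p", PROFILE, "python3", script_path],
        capture_output=True, text=True, timeout=60,
    )
```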