
My personal LLM evals

Part of the series "One month of LLMs"

A lot has changed in the last six months in AI. To get caught up on the capabilities of the latest models (as well as some older, cheaper ones), I wrote my own evals for six real-world tasks: codereview, hnrec, loghunt, mdparse, proofread, and sqlquery.

These evals were run against 3 models each from Google (Gemini 2.5 Pro, 2.5 Flash, and 2.5 Flash-Lite), OpenAI (GPT-4o, 4.1 Mini, and 4.1 Nano), and Anthropic (Claude Opus 4, Sonnet 4, and Haiku 3.5).

Key findings

codereview

https://github.com/iafisher/llm-evals/tree/master/codereview

codereview has the LLM look at a real ~1,000-line diff that I took from one of my projects and edited to introduce problems. (I first tried to get ChatGPT to produce a problematic diff on its own, without success.)

Model Best Worst Mean Mean duration Malformed
google-gemini-2.5-pro 6 / 6 5 / 6 5.33 24.4s 0.0%
google-gemini-2.5-flash 4 / 6 3 / 6 3.67 7.8s 0.0%
openai-gpt-4o 4 / 6 3 / 6 3.33 5.0s 0.0%
anthropic-claude-sonnet-4-20250514 4 / 6 0 / 6 2.67 9.6s 33.3%
google-gemini-2.5-flash-lite-preview-06-17 3 / 6 2 / 6 2.33 4.2s 0.0%
openai-gpt-4.1-mini 2 / 6 2 / 6 2.00 2.3s 0.0%
anthropic-claude-opus-4-20250514 2 / 6 1 / 6 1.67 6.6s 0.0%
openai-gpt-4.1-nano 1 / 6 1 / 6 1.00 1.1s 0.0%
anthropic-claude-3-5-haiku-20241022 1 / 6 0 / 6 0.67 2.6s 33.3%

hnrec

https://github.com/iafisher/llm-evals/tree/master/hnrec

I scraped the front page of Hacker News (titles, URLs, and top comments) to create the hnrec eval for recommending content. This one is a bit subjective – I picked a few stories that I felt it should recommend based on the criteria, a larger pool that it wouldn't be penalized for recommending, and a few that specifically failed the criteria and should not be recommended.
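
The actual scoring code is in the repo; the gist of the three-bucket idea is something like the sketch below (the function and the one-point-per-item breakdown are my own illustration, not lifted from the repo):

```python
# Illustrative only: one possible way to score a set of recommendations
# against "must recommend" / "neutral" / "must not recommend" buckets.
def score_recommendations(recommended: set[str],
                          must_recommend: set[str],
                          must_not_recommend: set[str]) -> int:
    points = 0
    # One point for each story the model was expected to surface.
    points += len(recommended & must_recommend)
    # One point for each disqualified story the model correctly left out.
    points += len(must_not_recommend - recommended)
    # Stories in the neutral pool neither gain nor lose points.
    return points
```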

Model Best Worst Mean Mean duration Malformed
openai-gpt-4.1-mini 6 / 7 5 / 7 5.33 0.9s 0.0%
google-gemini-2.5-flash-lite-preview-06-17 6 / 7 5 / 7 5.33 5.5s 0.0%
anthropic-claude-opus-4-20250514 6 / 7 5 / 7 5.33 11.7s 0.0%
openai-gpt-4o 5 / 7 5 / 7 5.00 1.8s 0.0%
google-gemini-2.5-flash 5 / 7 5 / 7 5.00 6.7s 0.0%
anthropic-claude-sonnet-4-20250514 5 / 7 4 / 7 4.67 7.0s 0.0%
google-gemini-2.5-pro 5 / 7 4 / 7 4.33 25.8s 0.0%
openai-gpt-4.1-nano 4 / 7 3 / 7 3.33 0.6s 0.0%
anthropic-claude-3-5-haiku-20241022 0 / 7 0 / 7 0.00 5.1s 100.0%

loghunt

https://github.com/iafisher/llm-evals/tree/master/loghunt

In loghunt, the LLM looks through 1,000 lines of webserver logs to find evidence of suspicious activity.

ChatGPT wrote the Python script that generated the log file.
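
I haven't included that script here, but its shape is roughly the following (a hand-written sketch, not the actual generator):

```python
import random
from datetime import datetime, timedelta

# Sketch of a log generator: mostly ordinary traffic, with a handful of
# suspicious lines (path traversal, repeated failed logins, SQL injection)
# mixed in for the model to spot.
NORMAL_PATHS = ["/", "/about", "/blog/post-1", "/static/app.css"]
SUSPICIOUS = [
    ("/../../etc/passwd", 404),      # path traversal probe
    ("/admin/login", 401),           # repeated failed logins
    ("/search?q=' OR 1=1 --", 500),  # SQL injection attempt
]

def generate_log(n_lines: int = 1000) -> list[str]:
    start = datetime(2025, 6, 1, 12, 0, 0)
    lines = []
    for i in range(n_lines):
        ts = (start + timedelta(seconds=i)).strftime("%d/%b/%Y:%H:%M:%S")
        if random.random() < 0.01:
            path, status = random.choice(SUSPICIOUS)
        else:
            path, status = random.choice(NORMAL_PATHS), 200
        lines.append(
            f'192.0.2.{random.randint(1, 254)} - - [{ts}] '
            f'"GET {path} HTTP/1.1" {status} {random.randint(200, 5000)}'
        )
    return lines

print("\n".join(generate_log()))
```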

Model Best Worst Mean Mean duration Malformed
anthropic-claude-sonnet-4-20250514 7 / 7 7 / 7 7.00 11.7s 0.0%
anthropic-claude-opus-4-20250514 7 / 7 7 / 7 7.00 18.6s 0.0%
openai-gpt-4o 7 / 7 7 / 7 7.00 19.2s 0.0%
google-gemini-2.5-pro 7 / 7 7 / 7 7.00 24.4s 0.0%
openai-gpt-4.1-mini 7 / 7 5 / 7 6.00 16.4s 0.0%
google-gemini-2.5-flash 7 / 7 0 / 7 4.67 16.0s 33.3%
anthropic-claude-3-5-haiku-20241022 5 / 7 0 / 7 1.67 15.2s 66.7%
google-gemini-2.5-flash-lite-preview-06-17 1 / 7 1 / 7 1.00 3.3s 0.0%
openai-gpt-4.1-nano 1 / 7 0 / 7 0.67 1.9s 33.3%

mdparse

https://github.com/iafisher/llm-evals/tree/master/mdparse

mdparse is a coding eval – write a moderately complex Python function to parse Markdown. I included a lot of specifics about how to deal with edge cases (whitespace preservation, errant title headers, etc.) to give the LLM a fair chance to pass the full test suite. The test suite itself was not given to the model.

ChatGPT wrote most of the test cases.
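
To give a flavor of the edge cases, a test might look something like this (my own illustration; the function signature and return format are assumptions, and the real suite is in the repo):

```python
# Illustrative only (the real suite is in the repo). This assumes the eval
# asks for a function parse_markdown(text) that returns (kind, content)
# blocks; the signature and return format are my guesses, not the real spec.
def test_whitespace_preserved_in_code_block():
    blocks = parse_markdown("    x = 1\n        y = 2\n")
    # Relative indentation inside the code block must survive parsing.
    assert blocks == [("code", "x = 1\n    y = 2\n")]
```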

Model Best Worst Mean Mean duration Malformed
google-gemini-2.5-flash-lite-preview-06-17 13 / 13 7 / 13 10.67 7.1s 0.0%
openai-gpt-4o 12 / 13 10 / 13 11.00 8.2s 0.0%
anthropic-claude-opus-4-20250514 12 / 13 9 / 13 10.00 19.6s 0.0%
google-gemini-2.5-pro 12 / 13 1 / 13 8.33 30.7s 0.0%
openai-gpt-4.1-nano 12 / 13 0 / 13 8.00 4.7s 0.0%
google-gemini-2.5-flash 12 / 13 0 / 13 5.33 15.5s 0.0%
openai-gpt-4.1-mini 11 / 13 9 / 13 10.33 21.8s 0.0%
anthropic-claude-sonnet-4-20250514 11 / 13 8 / 13 9.33 17.6s 0.0%
anthropic-claude-3-5-haiku-20241022 8 / 13 4 / 13 5.67 13.6s 0.0%

proofread

https://github.com/iafisher/llm-evals/tree/master/proofread

proofread tests the model's facility with English prose: the LLM reads a draft of one of my blog posts in which I introduced a few intentional errors.

Model Best Worst Mean Mean duration Malformed
anthropic-claude-sonnet-4-20250514 5 / 5 5 / 5 5.00 5.4s 0.0%
google-gemini-2.5-pro 5 / 5 5 / 5 5.00 15.3s 0.0%
google-gemini-2.5-flash 5 / 5 3 / 5 4.00 1.4s 0.0%
google-gemini-2.5-flash-lite-preview-06-17 5 / 5 2 / 5 4.00 3.5s 0.0%
openai-gpt-4o 4 / 5 3 / 5 3.67 1.9s 0.0%
anthropic-claude-opus-4-20250514 3 / 5 3 / 5 3.00 11.7s 0.0%
openai-gpt-4.1-nano 2 / 5 2 / 5 2.00 0.7s 0.0%
openai-gpt-4.1-mini 2 / 5 0 / 5 1.33 0.9s 33.3%
anthropic-claude-3-5-haiku-20241022 0 / 5 0 / 5 0.00 2.4s 100.0%

sqlquery

https://github.com/iafisher/llm-evals/tree/master/sqlquery

sqlquery gives the LLM a SQL schema and asks it to write a query from a detailed prose description. The eval runs it against an in-memory SQLite database populated with test data.

ChatGPT helped me with the prompt, the solution, and the test data, so perhaps it's not surprising that most of the models crushed this benchmark.
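
The harness itself is easy to picture; something along these lines, with a made-up schema and query standing in for the real ones:

```python
import sqlite3

# Sketch of the check: run the model's query against an in-memory SQLite
# database seeded with test data, and compare rows against the known answer.
SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);"
ROWS = [(1, "alice", 40.0), (2, "bob", 15.0), (3, "alice", 25.0)]

def run_query(model_sql: str):
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)
    return conn.execute(model_sql).fetchall()

expected = [("alice", 65.0)]
got = run_query("SELECT customer, SUM(total) FROM orders "
                "GROUP BY customer HAVING SUM(total) > 50")
assert got == expected
```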

Model Best Worst Mean Mean duration Malformed
google-gemini-2.5-flash-lite-preview-06-17 6 / 6 6 / 6 6.00 3.7s 0.0%
openai-gpt-4o 6 / 6 6 / 6 6.00 6.6s 0.0%
openai-gpt-4.1-mini 6 / 6 6 / 6 6.00 7.3s 0.0%
anthropic-claude-sonnet-4-20250514 6 / 6 6 / 6 6.00 8.0s 0.0%
google-gemini-2.5-flash 6 / 6 6 / 6 6.00 8.6s 0.0%
anthropic-claude-opus-4-20250514 6 / 6 6 / 6 6.00 13.1s 0.0%
google-gemini-2.5-pro 6 / 6 6 / 6 6.00 22.0s 0.0%
openai-gpt-4.1-nano 3 / 6 3 / 6 3.00 1.5s 0.0%
anthropic-claude-3-5-haiku-20241022 3 / 6 3 / 6 3.00 5.9s 0.0%

Technical details

Each eval was run 3 times for each model. I set temperature to 0.7 and max tokens to 4,000 for all models. For Gemini, I set the thinking budget to 2,000 tokens; with the default 'dynamic' thinking budget, it would frequently use up all its tokens thinking and stop in the middle of the final output.
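
For reference, those settings map onto the google-genai Python SDK roughly like this (a minimal sketch, assuming that SDK; the other providers' clients take the equivalent parameters):

```python
from google import genai
from google.genai import types

# Minimal sketch, assuming the google-genai SDK: temperature 0.7,
# max output tokens 4,000, and a fixed 2,000-token thinking budget.
client = genai.Client()  # picks up GEMINI_API_KEY from the environment
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="...eval prompt here...",
    config=types.GenerateContentConfig(
        temperature=0.7,
        max_output_tokens=4000,
        thinking_config=types.ThinkingConfig(thinking_budget=2000),
    ),
)
print(response.text)
```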

'Malformed' responses could not be parsed to extract the actual solution (e.g., last line was not JSON, or JSON was invalid). The system prompts always give clear instructions on how to format the output, but even the best models sometimes stray. I tried to write the scoring scripts to be liberal in what they accepted.
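
Concretely, the extraction step boils down to something like this (a simplified sketch, not the exact scoring code):

```python
import json

def extract_answer(response: str):
    """Pull the JSON answer out of a model response, tolerating extra prose
    and markdown code fences; return None if nothing parseable is found."""
    # Walk backwards so trailing explanations after the JSON don't matter.
    for line in reversed(response.strip().splitlines()):
        candidate = line.strip().strip("`").strip()
        if not candidate.startswith(("{", "[")):
            continue
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None  # counted as a malformed response
```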

Only mdparse and sqlquery involved actually executing LLM-written code, but I ran all evals with sandbox-exec (macOS) to be safe.
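
That looks roughly like the sketch below; the profile and file name are illustrative, and a real profile usually needs a few more allowances depending on what the evaluated code actually does:

```python
import subprocess

# Illustrative macOS sandbox profile: deny by default, allow read-only file
# access and process execution, confine writes to /private/tmp. A real
# profile typically needs additional allowances for the interpreter to start.
PROFILE = """
(version 1)
(deny default)
(allow process-exec)
(allow process-fork)
(allow file-read*)
(allow file-write* (subpath "/private/tmp"))
(allow sysctl-read)
(allow mach-lookup)
"""

# "model_solution.py" stands in for the file containing LLM-written code.
subprocess.run(
    ["sandbox-exec", "-p", PROFILE, "python3", "model_solution.py"],
    check=True,
)
```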