

# LLM impressions
More systematic investigations at ["My personal LLM evals"](/notes/2025-07-my-personal-llm-evals).

## Jun 2025: Impressive: Extracting information from a webpage
- Model: **ChatGPT o3**

I asked ChatGPT to extract a list of Israeli settlement names from [List of Israeli settlements](https://en.wikipedia.org/wiki/List_of_Israeli_settlements) (providing only the URL) and format it as a Django ORM query.
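The requested output shape looked roughly like the following; the model class and field name here are my own guesses at what such a query would look like, not ChatGPT's exact output, and the list is truncated:

```python
# Hypothetical Django model (assumed): class Settlement(models.Model)
# with a CharField named `name`. List truncated for illustration.
Settlement.objects.bulk_create([
    Settlement(name="Ariel"),
    Settlement(name="Ma'ale Adumim"),
    # ...one entry per settlement from the article
])
```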

## Jun 2025: Poor: Writing a script to use the Spotify API
- Model: **ChatGPT o4-mini-high**

I asked ChatGPT to write a script to fetch a Spotify playlist and print the title and artist of each track. The script worked for the first playlist I tried, but failed for a different playlist, and despite many rounds of back-and-forth and increasingly complicated fixes, it was unable to get the script working.
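For reference, the formatting half of the task is straightforward given the documented response shape of the Spotify Web API's playlist-tracks endpoint (`GET /v1/playlists/{id}/tracks`); this sketch covers only that half, with authentication and pagination omitted:

```python
# Sketch of the formatting step, assuming one page of results from
# Spotify's playlist-tracks endpoint (GET /v1/playlists/{id}/tracks).
# Auth and pagination, where scripts like this usually go wrong, are omitted.

def format_tracks(payload: dict) -> list[str]:
    """Return 'Title - Artist1, Artist2' lines from one page of results."""
    lines = []
    for item in payload.get("items", []):
        track = item.get("track")
        if track is None:  # removed or unavailable tracks come back as null
            continue
        artists = ", ".join(a["name"] for a in track.get("artists", []))
        lines.append(f"{track['name']} - {artists}")
    return lines
```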

## Jun 2025: Impressive: Writing a tricky function
- Model: **ChatGPT 4o**

I gave ChatGPT a fully implemented but incorrect `get_backups_to_keep` function, whose comment described the backup policy (e.g., keep the last 3 days, the first day of each of the last 3 months, etc.). ChatGPT modified the implementation to match the description, without my having to explain how the existing implementation was wrong. I then gave it a test case and told it to make the function pass. It modified the function in the right way, though a small off-by-one error meant the test still didn't pass.

Finally, I told it:

> Write a test case that steps through each day of the year, calls get_backups_to_keep, and ensures that the only file in the difference `today - yesterday` is today's file.

The test case was basically correct, again except for a small error with the initial condition that was trivial to fix.

I was especially impressed because:

- This was just in a chat session and the LLM had no ability to run code.
- I was using my own `timehelper.last_month` API and I didn't provide the implementation.
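A policy function of this kind can be sketched as follows; the retention rules here are my illustrative reconstruction, not the original code, which relied on the private `timehelper` module:

```python
import datetime

def get_backups_to_keep(dates, today):
    """Keep the last 3 backups plus the first backup of each of the
    last 3 calendar months (an illustrative policy, not the original)."""
    dates = sorted(d for d in dates if d <= today)
    keep = set(dates[-3:])  # the 3 most recent backups
    # Walk back through the 3 most recent months, including the current one.
    y, m = today.year, today.month
    for _ in range(3):
        in_month = [d for d in dates if (d.year, d.month) == (y, m)]
        if in_month:
            keep.add(in_month[0])  # first backup of that month
        m -= 1
        if m == 0:
            y, m = y - 1, 12
    return sorted(keep)
```

The day-stepping test I asked for amounts to calling this for each consecutive `today` and checking that the kept set gains exactly one new file per day.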

## Jun 2025: Good: Setting up a database
- Model: **ChatGPT 4o**

I asked ChatGPT to help me set up a Postgres database and connect to it over a Tailscale network. It gave me correct commands, and whenever I ran into an issue it was able to help me resolve it.

## Jun 2025: Poor: Finding a webpage
- Model: **ChatGPT o3**

Query: "Find me a link to a recent summer lookbook for men from J. Crew"

It gave me two broken links to the J. Crew website, and a third link to an external blog which was a real page but for women's fashion.

## May 2025: Mixed: Trip planning
- Model: **ChatGPT o3**

I asked ChatGPT various questions about planning a trip, e.g. "Put together an itinerary for one day in X doing Y and Z activities."

## Apr 2025: Poor: Writing an Obsidian plugin
- Model: **ChatGPT 4o**

Query: "write an obsidian plugin that lets me register shell commands to run inside my Obsidian vault via an option in the command palette menu"

It understood the prompt but generated code that was broken in a few ways. It was somewhat helpful in finding the right APIs in Obsidian's TypeScript interface, but the code needed non-trivial edits before it worked.

## Sep 2024: Impressive: 401(k) analysis
- Model: **ChatGPT o1-preview**

I asked it to analyze whether a Roth or traditional 401(k) would be better, given some input parameters. It gave (as best as I can tell) a correct analysis, and included nuances like the investment growth of tax money saved (and the capital gains tax on that growth), and the fact that employer contributions are always pre-tax even for a Roth account. It also gave an alternative analysis for a different tax rate.

*However*, in its explanation of the formula it did the math wrong: it said `1.07^43` was `13.286` when really it is `18.344`.
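The comparison itself fits in a few lines, and the arithmetic is easy to check; every parameter below is made up for illustration, not taken from the original chat:

```python
YEARS = 43
GROWTH = 1.07 ** YEARS  # ~18.344; this is the figure the model misstated

# Assumed parameters (illustrative only).
LIMIT = 23_000      # same nominal contribution to either account
TAX_NOW = 0.32      # marginal income tax rate today
TAX_LATER = 0.24    # marginal income tax rate in retirement
CAP_GAINS = 0.15    # capital gains rate on a taxable side account

# Traditional: contribute pre-tax, pay income tax on withdrawal...
traditional_401k = LIMIT * GROWTH * (1 - TAX_LATER)
# ...plus the income tax not paid today, invested in a taxable account
# (simplified: gains taxed once, at withdrawal).
side = LIMIT * TAX_NOW
side_after_tax = side + (side * GROWTH - side) * (1 - CAP_GAINS)
traditional_total = traditional_401k + side_after_tax

# Roth: tax already paid before contributing; withdrawals are tax-free.
roth_total = LIMIT * GROWTH
```

With these assumed rates (lower in retirement than today), traditional comes out ahead; flip the two income tax rates and Roth wins. The employer-match nuance is not modeled here.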

## Sep 2024: Poor: Elementary math error
- Model: **ChatGPT 4o mini**

I asked it `1.07^43` and it gave a completely wrong answer and insisted it was right.

## Aug 2024: Poor: Fixing a bug
- Model: **ChatGPT 4o mini**

I asked it to fix a real bug that I couldn't figure out in a Python program (having to do with setting a global variable).

It suggested adding a line which already existed in my code.

## July 2024: Fair: Pass a test suite
- Model: **ChatGPT 4o mini**

I gave it a set of Go function signatures and a test case to pass, without otherwise describing what the code should do.

It produced reasonable code, but it didn't pass the test case.

## June 2024: Poor: Chinese grammar
- Model: **ChatGPT 4o**

Query: "Does 多 go before or after the measure word in Chinese?"

Its response was self-contradictory: it said it goes after the measure word but gave an example where it comes before the measure word.

## May 2024: Fair: Poetry analysis
- Model: **ChatGPT 4o**

Query: "Explain the syntax of 'Duns Scotus's Oxford' by Gerard Manley Hopkins" (preceded by the text of the poem)

The response was well-structured, and it picked up on some subtle things like the allusion to the Virgin Mary, but it also made some mistakes and left some of the trickier syntax unexplained; in follow-up, it didn't give a convincing explanation of the phrase "rural rural keeping" from the poem.

Query: "Explain the syntax of these lines from a poem by Yeats" (the poem was "The Rose of the World")

It got a lot of it right but I think it made a mistake.

## May 2024: Good: Practice problems
- Model: **ChatGPT 3.5**

Query: "Give me some practice problems for computing the dot product of two vectors."

It generated valid if somewhat repetitive problems, and gave me the correct answers when I asked.
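The kind of problem and answer involved, with example vectors of my own choosing rather than ChatGPT's:

```python
# A representative practice problem: compute the dot product of two vectors.
a = [2, -1, 3]
b = [4, 0, -2]
dot = sum(x * y for x, y in zip(a, b))  # 2*4 + (-1)*0 + 3*(-2) = 2
```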

## May 2024: Good: Finding a bug
- Model: **ChatGPT 3.5**

I gave it a TypeScript function that I knew had [a bug](https://github.com/iafisher/cityquiz/commit/641d9700a75ecf282186b829d841e193f119eac2) in it, and it found it.

## May 2024: Good: Google Sheets formula
- Model: **ChatGPT 3.5**

Query: "sum up a column in Google Sheets where an adjacent column has a certain label"
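For reference, the standard formula for this, assuming the labels are in column A and the values in column B:

```
=SUMIF(A:A, "Label", B:B)
```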

## May 2024: Fair: Defining terms
- Model: **ChatGPT 3.5**

Query: "Explain the difference between gradient descent and backpropagation in machine learning." (2024-05-04)

It defined the two terms but didn't actually explain the difference between them.

## May 2024: Good: Logical reasoning
- Model: **ChatGPT 3.5**

Query: "Is it possible that Sir Robert Peel, the British PM, ever met Daniel Defoe, the writer?"

It correctly stated that their lifetimes did not overlap.

However, in follow-up questions it incorrectly stated that it was physically possible (though unlikely) for them to have corresponded by letter, even though Defoe died in 1731, decades before Peel was born.

## May 2024: Poor: Conceptual questions
- Model: **ChatGPT 3.5**

Query: "Why is it that market makers are often prop firms and not hedge funds?"

The explanation was vague and not really logical: it said market making requires "significant capital" but didn't explain why that is an advantage for prop firms over hedge funds.

## May 2024: Good: Color picking
- Model: **ChatGPT 3.5**

Query: "Give me some colors that would complement `#a2ebca` nicely"

I don't have enough design sense to evaluate its choices, but it understood the query ("a light aqua shade") and gave reasonable results.
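One way to sanity-check suggestions like this programmatically is to rotate the hue 180° in HSV space; this is a generic technique for finding a complement, not what ChatGPT did:

```python
import colorsys

def hue_complement(hex_color: str) -> str:
    """Rotate a #rrggbb color's hue by 180 degrees, keeping saturation and value."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    r2, g2, b2 = colorsys.hsv_to_rgb((h + 0.5) % 1.0, s, v)
    return "#{:02x}{:02x}{:02x}".format(
        round(r2 * 255), round(g2 * 255), round(b2 * 255)
    )

# For the light aqua #a2ebca, the hue opposite is a light pink.
```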