Ian Fisher

Filter duplicates in SQL, simplify your code

I have a daily job that imports articles to read from a few sources: Chrome unread bookmarks, RSS feeds, etc.

I don't ever want to import the same URL twice.

You can do this in Python:

# Load the set of URLs already in the database, then drop any
# bookmark whose URL has been seen before.
existing_urls = fetch_existing_urls()
bookmarks = [
    bookmark for bookmark in bookmarks if bookmark.url not in existing_urls
]
insert_bookmarks(bookmarks)

But it can be done natively in SQL: create a UNIQUE index on the url column and use ON CONFLICT in your INSERT statement:

INSERT INTO bookmarks(...) VALUES(...)
ON CONFLICT (url) DO NOTHING
RETURNING url
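The whole pattern can be tried end to end with SQLite, which also supports ON CONFLICT and RETURNING (as of version 3.35). This is a sketch, not the real job's code: the table, index name, and URLs are invented for illustration, and the actual import uses Postgres.

```python
import sqlite3

# In-memory database standing in for the real Postgres one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (url TEXT)")
# The UNIQUE index is what makes ON CONFLICT (url) work.
conn.execute("CREATE UNIQUE INDEX bookmarks_url_idx ON bookmarks (url)")

# Pretend one URL was imported on a previous run.
conn.execute("INSERT INTO bookmarks (url) VALUES ('https://example.com/a')")

# Re-import a duplicate alongside a new URL: the duplicate is silently
# skipped, and RETURNING reports only the row that was actually inserted.
inserted = conn.execute(
    "INSERT INTO bookmarks (url) VALUES (?), (?) "
    "ON CONFLICT (url) DO NOTHING RETURNING url",
    ["https://example.com/a", "https://example.com/b"],
).fetchall()
print(inserted)
```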

The RETURNING clause lets you retrieve the rows that were actually inserted, albeit a little awkwardly (using psycopg):

cursor.executemany("INSERT ...", params, returning=True)
results = []
while True:
    r = cursor.fetchone()
    if r is not None:
        results.append(r)  # r is None when insertion was skipped due to a conflict
    if not cursor.nextset():
        break

Doing it in SQL instead of Python has a few advantages: