Filter duplicates in SQL, simplify your code
I have a daily job that imports articles to read from a few sources: Chrome unread bookmarks, RSS feeds, etc.
I don't ever want to import the same URL twice.
You can do this in Python:

```python
existing_urls = fetch_existing_urls()
bookmarks = [
    bookmark for bookmark in bookmarks if bookmark.url not in existing_urls
]
insert_bookmarks(bookmarks)
```
But it can be done natively in SQL, by creating a `UNIQUE` index on the `url` column and using `ON CONFLICT` in your `INSERT` statement:
```sql
INSERT INTO bookmarks(...) VALUES(...)
ON CONFLICT (url) DO NOTHING
RETURNING url
```
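To see the deduplication in action end to end, here is a minimal sketch using Python's built-in `sqlite3` (SQLite supports the same `ON CONFLICT ... DO NOTHING` syntax as Postgres); the table, index, and URLs are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (url TEXT)")
# The UNIQUE index is what makes ON CONFLICT (url) possible.
conn.execute("CREATE UNIQUE INDEX bookmarks_url_idx ON bookmarks (url)")

urls = [
    ("https://example.com/a",),
    ("https://example.com/b",),
    ("https://example.com/a",),  # duplicate of the first row
]
conn.executemany(
    "INSERT INTO bookmarks (url) VALUES (?) ON CONFLICT (url) DO NOTHING",
    urls,
)
# The duplicate was silently skipped: only two rows survive.
print(conn.execute("SELECT COUNT(*) FROM bookmarks").fetchone()[0])  # → 2
```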
The `RETURNING` clause allows you to retrieve the rows that were inserted, albeit a little awkwardly (using psycopg):
```python
cursor.executemany("INSERT ...", params, returning=True)
results = []
while True:
    r = cursor.fetchone()
    if r is not None:  # None means the insert was skipped due to conflict
        results.append(r)
    if not cursor.nextset():
        break
```

(Note the fetch happens before the call to `nextset()`: the cursor starts positioned on the first result set, so calling `nextset()` first would drop the first row.)
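If your driver or database doesn't support `RETURNING` with batch inserts, one portable fallback is to insert row by row and check whether each statement actually changed anything. This is a sketch using stdlib `sqlite3`; the `insert_new_urls` helper and the URLs are invented for illustration:

```python
import sqlite3

def insert_new_urls(conn, urls):
    """Insert urls one at a time; return only those actually inserted."""
    inserted = []
    cur = conn.cursor()
    for url in urls:
        cur.execute(
            "INSERT INTO bookmarks (url) VALUES (?) ON CONFLICT (url) DO NOTHING",
            (url,),
        )
        if cur.rowcount == 1:  # 0 means the conflict clause skipped the row
            inserted.append(url)
    return inserted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookmarks (url TEXT UNIQUE)")
conn.execute("INSERT INTO bookmarks (url) VALUES ('https://example.com/a')")

new = insert_new_urls(conn, ["https://example.com/a", "https://example.com/b"])
print(new)  # → ['https://example.com/b']
```

Per-row inserts are slower than a single batch, so prefer `RETURNING` when it's available.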
Doing it in SQL instead of Python has a few advantages:
- The database guarantees you will never insert duplicates.
- You don't have to write `fetch_existing_urls` and remember to call it everywhere you insert rows.
- You do one efficient batch insertion and avoid a separate `SELECT` query.