Notes on epoll and io_uring
epoll
epoll is a family of Linux syscalls for scalable I/O event notification. It's most useful for writing network servers – unlike with blocking I/O, high performance can be achieved in single-threaded programs.
The basic structure of an epoll loop (error handling omitted):
// create the epoll instance
int epollfd = epoll_create1(0);
// register file descriptors of interest
struct epoll_event ev = {.events = EPOLLIN, .data.fd = fd};
epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
// main event loop
struct epoll_event events[MAX_EVENTS];
while (1) {
    int count = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < count; i++) {
        struct epoll_event event = events[i];
        int fd = event.data.fd;
        // check fd to decide what to do (e.g., call `read`)
    }
}
epoll just notifies you when a file descriptor is ready to be read from or written to; you still have to do the actual read or write call yourself.
Simple example of a TCP echo server here.
A few subtler points:
- If you intend to handle signals via the self-pipe trick, make sure to check if epoll_wait returns EINTR and continue the loop if so.
- When a file descriptor is ready, you should loop over it, calling read (or accept, or whatever) until it returns EAGAIN or EWOULDBLOCK (which requires you to set it to non-blocking to begin with). Otherwise, if you had, say, 10 queued connections, you'd end up calling epoll_wait 10 times instead of once.
- For listening on sockets, it's a good idea to pass in EPOLLRDHUP and EPOLLHUP along with EPOLLIN.
Further reading
- Epoll is fundamentally broken 1/2 (2017) – Using epoll safely when multithreading or multiprocessing is hard.
io_uring
io_uring is a Linux kernel interface for making asynchronous syscalls. It aims to be performant: ring buffers shared between userspace and the kernel reduce syscall overhead, and multiple syscall requests can be dispatched at once.
Unlike epoll, the kernel will actually do the syscall for you, instead of just notifying you when the operation is ready to be done.
At the syscall level, you call io_uring_setup and then mmap to set up the data structures, and io_uring_enter to submit requests to the kernel. But usually you use liburing rather than the low-level syscalls, unless you want to deal with the ring buffers and memory barriers yourself.
If you do want to deal with the ring buffers and memory barriers yourself, here are a few points of confusion that I struggled over:
- ring_entries is the size of the buffer; ring_mask is always equal to ring_entries - 1. This is so that you can efficiently compute an index that may have overflowed the end of the buffer with index & ring_mask, which is equivalent to index % ring_entries (assuming that ring_entries is a power of 2) but much cheaper in hardware.
- How is it possible that (on newer kernels) you can map both the submission queue and the completion queue in one call to mmap, passing in max(sring_sz, cring_sz) instead of sring_sz + cring_sz? The answer is that you do not access fields on the queues through hard-coded offsets, but rather through the offsets specified in the sq_off and cq_off structs, which are populated dynamically by io_uring_setup. sring_sz and cring_sz are calculated using these offsets. You can imagine that the mmap lays out the submission queue followed by the completion queue in one big chunk of memory (in reality it seems that the fields of the two queues are interleaved). So the kernel just populates the offsets structs with the correct offsets and everything works transparently.
- Why are memory barriers necessary? Because userspace and the kernel could be accessing the buffer simultaneously, and you need to ensure that the compiler and CPU don't reorder memory reads or writes in an invalid fashion. Why don't we need locks? Well, if you have multiple threads accessing the queues from userspace, you might need locks. Otherwise, though, userspace and the kernel are never writing to the same queue – only userspace writes to the submission queue and only the kernel writes to the completion queue. Publishing to a queue can be done atomically by incrementing an aligned integer, so no additional synchronization is needed. (Before incrementing, you need to prepare the submission queue entry, which takes multiple instructions and isn't atomic, but that's only a problem if you have multiple userspace threads writing to the queue.)
Further reading
Reference:
- "Lord of the io_uring" (Shuveb Hussain, 2020) – site with detailed guides and examples
- "Io uring" (Nick Black)
LWN coverage:
- "Ringing in a new asynchronous I/O API" (Jonathan Corbet, 2019)
- "The rapid growth of io_uring" (Jonathan Corbet, 2020)
- "Introducing io_uring_spawn" (Jake Edge, 2022)
- "A generic ring buffer for the kernel" (Jonathan Corbet, 2024) – not specific to io_uring but a good overview of ring buffers
Case studies:
- "lsr: ls but with io_uring" (Tim Culverhouse, 2025)
- "io_uring, kTLS and Rust for zero syscall HTTPS server" (Thomas Habets, 2025)
- "io_uring new uSockets backend" (2023) – The µWebSockets C++ project saw significant performance gains from switching from epoll to io_uring.
- "A Programmer-Friendly I/O Abstraction Over io_uring and kqueue" (King Butcher and Phil Eaton @ TigerBeetle, 2022) – a Zig I/O dispatching library built on top of io_uring