Notes on epoll and io_uring
epoll
epoll is a family of Linux syscalls for scalable I/O event notification. It's most useful for writing network servers – unlike with blocking I/O, high performance can be achieved in single-threaded programs.
The basic structure of an epoll loop (error handling omitted):
// create the epoll instance
int epollfd = epoll_create1(0);
// register file descriptors of interest
struct epoll_event ev = {.events = EPOLLIN, .data.fd = fd};
epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
// main event loop
struct epoll_event events[MAX_EVENTS];
while (1) {
    int count = epoll_wait(epollfd, events, MAX_EVENTS, -1);
    for (int i = 0; i < count; i++) {
        struct epoll_event event = events[i];
        int fd = event.data.fd;
        // check fd to decide what to do (e.g., call `read`)
    }
}
epoll just notifies you when a file descriptor is ready to be read from or written to; you still have to do the actual read or write call yourself.
Simple example of a TCP echo server here.
A few subtler points:
- If you intend to handle signals via the self-pipe trick, make sure to check if epoll_wait returns EINTR and continue the loop if so.
- When a file descriptor is ready, you should loop over it, calling read (or accept, or whatever) until it returns EAGAIN or EWOULDBLOCK (which requires you to set it to non-blocking to begin with). Otherwise, if you had, say, 10 queued connections, you'd end up calling epoll_wait 10 times instead of once.
- For listening on sockets, it's a good idea to pass in EPOLLRDHUP and EPOLLHUP along with EPOLLIN.
Further reading
- Epoll is fundamentally broken 1/2 (2017) – Using epoll safely when multithreading or multiprocessing is hard.
io_uring
io_uring is a Linux kernel interface for making asynchronous syscalls. It aims to be performant: ring buffers shared between userspace and the kernel reduce syscall overhead, and multiple syscall requests can be dispatched at once.
Unlike epoll, the kernel will actually do the syscall for you, instead of just notifying you when the operation is ready to be done.
At the syscall level, you call io_uring_setup and then mmap to set up the data structures, and io_uring_enter to submit requests to the kernel. But usually you use liburing rather than the low-level syscalls, unless you want to deal with the ring buffers and memory barriers yourself.
If you do want to deal with the ring buffers and memory barriers yourself, here are a few points of confusion that I struggled over:
- ring_entries is the size of the buffer; ring_mask is always equal to ring_entries - 1. This is so that you can efficiently compute an index that may have overflowed the end of the buffer with index & ring_mask, which is equivalent to index % ring_entries (assuming that ring_entries is a power of 2) but much cheaper in hardware.
- How is it possible that (on newer kernels) you can map both the submission queue and the completion queue in one call to mmap, passing in max(sring_sz, cring_sz) instead of sring_sz + cring_sz? The answer is that you do not access fields on the queues through hard-coded offsets, but rather through the offsets specified in the sq_off and cq_off structs, which are populated dynamically by io_uring_setup. sring_sz and cring_sz are calculated using these offsets. You can imagine that the mmap lays out the submission queue followed by the completion queue in one big chunk of memory (in reality it seems that the fields of the two queues are interleaved). So the kernel just populates the offsets structs with the correct offsets and everything works transparently.
- Why are memory barriers necessary? Because userspace and the kernel could be accessing the buffer simultaneously, and you need to ensure that the compiler and CPU don't reorder memory reads or writes in an invalid fashion. Why don't we need locks? Well, if you have multiple threads accessing the queues from userspace, you might need locks. Otherwise, though, userspace and the kernel are never writing to the same queue – only userspace writes to the submission queue and only the kernel writes to the completion queue. Publishing to a queue can be done atomically by incrementing an aligned integer, so no additional synchronization is needed. (Before incrementing, you need to prepare the submission queue entry, which takes multiple instructions and isn't atomic, but that's only a problem if you have multiple userspace threads writing to the queue.)
Further reading
Reference:
- "Lord of the io_uring" (Shuveb Hussain, 2020) – site with detailed guides and examples
- "Io uring" (Nick Black)
LWN coverage:
- "Ringing in a new asynchronous I/O API" (Jonathan Corbet, 2019)
- "The rapid growth of io_uring" (Jonathan Corbet, 2020)
- "Introducing io_uring_spawn" (Jake Edge, 2022)
- "A generic ring buffer for the kernel" (Jonathan Corbet, 2024) – not specific to io_uring but a good overview of ring buffers
Case studies:
- "lsr: ls but with io_uring" (Tim Culverhouse, 2025)
- "io_uring, kTLS and Rust for zero syscall HTTPS server" (Thomas Habets, 2025)
- "io_uring new uSockets backend" (2023) – The µWebSockets C++ project saw significant performance gains from switching from epoll to io_uring.
- "A Programmer-Friendly I/O Abstraction Over io_uring and kqueue" (King Butcher and Phil Eaton @ TigerBeetle, 2022) – a Zig I/O dispatching library built on top of io_uring