Why is epoll faster than select?
I have seen a lot of comparisons which says select have to walk through the fd list, and this is slow. But why doesn't epoll have to do this?
There's a lot of misinformation about this, but the real reason is this:
A typical server might be dealing with, say, 200 connections. It will service every connection that needs to have data written or read and then it will need to wait until there's more work to do. While it's waiting, it needs to be interrupted if data is received on any of those 200 connections.
With select
, the kernel has to add the process to 200 wait lists, one for each connection. To do this, it needs a "thunk" to attach the process to the wait list. When the process finally does wake up, it needs to be removed from all 200 wait lists and all those thunks need to be freed.
By contrast, with epoll
, the epoll
socket itself has a wait list. The process needs to be put on only that one wait list using only one thunk. When the process wakes up, it needs to be removed from only one wait list and only one thunk needs to be freed.
To be clear, with epoll
, the epoll
socket itself has to be attached to each of those 200 connections. But this is done once, for each connection, when it is accepted in the first place. And this is torn down once, for each connection, when it is removed. By contrast, each call to select
that blocks must add the process to every wait queue for every socket being monitored.
Ironically, with select
, the largest cost comes from checking if sockets that have had no activity have had any activity. With epoll
, there is no need to check sockets that have had no activity because if they did have activity, they would have informed the epoll
socket when that activity happened. In a sense, select
polls each socket each time you call select
to see if there's any activity while epoll
rigs it so that the socket activity itself notifies the process.
The main difference between epoll
and select
is that in select()
the list of file descriptors to wait on only exists for the duration of a single select()
call, and the calling task only stays on the sockets' wait queues for the duration of a single call. In epoll
, on the other hand, you create a single file descriptor that aggregates events from multiple other file descriptors you want to wait on, and so the list of monitored fd's is long-lasting, and tasks stay on socket wait queues across multiple system calls. Furthermore, since an epoll
fd can be shared across multiple tasks, it is no longer a single task on the wait queue, but a structure that itself contains another wait queue, containing all processes currently waiting on the epoll
fd. (In terms of implementation, this is abstracted over by the sockets' wait queues holding a function pointer and a void*
data pointer to pass to that function).
So, to explain the mechanics a little more:
- An
epoll
file descriptor has a privatestruct eventpoll
that keeps track of which fd's are attached to this fd.struct eventpoll
also has a wait queue that keeps track of all processes that are currentlyepoll_wait
ing on this fd.struct epoll
also has a list of all file descriptors that are currently available for reading or writing. - When you add a file descriptor to an
epoll
fd usingepoll_ctl()
,epoll
adds thestruct eventpoll
to that fd's wait queue. It also checks if the fd is currently ready for processing and adds it to the ready list, if so. - When you wait on an
epoll
fd usingepoll_wait
, the kernel first checks the ready list, and returns immediately if any file descriptors are already ready. If not, it adds itself to the single wait queue insidestruct eventpoll
, and goes to sleep. - When an event occurs on a socket that is being
epoll()
ed, it calls theepoll
callback, which adds the file descriptor to the ready list, and also wakes up any waiters that are currently waiting on thatstruct eventpoll
.
Obviously, a lot of careful locking is needed on struct eventpoll
and the various lists and wait queues, but that's an implementation detail.
The important thing to note is that at no point above there did I describe a step that loops over all file descriptors of interest. By being entirely event-based and by using a long-lasting set of fd's and a ready list, epoll can avoid ever taking O(n) time for an operation, where n is the number of file descriptors being monitored.