Learn about I/O polling in Linux

Goal 🏁

  1. Learn how epoll works: educate myself on epoll, poll, select, and related concepts.

  2. Create a minimalistic application which serves as an example of how Linux epoll can be used. A good example would be a simple TCP server built on socket(2): the server waits until there is some data to be read and then writes a response. It should be similar to the application presented by Sam Roberts (see #Sources).

Why should you care? 🤷‍♂️

I work a lot with Node.js. As you may or may not know, under the hood Node.js uses libuv to manage its asynchronous I/O operations. It's just one of those things that make Node.js tick. On the libuv homepage, we can find a list of libuv's key features. One of them is:

Full-featured event loop backed by epoll, kqueue, IOCP, event ports.

OK... so there is something like epoll, but what the heck are kqueue and IOCP???

libuv is a multi-platform support library with a focus on asynchronous I/O.

libuv supports many platforms. epoll itself is a Linux-only feature; BSDs use kqueue, and Windows uses IOCP.

The event loop follows the rather usual single threaded asynchronous I/O approach: all (network) I/O is performed on non-blocking sockets which are polled using the best mechanism available on the given platform: epoll on Linux, kqueue on OSX and other BSDs, event ports on SunOS and IOCP on Windows. As part of a loop iteration the loop will block waiting for I/O activity on sockets which have been added to the poller and callbacks will be fired indicating socket conditions (readable, writable hangup) so handles can read, write or perform the desired I/O operation.

👀 You can check it in the source code.

Or check out this simple example:

// print.js

async function waitForever() {
  while (true) {
    console.log("Linux is awesome...");
    await new Promise(resolve => {
      setTimeout(resolve, 5000);
    });
  }
}

waitForever();

Run this simple script with node print.js and you will see "Linux is awesome..." logged every five seconds. Now let's look at the system calls this program makes. Run the same script under strace(1):

$ strace node print.js
execve("/usr/bin/node", ["node", "print.js"], [/* 27 vars */]) = 0
# MORE SYSCALLS...
epoll_create1(EPOLL_CLOEXEC)            = 3
# More SYSCALLS...
open("/home/pi/Desktop/npm-test/print.js", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 12
fstat64(12, {st_mode=S_IFREG|0644, st_size=400, ...}) = 0
read(12, "async function waitForever() {\n "..., 400) = 400
close(12)                               = 0
write(9, "Linux is awesome...\n", 20Linux is awesome...
)   = 20
cacheflush(0x59b07680, 0x59b077a0, 0, 0x1, 0x7efc75d8) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=6, u64=42949672966}}) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=42949672968}}) = 0
epoll_pwait(3, [{EPOLLIN, {u32=8, u64=42949672968}}], 1024, 4996, NULL, 8) = 1
read(8, "\2\0\0\0\0\0\0\0", 1024)       = 8
epoll_pwait(3, [], 1024, 4996, NULL, 8) = 0
write(9, "Linux is awesome...\n", 20Linux is awesome...
)   = 20
# THEN EACH 5 SECONDS
epoll_pwait(3, [], 1024, 0, NULL, 8)    = 0
epoll_pwait(3, [], 1024, 5000, NULL, 8) = 0
write(9, "Linux is awesome...\n", 20Linux is awesome...
)   = 20

So if you wish to understand what makes the Node.js event loop work and you are a Linux enthusiast, you need to learn about epoll.

So what is epoll? 🤨

If you're looking for a one-sentence explanation, you can think of epoll as a Linux kernel API that lets you manage I/O operations based on an event notification mechanism.

It's difficult to understand how epoll works without first getting a good grasp of the idea of file descriptors.

A good way to begin the adventure with epoll is to simply read the docs. The manpage for epoll provides a lot of insight. (Yeah, I know. Obvious. Yet it still surprises people 🤷‍♂️)

There we learn that epoll can monitor multiple file descriptors to see if I/O operations are possible on any of them. Reading further, we find that:

The epoll API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors.

OK. So from the programmer's perspective, epoll can work as either an edge-triggered or a level-triggered interface. This will become important later on. We also get the information that it "scales well". Spoiler alert ⚠️ - when it comes to monitoring a large number of file descriptors, epoll performs much better than poll(2) or select(2).

API

In order to use the epoll API, remember to include the <sys/epoll.h> header.

epoll_create and epoll_create1

There are two functions that can "create" a new epoll instance, or as the manual says, "open an epoll file descriptor". When epoll_create or epoll_create1 is called, the kernel creates a new epoll instance - a special data structure inside the kernel.

The file descriptor to an epoll instance can be used to add, remove, or modify file descriptors that we want to monitor for I/O.

Since Linux 2.6.8 the size argument passed to epoll_create is ignored and the underlying data structure resizes dynamically. However, size still has to be greater than 0, for backward compatibility reasons. See epoll_create(2) for more details.

int epoll_create(int size);
int epoll_create1(int flags);

What is the difference between the two? epoll_create1 allows you to pass the EPOLL_CLOEXEC (close on exec) flag. It sets the close-on-exec flag on the new file descriptor, so the descriptor is automatically closed when the process calls exec(2) - typically in a child created with fork(2) - and is not leaked into the newly executed program.

On error, epoll_create will return -1 and set errno. The epoll_create(2) manpage enumerates all possible errno values. I'll mention only one - EMFILE, which means the per-process limit on the number of open file descriptors has been reached. There is also an epoll-specific per-user limit on the total number of file descriptors that can be registered across all epoll instances (exceeding it makes epoll_ctl fail with ENOSPC). You can check its value with

cat /proc/sys/fs/epoll/max_user_watches
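As a warm-up, here is a minimal, self-contained sketch of opening an epoll file descriptor with epoll_create1. The error handling style is illustrative, not taken from the repo's code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void) {
    /* Create an epoll instance; the returned value is a regular file descriptor. */
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1) {
        perror("epoll_create1");
        exit(EXIT_FAILURE);
    }
    printf("epoll instance created, fd = %d\n", epfd);

    /* ... epoll_ctl / epoll_wait calls would go here ... */

    close(epfd);
    return 0;
}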

epoll_ctl

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

The first argument epfd is the file descriptor of the epoll instance created with epoll_create.

Then in op (operation) we pass a flag which determines what we're trying to do: we can add (EPOLL_CTL_ADD), delete (EPOLL_CTL_DEL), or modify (EPOLL_CTL_MOD) file descriptors on epoll's interest list. Then we pass the actual file descriptor fd that we want to "observe".

Finally, we have to pass a pointer to an epoll_event structure. Following the man page, the epoll_event structure looks like this:

struct epoll_event {
    uint32_t events;          /* Epoll events */
    epoll_data_t data;        /* User data variable */
};

typedef union epoll_data {
    void        *ptr;
    int          fd;
    uint32_t     u32;
    uint64_t     u64;
} epoll_data_t;

The events field is, once again, a bit mask which defines what sort of events we are interested in (e.g. EPOLLIN for reads, EPOLLOUT for writes).
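Putting the pieces together, here is a hedged sketch of registering a file descriptor for "readable" events. Both epfd and sockfd are assumed to have been created earlier (e.g. with epoll_create1 and socket):

#include <stdio.h>
#include <sys/epoll.h>

/* Sketch: register `sockfd` on the epoll instance `epfd` for "ready to read" events. */
int watch_for_input(int epfd, int sockfd) {
    struct epoll_event ev;
    ev.events = EPOLLIN;    /* interested in "readable" */
    ev.data.fd = sockfd;    /* user data: remember which fd this event is about */

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev) == -1) {
        perror("epoll_ctl: EPOLL_CTL_ADD");
        return -1;
    }
    return 0;
}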

epoll_wait

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Here, again, we have to pass the file descriptor of the epoll instance. The second argument is an array of epoll_event structures; this is where the events available to the caller will be stored. maxevents specifies the maximum number of events that can be stored in *events. You can think of *events and maxevents as an array and its length.

The timeout is the number of milliseconds that epoll_wait will block. If timeout is set to -1, epoll_wait waits indefinitely. If timeout is set to 0, epoll_wait returns immediately even if there aren't any events available.

It is worth pointing out that the data field of the epoll_event will contain the same data that was set with epoll_ctl.

In case of success, epoll_wait returns the number of file descriptors ready for the requested I/O operation. (👉 Remember that in epoll_event we had to specify what type of I/O we're interested in.) In case of an error, -1 is returned.
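A sketch of how a typical wait loop uses these arguments. MAX_EVENTS and the handle_input stub are made-up names for illustration:

#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Stub standing in for real I/O handling. */
static void handle_input(int fd) {
    printf("fd %d is ready for reading\n", fd);
}

/* Sketch of an event loop: wait up to 5000 ms, then handle whatever is ready. */
static void run_loop(int epfd) {
    struct epoll_event events[MAX_EVENTS];

    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, 5000);
        if (n == -1) {
            perror("epoll_wait");
            return;
        }
        for (int i = 0; i < n; i++) {
            /* data holds exactly what we stored when calling epoll_ctl */
            if (events[i].events & EPOLLIN)
                handle_input(events[i].data.fd);
        }
    }
}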

Things to 🐻 in 🧠

There are a few tricky things about epoll.

Mind the difference between a file descriptor and a file description

This is where a good understanding of the concept of a file descriptor comes in handy.

A few words on file descriptors 🔬

When we think about file descriptors, we usually think about some abstract handle that we use to access a file (or a pipe, socket, etc.). That is already a good intuition. It is something abstract. Physically it's a non-negative integer, generally represented as an int. Usually, when created (e.g. by the open(2) syscall), it will be the lowest-numbered file descriptor not currently open in the process.

When open(2) is called, a file descriptor is returned, but underneath a new file description is created. A file description is an entry in the system-wide table of open files, sometimes referred to as the open file table. The file description keeps the file status flags, the access mode (e.g. write-only, read-only, read-write), the current file offset, and a reference to the inode of the file. Quoting the Linux manual: "A file descriptor is a reference to an open file description."

This is very important for a few reasons. For instance, when the file's path is removed or changed (e.g. to refer to a different file), it does not affect the link between the file descriptor and the file description.

Keep in mind that many file descriptors can point to a single file description. This happens, for instance, when a file descriptor is duplicated using the dup(2) syscall. The duplicated file descriptor refers to the same open file description. As a consequence, both file descriptors share the file's offset and status flags. The same behavior can be observed between two processes: a child process created with fork(2) inherits duplicates of the parent's file descriptors, which means the child's file descriptors refer to the same file descriptions as the parent's. This is why we sometimes want to mark our file descriptors with the O_CLOEXEC flag, so that they are closed once a forked process execs.

Note that each time we call open(2) a new file description is created. This means that there can be many file descriptions pointing to the same inode.
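This sharing is easy to observe by comparing dup(2) with a second open(2). A small, self-contained experiment (the file path is just an example of a readable file; error handling is omitted for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Two descriptors, one file description: dup(2) shares the offset. */
    int fd1 = open("/etc/hostname", O_RDONLY);
    int fd2 = dup(fd1);

    char c;
    read(fd1, &c, 1);   /* advances the shared offset */
    printf("offset seen via fd2: %ld\n", (long)lseek(fd2, 0, SEEK_CUR));   /* prints 1 */

    /* Two descriptors, two file descriptions: a second open(2) has its own offset. */
    int fd3 = open("/etc/hostname", O_RDONLY);
    printf("offset seen via fd3: %ld\n", (long)lseek(fd3, 0, SEEK_CUR));   /* prints 0 */

    close(fd1); close(fd2); close(fd3);
    return 0;
}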

Figure from "The Linux Programming Interface" by Michael Kerrisk (relationship between file descriptors, open file descriptions, and inodes).

👉 On Linux you can see the open file descriptors of a process in /proc/[pid]/fd. In /proc/[pid]/fdinfo you can find the value of the offset, flags (status flags), and mnt_id (mount point).

When is a file descriptor closed?

  • When close(2) is called
  • When the process exits
  • Upon exec(2), when the descriptor is marked with O_CLOEXEC (close on exec)

🔥 Now time for the bombshell - epoll does not track per-process file descriptors. Instead, it tracks the underlying file description. This has an important consequence. I think it was best explained in the epoll(7) manual:

Will closing a file descriptor cause it to be removed from all epoll sets automatically?

Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a file descriptor is duplicated via dup(2), dup2(2), fcntl(2) > F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the file descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.

To summarize this part: a process will continue to receive events via epoll_wait for a file descriptor that has been closed, as long as the file description it was referencing is still referenced by at least one open file descriptor.

Non-Blocking file descriptors

By default, reading from (or performing any other I/O operation on) a file descriptor will block when there is no data available. A file descriptor can be put in non-blocking mode; this way, I/O operations on such a file descriptor will not cause the process to wait - they return immediately with EAGAIN/EWOULDBLOCK instead.
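A descriptor is usually switched into non-blocking mode with fcntl(2); a sketch:

#include <fcntl.h>

/* Sketch: switch an existing file descriptor into non-blocking mode.
 * Afterwards, read(2)/write(2) return -1 with errno == EAGAIN/EWOULDBLOCK
 * instead of blocking when the operation cannot complete immediately. */
int make_non_blocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}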

Level-triggered and Edge-triggered

In a nutshell the difference can be boiled down to:

  • Level-triggering - a file descriptor is considered ready when it is possible to perform an I/O operation on it without blocking.
  • Edge-triggering - we are notified when there has been some I/O activity on a given file descriptor since it was last checked (e.g. there is new data pending to be read).

With level-triggering we can check a file descriptor's readiness at any time, whereas with edge-triggering there has to be a "change" (an edge). Because of that, we can repeat level-triggered checks many times; one reason for such an approach might be that we do not want to read all the available data at once, but rather in smaller chunks. The edge-triggered notification model discourages that approach - in fact, we are encouraged to do "as much I/O as possible" whenever a notification comes.

It is easiest to see the difference with an example. Imagine that we're monitoring a socket file descriptor. Some input data arrives on the socket. We call epoll_wait. Then we call epoll_wait again. On the first call we will be told that the socket file descriptor is ready, regardless of whether level-triggering or edge-triggering is used. However, on the second call only level-triggering will report the file descriptor as "ready". With edge-triggering, the second call will block, because there was no new input data on the socket.

This is also why edge-triggering is used with non-blocking file descriptors. In this approach, whenever epoll_wait returns a list of "ready" file descriptors, each of them should be processed by the appropriate I/O function until EAGAIN is returned. With the edge-triggered approach we must also guard against so-called file descriptor starvation.
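In code, that edge-triggered pattern looks roughly like the sketch below. It assumes fd is non-blocking and was registered with EPOLLIN | EPOLLET; the processing step is only hinted at:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: after an edge-triggered notification, drain the descriptor completely. */
void drain_fd(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* ... process n bytes of data ... */
            continue;
        }
        if (n == 0)
            break;                          /* EOF / peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;                          /* nothing left to read right now */
        perror("read");                     /* a real error */
        break;
    }
}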

Also, check the epoll(7) manpage. There is a very good use-case example which describes the difference.

Is epoll better than poll or select? 🤔

TLDR; Yes, it is. But it ain't perfect either.

The computational complexity of poll and select is O(N), where N is the number of file descriptors to be monitored. Each time you call poll(2) or select(2), the kernel has to check every file descriptor on the list. This is why we'd want to use poll(2) and select(2) only when the number of file descriptors to monitor is small. Another slowdown is that on each call a data structure describing the file descriptors to be monitored has to be passed to the kernel, and the kernel then returns a modified version of that data structure. The calling program then has to scan the returned structure to find out which file descriptors are ready.

On the other hand, the complexity of epoll is O(M), where M is the number of events that have occurred. We build our "interest list" with epoll_ctl only once per file descriptor; the kernel records it in a list associated with the underlying file description. Whenever some I/O activity makes a monitored file descriptor ready, the kernel adds an element to the "ready" list of the given epoll instance. Then, whenever epoll_wait is called, it only fetches items from that "ready" list. In contrast to select(2) and poll(2), epoll keeps its data structure in the kernel (in kernel space). Once that data structure is created, subsequent calls to epoll_wait do not have to pass any information about the monitored file descriptors to the kernel, and the kernel returns information only about the file descriptors that are ready.

Example application 👾

In this repo you will find an implementation of a simple "server" built around socket(2). There are three implementations - using poll(2), select(2), and epoll(7).

The application is very simple and aims to show the differences in the API. It creates a socket and binds it to localhost (you have to specify the port number as an argument). Once a connection is requested, accept(2) is called. When there is data ready to be read on the client's socket, it is read into a buffer and the server writes a response message.
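For orientation, here is a heavily condensed sketch of what the epoll variant of such a server boils down to. This is not the repository's code - the port parsing, the response text, and the (mostly absent) error handling are simplifications:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc != 2) { fprintf(stderr, "usage: %s port\n", argv[0]); return 1; }

    /* Listening socket bound to localhost on the given port. */
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons((unsigned short)atoi(argv[1]));
    bind(listener, (struct sockaddr *)&addr, sizeof(addr));
    listen(listener, SOMAXCONN);

    /* epoll instance watching the listening socket for incoming connections. */
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listener };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listener, &ev);

    struct epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listener) {
                /* New connection: accept it and start watching the client socket. */
                int client = accept(listener, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
            } else {
                /* Client data is ready: read it and write back a short response. */
                char buf[1024];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) { close(fd); continue; }
                write(fd, "hello from epoll\n", 17);
            }
        }
    }
}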

Demo animation: application.gif

Sources 📚

This repo is inspired by a number of other (much better) sources: