by Arpit Kumar
07 Apr, 2024
13 minute read
Understanding Asynchronous I/O in Linux - io_uring

Explore the evolution of I/O multiplexing from select(2) to epoll(7), culminating in the advanced io_uring framework


Input/Output (I/O) operations are fundamental processes in computer systems. These operations facilitate data transfer between the central processing unit (CPU) and external devices such as disk drives, network interfaces, keyboards, and displays. By establishing communication between the CPU and these peripherals, I/O operations ensure efficient interaction and data exchange, contributing to the overall performance and functionality of the computer system.

Traditional I/O methods and their limitations

Traditionally, I/O operations in Linux and other operating systems have been managed using synchronous and blocking system calls such as read() and write(). In these methods, when an application requests an I/O operation, the system halts the execution of the application until the operation is completed. This synchronous blocking behavior can lead to inefficiencies, particularly in scenarios involving multiple concurrent I/O operations or when dealing with slow devices.

Additionally, traditional I/O methods often involve a significant amount of overhead, as each system call entails context switches between user space and kernel space, as well as copying data between these spaces. These inefficiencies can limit the performance and scalability of I/O-intensive applications, such as databases, web servers, and file servers.

Asynchronous I/O Model

Asynchronous I/O, also known as non-blocking I/O, is a programming model that allows applications to perform I/O operations without waiting for the operations to complete. Instead of blocking the execution of the application until the I/O operation finishes, asynchronous I/O enables the application to continue processing other tasks while the I/O operation is in progress. This model offers several benefits, including improved performance, scalability, and responsiveness.

The primary components involved in asynchronous I/O in Linux include:

File Descriptors: File descriptors represent open files, sockets, or other I/O resources. They are represented by integers and are used by the operating system to identify the resources. File descriptors are passed to system calls to perform I/O operations. Event Notification Mechanism: This mechanism notifies the application when I/O events, such as readiness to read or write, occur on a file descriptor. In Linux, this is typically achieved using mechanisms like select, poll, epoll or kqueue. These mechanisms allow applications to register interest in certain file descriptors and receive notifications when events occur.

Callback Functions or Signals: Once an I/O event occurs, the application needs a way to handle it. This is often done using callback functions or signal handlers. When an I/O event occurs, the registered callback function is invoked asynchronously to handle the event.

Kernel Space and User Space Interaction: Asynchronous I/O involves coordination between the kernel and user space. The kernel manages the actual I/O operations and notifies the user space when events occur. User space applications register their interest in certain events and provide callback functions to handle these events.

Simplified working of asynchronous I/O in Linux

Initiation: The application initiates an asynchronous I/O operation by requesting the kernel to perform an I/O operation on a specific file descriptor. Instead of blocking and waiting for the operation to complete, the application continues executing other tasks. Registration: The application registers interest in I/O events related to the file descriptor using the event notification mechanism (e.g., epoll). It specifies which events it wants to be notified about (e.g., readiness to read, readiness to write).

Asynchronous Execution: The kernel performs the I/O operation asynchronously in the background while the application continues executing other tasks. When the I/O operation completes or when certain events occur (e.g., data is available for reading, buffer space is available for writing), the kernel notifies the application.

Event Handling: Upon receiving a notification, the application invokes the registered callback function or signal handler to handle the I/O event. This allows the application to process the data, perform additional I/O operations, or take other actions as needed.

Completion: Once the application has processed the I/O event, it may choose to initiate additional asynchronous I/O operations or perform other tasks. The asynchronous I/O cycle continues until the application no longer needs to perform I/O operations on the file descriptor.

Overall, asynchronous I/O in Linux allows applications to perform non-blocking I/O operations efficiently, improving responsiveness and allowing for better utilization of system resources.

Advantages of the Asynchronous I/O Model

Improved Performance: Asynchronous I/O reduces the overhead associated with blocking operations, such as context switches and thread synchronization. By allowing the application to continue processing other tasks while waiting for I/O operations to complete, asynchronous I/O can lead to better overall performance and resource utilization.

Scalability: Asynchronous I/O enables applications to handle a large number of concurrent I/O operations without being constrained by the limitations of traditional synchronous models. This scalability is particularly beneficial for I/O-intensive applications, such as web servers, databases, and network services, which often need to handle numerous simultaneous connections or requests.

Responsive Applications: By decoupling I/O operations from the application’s main execution thread, asynchronous I/O can improve the responsiveness of the application. This allows the application to maintain a smooth user experience, even when performing I/O-bound tasks or interacting with slow or unpredictable I/O devices.

Resource Efficiency: Asynchronous I/O reduces the need for dedicated threads or processes to manage blocking I/O operations. Instead of allocating a separate thread for each I/O operation, asynchronous I/O allows applications to multiplex multiple I/O operations on a smaller number of threads, leading to better resource utilization and reduced overhead.

While asynchronous I/O offers many benefits, it also introduces complexity and challenges for application developers. Managing asynchronous I/O operations requires careful handling of callbacks, event loops, and concurrency issues. Additionally, not all operating systems and programming languages provide robust support for asynchronous I/O, which can limit its adoption in certain environments.

io_uring as a modern asynchronous I/O framework

System calls for I/O multiplexing, from the early days of select(2) to the cutting-edge io_uring, have been marked by significant advancements in efficiency, scalability, and ease of use. Let’s delve into this evolutionary path.

select(2) - synchronous I/O multiplexing

select(2) emerged in the early 1980s as one of the earliest I/O multiplexing system calls. It allowed processes to monitor multiple file descriptors for various events like readability, writability, or exceptions.

The select(2) system call in Unix-like operating systems is limited by file descriptor (FD) numbers because it relies on a bit mask to represent the file descriptors that it will monitor for activity.

File descriptors are integers that represent open files, sockets, or other I/O resources in a Unix-like operating system. The select(2) system call allows a program to monitor multiple file descriptors to see if I/O is possible on any of them without blocking.

In the select(2) call, you provide three sets of file descriptors: the set of file descriptors to monitor for read activity, the set to monitor for write activity, and the set to monitor for exceptions. These sets are represented as bit masks, with each bit corresponding to a file descriptor.

The limitation arises from the size of the bit masks used to represent the file descriptors. In many implementations, these masks have a fixed size, often defined by constants like FD_SETSIZE. This limits the number of file descriptors that can be monitored simultaneously by select(2).

The time complexity of select(2) is O(n), where n is the highest file descriptor number plus one. This means that as the number of file descriptors increases, the time taken by select(2) also increases linearly.

For example, if FD_SETSIZE is defined as 1024, then you can monitor file descriptors numbered from 0 to 1023 using select(2). If you need to monitor more file descriptors than this limit, you would typically need to use an alternative mechanism, such as poll(2) or poll(2) in Linux, which do not have this limitation..

poll(2) - wait for some event on a file descriptor

In response to the limitations of select(2), poll(2) was introduced in the early 1990s as a more advanced alternative. It addressed some of the scalability issues of select(2) by allowing processes to monitor a larger number of file descriptors.

poll(2) provided more control over the polling process and scaled better to large numbers of file descriptors compared to select(2). It still suffers from O(n) as it initializes a set of file descriptors (struct pollfd array) and specifies the events it wants to monitor for each descriptor (e.g., read, write, error).

epoll(7) - I/O event notification facility

Late 1990s saw the development of epoll(7), representing a significant leap forward in I/O multiplexing. It was designed to be more efficient and scalable than both select(2) and poll(2).

epoll(7) improves upon poll(2) in several ways, primarily addressing performance and scalability concerns:

epoll(7) uses a scalable event notification mechanism that does not require iterating over all file descriptors. Instead, it maintains a data structure that efficiently tracks and manages file descriptors, resulting in constant time complexity O(1), regardless of the number of file descriptors being monitored. It typically uses red-black tree or hash table to store and manage file descriptors and associated events.

poll(2) operates in level-triggered mode, where it notifies the application whenever the state of a file descriptor is ready for I/O operations. This can lead to redundant notifications in some scenarios. epoll(7) provides both level-triggered and edge-triggered modes. In edge-triggered mode, it notifies the application only when there’s a change in the readiness state of a file descriptor, reducing unnecessary notifications and potentially improving performance.

poll relies on the poll(2) system call, which involves copying file descriptor sets between user and kernel space, potentially leading to additional overhead while epoll(7) leverages kernel event mechanisms directly, minimizing context switches and overhead associated with copying data between user and kernel space. This direct integration results in improved performance and lower latency.

io_uring - linux kernel system call

Despite its advantages, epoll(7) wasn’t universally supported across all platforms, which posed compatibility challenges for developers.

To address the limitations of traditional I/O methods, the io_uring framework was introduced in the Linux kernel version 5.1. io_uring provides a more efficient and scalable mechanism for managing asynchronous I/O operations.

It leverages features like kernel-bypass and shared memory to reduce context switches and data copying overhead. It aims to provide low-latency, high-throughput I/O operations suitable for modern applications, including databases, web servers, and storage systems.

io_uring achieves above by utilizing a ring buffer data structure for submitting and processing I/O requests. This ring buffer, maintained by the kernel, contains entries representing I/O operations to be performed. Applications can submit I/O requests to the ring buffer without blocking and continue their execution. The kernel then processes these requests asynchronously, notifying the application when the operations are completed.

Request flow while using io_uring

Submission Queue (SQ): Applications submit I/O requests to the submission queue, specifying the operation type (read, write, etc.) and relevant parameters (file descriptor, buffer location, offset, etc.). This queue resides in user space.

Completion Queue (CQ): After processing the submitted requests, the kernel places completion events in the completion queue, indicating the status and result of each I/O operation. This queue also resides in user space.

Kernel Handling: The kernel efficiently processes I/O operations asynchronously, minimizing context switches and data copying between user and kernel space. It manages the submission and completion queues, ensuring that I/O operations are executed in an optimal manner.

Ring Buffer: io_uring utilizes a ring buffer structure to facilitate communication between the application and the kernel. This buffer holds the submission and completion queues, allowing efficient data exchange without the need for frequent memory allocations or deallocations.

Batch Submission: Traditionally, when submitting multiple I/O operations, each operation would be submitted individually, incurring overhead due to context switches and system calls. Batch submission allows multiple I/O operations to be submitted together as a single batch, reducing overhead and improving efficiency.

For example, instead of submitting multiple write operations to write data to a file one by one, io_uring allows the application to submit a batch of write operations at once, reducing the number of system calls and improving throughput.

Linked I/O Operations: Linked I/O operations enable chaining multiple I/O operations together, where the completion of one operation triggers the initiation of the next operation in the chain. This feature is particularly useful for scenarios where I/O operations depend on each other or need to be executed in a specific order.

For instance, in a database system, when updating a record involves multiple disk operations (e.g., reading data, modifying data, writing data), linked I/O operations can be used to ensure that each operation is performed sequentially and efficiently.

Efficient Buffer Management: io_uring optimizes buffer management by allowing applications to use shared memory buffers between user and kernel space, eliminating the need for data copying during I/O operations. This reduces memory overhead and improves performance, especially for large data transfers.


io uring buffer rings
Source: (https://qdrant.tech/articles/io_uring/)

Additionally, io_uring supports scatter-gather I/O, where data can be scattered across multiple buffers or gathered from multiple buffers in a single I/O operation. This flexibility in buffer management allows applications to efficiently handle complex data structures without incurring additional overhead.

Watch this video for checking out the detailed components and behind the scene of io_uring working https://cloudflare.tv/shows/low-level-linux/missing-manuals-io-uring-worker-pool/5vplD9vP


io uring buffer rings
Source: (https://cloudflare.tv/shows/low-level-linux/missing-manuals-io-uring-worker-pool/5vplD9vP)

By leveraging event-driven architectures, non-blocking I/O operations, and effective concurrency models, asynchronous I/O runtimes empower developers to tackle complex I/O-bound tasks efficiently. Asynchronous programming paradigms continue to evolve, driving innovation in software development and shaping the future of asynchronous computing.

This code is a simple demonstration of how to set up and use the io_uring interface in a C program. It initializes an io_uring instance, opens a file specified by the user, reads data from the file using asynchronous I/O operations, and then closes the file and releases the resources associated with the io_uring instance.

Sample Program from liburing repo (https://github.com/axboe/liburing/blob/master/examples/io_uring-test.c)

/* SPDX-License-Identifier: MIT */
/*
* Simple app that demonstrates how to setup an io_uring interface,
* submit and complete IO against it, and then tear it down.
*
* gcc -Wall -O2 -D_GNU_SOURCE -o io_uring-test io_uring-test.c -luring
*/
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include "liburing.h"

#define QD 4

int main(int argc, char *argv[])
{
   struct io_uring ring;
   int i, fd, ret, pending, done;
   struct io_uring_sqe *sqe;
   struct io_uring_cqe *cqe;
   struct iovec *iovecs;
   struct stat sb;
   ssize_t fsize;
   off_t offset;
   void *buf;

   if (argc < 2)
   {
       printf("%s: file\n", argv[0]);
       return 1;
   }

   ret = io_uring_queue_init(QD, &ring, 0);
   if (ret < 0)
   {
       fprintf(stderr, "queue_init: %s\n", strerror(-ret));
       return 1;
   }

   fd = open(argv[1], O_RDONLY);
   if (fd < 0)
   {
       perror("open");
       return 1;
   }

   if (fstat(fd, &sb) < 0)
   {
       perror("fstat");
       return 1;
   }

   fsize = 0;
   iovecs = calloc(QD, sizeof(struct iovec));
   for (i = 0; i < QD; i++)
   {
       if (posix_memalign(&buf, 4096, 4096))
           return 1;
       iovecs[i].iov_base = buf;
       iovecs[i].iov_len = 4096;
       fsize += 4096;
   }

   offset = 0;
   i = 0;
   do
   {
       sqe = io_uring_get_sqe(&ring);
       if (!sqe)
           break;
       io_uring_prep_readv(sqe, fd, &iovecs[i], 1, offset);
       offset += iovecs[i].iov_len;
       i++;
       if (offset >= sb.st_size)
           break;
   } while (1);

   ret = io_uring_submit(&ring);
   if (ret < 0)
   {
       fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
       return 1;
   }
   else if (ret != i)
   {
       fprintf(stderr, "io_uring_submit submitted less %d\n", ret);
       return 1;
   }

   done = 0;
   pending = ret;
   fsize = 0;
   for (i = 0; i < pending; i++)
   {
       ret = io_uring_wait_cqe(&ring, &cqe);
       if (ret < 0)
       {
           fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
           return 1;
       }

       done++;
       ret = 0;
       if (cqe->res != 4096 && cqe->res + fsize != sb.st_size)
       {
           fprintf(stderr, "ret=%d, wanted 4096\n", cqe->res);
           ret = 1;
       }
       fsize += cqe->res;
       io_uring_cqe_seen(&ring, cqe);
       if (ret)
           break;
   }

   printf("Submitted=%d, completed=%d, bytes=%lu\n", pending, done,
          (unsigned long)fsize);
   close(fd);
   io_uring_queue_exit(&ring);
   return 0;
}

References

Recent Posts

Building a Rate Limiter or RPM based Throttler for your API/worker
Building a simple rate limiter / throttler based on GCRA algorithm and redis script
MicroVMs, Isolates, Wasm, gVisor: A New Era of Virtualization
Exploring the evolution and nuances of serverless architectures, focusing on the emergence of MicroVMs as a solution for isolation, security, and agility. We will discuss the differences between containers and MicroVMs, their use cases in serverless setups, and highlights notable MicroVM implementations by various companies. Focusing on FirecrackerVM, V8 isolates, wasmruntime and gVisor.
Sum of Bytes Tech Stack
Details about this blog tech stack and how I migrated from Ghost on Hetzner to Phoenix based markdown blog with bunch of technologies (Phoenix + Nimble Publisher + Neon Tech + Hetzner + Kamal Deploy + wsrv.nl) connected to provide me great experience

Get the "Sum of bytes" newsletter in your inbox
No spam. Just the interesting tech which makes scale possible.