One of the most important problems with supporting rapid Internet growth is the
growing need to streamline data transfer from the server to the client. Ask any
number of people to define the criteria for the optimal way to transfer data,
and you are likely to receive different answers each time: One person might say
the answer lies in simply maximizing transfer speed; for another, it might mean
a guaranteed maximum delay; and a third person may mention zero packet loss, among other criteria.
In reality, there is no single answer to this data transfer question. Most people still say “megabytes per second” when they talk about performance; however, they should really think about CPU time spent per megabyte transferred. Real-time applications, such as audio or video streaming, are sensitive to delays. Protocol-level load balancing and name-based hosting on private IP addresses (part of an OS virtualization technology called Virtuozzo) are impractical without a CPU-efficient implementation. One physical box loaded with Virtuozzo could host thousands of sandboxed Web sites, so it is very important that data transfer uses as little CPU as possible.
Sendfile() is a relatively new operating system kernel primitive that was introduced to solve the aforementioned problems. It is available in recent kernels (UNIX, Linux, Solaris 8). Technically, sendfile() is a system call for transferring data between the disk and a TCP socket, but it can also move data between other file descriptors, provided that the source descriptor refers to a file that supports mmap()-like operations (on Linux it cannot be a socket). Implementations are not exactly the same on all systems, but the differences are minor, and we assume Linux kernel version 2.4 here.
The prototype of this syscall is:
#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
The strength of sendfile() is that it provides access to a powerful feature of the current Linux networking stack: a “zero-copy” mechanism for transmitting TCP frames directly from host memory into network card buffers. To understand zero-copy and sendfile() better, let us consider what had to be done to send a file to a socket in the pre-sendfile era. First, a data buffer is allocated in user space. Then, the read() syscall copies data from the file into this buffer. (Usually, this operation copies data from disk into the OS cache and then copies it again from the cache into user space, switching between user and kernel context along the way.) After that, the write() syscall sends the contents of the buffer to the network:
int out_fd, in_fd;
char buffer[BUFLEN];
ssize_t n;
/* unsubstantial code skipped for clarity */
n = read(in_fd, buffer, BUFLEN); /* syscall, context switch; data is copied into user space */
if (n > 0)
    write(out_fd, buffer, n);    /* syscall, context switch; data is copied back into the kernel */
The OS kernel had to copy all data at least twice: from kernel space into user space and back. Each operation required a context-switch procedure, involving many complex and CPU-intensive operations. The system utility vmstat can display the current context switch rate on most UNIX-like operating systems. Look at the column called “cs” for the number of context switches that happened during the sample period. Try different load types to see how they affect this parameter.
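For a per-process view rather than the system-wide figure that vmstat reports, the counters returned by getrusage() can be sampled before and after a piece of work. The following is only a minimal sketch: the helper names context_switches(), switches_during(), and sleep_briefly() are hypothetical and do not come from the article, and the counters cover voluntary and involuntary context switches of the calling process only.

```c
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: total context switches (voluntary + involuntary)
 * charged to the calling process so far, or -1 on error. */
long context_switches(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_nvcsw + ru.ru_nivcsw;
}

/* Hypothetical sample load: a short sleep makes the process give up the CPU. */
void sleep_briefly(void)
{
    usleep(1000);
}

/* Hypothetical helper: run one load function and report how many
 * context switches were counted while it ran, or -1 on error. */
long switches_during(void (*load)(void))
{
    long before = context_switches();
    long after;

    load();
    after = context_switches();
    return (before < 0 || after < 0) ? -1 : after - before;
}
```

Passing different load functions to switches_during() gives a rough way to compare, for example, a read()/write() loop against a sendfile() call.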
Detailing the switching process
Let us dig deeper into the process of switching context to better understand the related expenses. Many operations are involved in making a system call from user space. For example, it is necessary to switch the virtual memory context from user space into the kernel and back again. This process requires the execution of relatively expensive (in terms of CPU cycles) instructions that work with the segment descriptor tables called the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). Another structure, the TSS (Task State Segment), also requires attention.
Moreover, there are some implicit and very expensive operations that are not caused directly by the context switch procedure. We can illustrate this with the virtual-to-physical address translation needed to support virtual memory. The data required for this translation (the page table) is itself stored in memory, so each CPU request for a memory location requires one or more accesses to main memory (to read the translation table entries) in addition to the access that fetches the requested data. Contemporary CPUs normally include a translation look-aside buffer, commonly abbreviated as TLB. A TLB serves as a cache for page table entries, storing the most recently accessed ones. (This is a simplified explanation.) A TLB miss has a large potential cost: several memory accesses and possibly the execution of the page fault handler. Copying a lot of data will inevitably evict the existing TLB entries, leaving the TLB holding only entries for the pages of the copied data.
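To get a feel for the scale of this effect, the number of page translations a copy needs can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: the buffers are page-aligned, the page size is passed in by the caller (4 KB is a common value, but it is an assumption here), and the function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: number of pages needed to hold `bytes` bytes,
 * assuming the buffer starts on a page boundary. */
size_t pages_spanned(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Hypothetical helper: a copy touches every page of the source and every
 * page of the destination, so it needs twice as many translations. */
size_t translations_for_copy(size_t bytes, size_t page_size)
{
    return 2 * pages_spanned(bytes, page_size);
}
```

With 4 KB pages, copying 1 MB touches 256 source pages and 256 destination pages, that is, 512 distinct translations competing for TLB entries.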
With the sendfile() zero-copy approach, the data is read directly from the disk into the OS cache memory using Direct Memory Access (DMA) hardware, if possible. The TLB cache is left intact. Applications that use the sendfile() primitive perform well because the data never passes through user-space memory, which minimizes overhead. Data to be transferred is usually taken directly from system buffers, without context switching and without thrashing the cache. Thus, using sendfile() in server applications can significantly reduce CPU load.
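The approach described above might look like the following on Linux. This is a minimal sketch, not a production implementation: the function name send_file_by_path() is hypothetical, the descriptor is assumed to be in blocking mode, and only EINTR is retried.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send the whole file at `path` to out_fd using
 * sendfile(). Returns the number of bytes sent, or -1 on error. */
ssize_t send_file_by_path(int out_fd, const char *path)
{
    struct stat st;
    off_t offset = 0;
    int in_fd = open(path, O_RDONLY);

    if (in_fd < 0)
        return -1;
    if (fstat(in_fd, &st) != 0) {
        close(in_fd);
        return -1;
    }

    while (offset < st.st_size) {
        /* The kernel advances offset by the number of bytes it sent,
         * so no user-space buffer is involved at any point. */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        if (n <= 0)
            break;              /* error, or the file shrank while sending */
    }

    close(in_fd);
    return offset == st.st_size ? (ssize_t)offset : -1;
}
```

Compared with the read()/write() pair shown earlier, there is no buffer to allocate, and one system call can replace many read()/write() iterations.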
Replacing read() with mmap() in our example will not change much. The mmap() syscall asks the kernel to map some bytes from the file (or other object) specified by the file descriptor into virtual memory. Attempting to read data from this memory results in disk operations. With this call we can eliminate the explicit read() and the buffer allocation, because write() can send the mapped memory straight to the socket. Nevertheless, this approach still causes TLB flushing, so the CPU load per byte transferred will be higher than with sendfile().
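For comparison, the mmap() variant could be sketched as follows. The function name send_file_mmap() is hypothetical, the whole file is mapped in one piece (which assumes it fits into the address space), and error handling is kept to a minimum.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map the file at `path` and write it to out_fd.
 * Returns the number of bytes written, or -1 on error. */
ssize_t send_file_mmap(int out_fd, const char *path)
{
    struct stat st;
    size_t len, sent = 0;
    char *data;
    int in_fd = open(path, O_RDONLY);

    if (in_fd < 0)
        return -1;
    if (fstat(in_fd, &st) != 0) {
        close(in_fd);
        return -1;
    }
    if (st.st_size == 0) {      /* mmap() rejects a zero-length mapping */
        close(in_fd);
        return 0;
    }

    len = (size_t)st.st_size;
    data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, in_fd, 0);
    close(in_fd);               /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return -1;

    while (sent < len) {
        /* Touching the mapped pages is what triggers the disk reads. */
        ssize_t n = write(out_fd, data + sent, len - sent);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            break;
        sent += (size_t)n;
    }

    munmap(data, len);
    return sent == len ? (ssize_t)sent : -1;
}
```

The explicit read() and the user-space buffer are gone, but the data is still addressed through user-space page mappings, which is where the extra TLB pressure comes from.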
The zero-copy approach and applications development
The zero-copy approach should be used whenever possible when developing performance-sensitive client-server applications. Imagine that we want to run up to 1,000 separate Apache Web servers with private IP addresses on a single physical server, using the Virtuozzo technology mentioned above. To do this, we have to process thousands of requests per second at the TCP protocol level to parse the clients’ requests and extract the name of the host. This is a very processor-intensive task in and of itself, and without optimization and zero-copy support, performance will be limited by the CPU rather than the network. A fine-tuned, zero-copy, sendfile()-based implementation yields up to 9,000 HTTP requests per second, even on a relatively slow 350 MHz Pentium II processor.
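The host-name extraction step mentioned above can be illustrated with a small parser. This is only a sketch and not how Virtuozzo does it: the function name extract_host() is hypothetical, the request headers are assumed to sit complete in one NUL-terminated buffer with CRLF line endings, and trailing whitespace in the header value is not trimmed.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper: copy the value of the Host header from an HTTP
 * request into `out`. Returns 0 on success, -1 if the header is missing
 * or does not fit into `out`. */
int extract_host(const char *req, char *out, size_t out_len)
{
    const char *line = req;

    while (*line != '\0') {
        const char *eol = strstr(line, "\r\n");

        if (eol == NULL || eol == line)
            break;              /* incomplete line, or end of the headers */

        if ((size_t)(eol - line) >= 5 && strncasecmp(line, "Host:", 5) == 0) {
            const char *value = line + 5;
            size_t n;

            while (value < eol && (*value == ' ' || *value == '\t'))
                value++;        /* skip optional leading whitespace */
            n = (size_t)(eol - value);
            if (n == 0 || n >= out_len)
                return -1;
            memcpy(out, value, n);
            out[n] = '\0';
            return 0;
        }
        line = eol + 2;         /* move to the start of the next line */
    }
    return -1;
}
```

Once the host name is known, the request can be routed to the right virtual server, and the response body can be sent with sendfile().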
However, zero-copy sendfile() is not a panacea for all problems. In particular, to minimize the number of network operations, the sendfile() syscall should be used together with the TCP/IP option called TCP_CORK. Our next article will discuss the applications benefiting from this option and the related issues.
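As a preview, combining the two on Linux might look like the sketch below. The helper names set_cork() and send_response() are hypothetical, the socket is assumed to be a connected, blocking TCP socket, and a short write of the header is simply treated as an error.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: turn TCP_CORK on (1) or off (0) for a TCP socket.
 * Returns 0 on success, -1 on error. */
int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Hypothetical helper: send a header followed by file_len bytes of
 * file_fd. While the socket is corked, the kernel holds back partial
 * frames, so the header and the start of the file can share a packet. */
int send_response(int sock, const char *hdr, size_t hdr_len,
                  int file_fd, size_t file_len)
{
    off_t offset = 0;

    if (set_cork(sock, 1) != 0)
        return -1;

    if (write(sock, hdr, hdr_len) != (ssize_t)hdr_len) {
        set_cork(sock, 0);
        return -1;
    }

    while ((size_t)offset < file_len) {
        ssize_t n = sendfile(sock, file_fd, &offset,
                             file_len - (size_t)offset);
        if (n <= 0) {
            set_cork(sock, 0);
            return -1;
        }
    }

    /* Removing the cork flushes whatever partial frame is still queued. */
    return set_cork(sock, 0);
}
```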