Use sendfile() to optimize data transfer


In reality, there is no single answer to this data transfer question. Most people still say “megabytes per second” when they talk about performance, but they should really be thinking about CPU time spent per megabyte transferred. Real-time applications, such as audio or video streaming, are quickly mired in delays when the CPU becomes the bottleneck. Protocol-level load balancing and private-IP name-based hosting support (part of the OS virtualization technology called Virtuozzo) are impossible to implement without a CPU-efficient data path. One physical box running Virtuozzo can host thousands of sandboxed Web sites, so it is very important that data transfer use as little CPU as possible.
A relatively new operating-system kernel primitive, sendfile(), was introduced to solve these problems. It is available in recent kernels of several operating systems (UNIX variants, Linux, Solaris 8). Technically, sendfile() is a system call for data transfer between a disk file and a TCP socket, but it can also be used for moving data between any two file descriptors. Implementations differ slightly from system to system, but the differences are minor; in this article we assume the Linux 2.4 kernel.
The prototype of this syscall is:
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
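As a quick illustration (not taken from the original text, and with error handling kept to a bare minimum), a file can be pushed to an already-connected socket roughly as follows; the function name send_whole_file and the descriptor sock_fd are illustrative only:

#include <fcntl.h>         /* open() */
#include <unistd.h>        /* close() */
#include <sys/stat.h>      /* fstat() */
#include <sys/sendfile.h>  /* sendfile() on Linux */

/* Sketch: push an entire file to an already-connected socket. */
static int send_whole_file(int sock_fd, const char *path)
{
    struct stat st;
    off_t offset = 0;
    int file_fd = open(path, O_RDONLY);

    if (file_fd < 0)
        return -1;
    if (fstat(file_fd, &st) < 0) {
        close(file_fd);
        return -1;
    }
    /* sendfile() may transfer fewer bytes than requested, so loop;
       the kernel advances offset past the last byte sent */
    while (offset < st.st_size)
        if (sendfile(sock_fd, file_fd, &offset, st.st_size - offset) < 0)
            break;  /* real code would examine errno here */
    close(file_fd);
    return offset == st.st_size ? 0 : -1;
}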
The strength of sendfile() is that it gives access to a powerful feature of the current Linux networking stack: the “zero-copy” mechanism, which transmits data directly from host memory into network card buffers. To understand zero-copy and sendfile() better, let us consider what had to be done to send a file to a socket in the pre-sendfile era. First, a data buffer is allocated in user space. Then the read() syscall is used to copy data from the file into this buffer. (Usually this operation copies the data from disk into the OS cache and then copies it again from the cache into user space; each syscall also performs a so-called “context switch.”) After that, the write() syscall sends the contents of the buffer to the network:
int out_fd, in_fd;       /* socket and file descriptors */
char buffer[BUFLEN];     /* user-space staging buffer */
…
/* unsubstantial code skipped for clarity */
…
read(in_fd, buffer, BUFLEN);   /* syscall: context switch, copy kernel -> user */
write(out_fd, buffer, BUFLEN); /* syscall: context switch, copy user -> kernel */
In this scheme the OS kernel copies all data at least twice: from kernel space into user space and back again. Each call also requires a context switch, itself a complex and CPU-intensive procedure. The system utility vmstat can display the current context-switch rate on most UNIX-like operating systems (for example, vmstat 1 prints a new sample every second on Linux). Look at the column called “cs” for the number of context switches that happened during the sample period. Experiment with different load types to see the effect they have on this figure.
Detailing the switching process
Let us dig deeper into the process of switching context to understand the related expenses better. Many operations are involved in making a system call from user space. For example, the processor must switch from the user-space view of virtual memory to the kernel's and back again. This requires executing instructions that are relatively expensive in terms of CPU cycles and that work with processor descriptor tables such as the Global Descriptor Table (GDT) and the Local Descriptor Table (LDT). Another structure, the Task State Segment (TSS), also requires attention.
Moreover, there are some implicit and very expensive operations not caused directly by the context-switch procedure itself. We can illustrate this with the virtual-to-physical address translation needed to support virtual memory. The data required for this translation (the page table) is also stored in memory, so each CPU request for a memory location may require one or more accesses to main memory (to read the translation table entries) in addition to the access that fetches the requested data. A contemporary CPU normally includes a translation look-aside buffer, commonly abbreviated as TLB, which serves as a cache for page table entries, storing the most recently accessed ones. (This is a simplified explanation.) A TLB miss has a large potential cost: several extra memory accesses and, in the worst case, execution of the page-fault handler. Copying a lot of data inevitably evicts the TLB's contents, leaving it filled only with entries for the copied data.
With the sendfile() zero-copy approach, the data is read directly from the disk into the OS cache memory using Direct Memory Access (DMA) hardware, where possible. The TLB cache is left intact. Applications using the sendfile() primitive perform well because the system call never touches the data through user-space memory and therefore adds minimal overhead: data to be transferred is usually taken directly from system buffers, without context switching and without thrashing the cache. Thus, using sendfile() in server applications can significantly reduce CPU load.
Replacing read() with mmap() in our example does not change much. The mmap() syscall maps some bytes from the file (or other object) specified by the file descriptor into virtual memory, and the first attempt to read from that memory triggers the disk operations. This lets us drop the explicit read() and the user-space buffer allocation, because write() can send the mapped memory straight to the socket. Nevertheless, the operation still flushes the TLB cache, so the CPU load per byte transferred stays higher than with sendfile(); a sketch of this variant follows.
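A hedged sketch of the mmap()-plus-write() variant (again not from the original text; the function name send_with_mmap is illustrative, and short writes are ignored for brevity):

#include <fcntl.h>      /* open() */
#include <unistd.h>     /* write(), close() */
#include <sys/mman.h>   /* mmap(), munmap() */
#include <sys/stat.h>   /* fstat() */

/* Sketch: map the file and write the mapping to the socket.
   The explicit read() and the user-space staging buffer disappear,
   but the copy into socket buffers (and the TLB pressure) remains. */
static int send_with_mmap(int sock_fd, const char *path)
{
    struct stat st;
    void *map;
    int file_fd = open(path, O_RDONLY);

    if (file_fd < 0)
        return -1;
    if (fstat(file_fd, &st) < 0) {
        close(file_fd);
        return -1;
    }
    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, file_fd, 0);
    if (map == MAP_FAILED) {
        close(file_fd);
        return -1;
    }
    write(sock_fd, map, st.st_size);   /* real code would loop on short writes */
    munmap(map, st.st_size);
    close(file_fd);
    return 0;
}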
The zero-copy approach and application development
The zero-copy approach should be used whenever possible in performance-sensitive client-server application development. Imagine that we want to run up to 1,000 separate Apache Web servers with private IP addresses on a stand-alone physical server, using the aforementioned Virtuozzo technology. To do this, we have to process thousands of requests per second at the TCP protocol level to parse each client's request and extract the host name. This is a very processor-intensive task in and of itself, and without optimization and zero-copy support, performance will be limited by the CPU rather than the network. A fine-tuned zero-copy sendfile()-based implementation yields up to 9,000 HTTP requests per second, even on a relatively slow 350 MHz Pentium II processor.
However, zero-copy sendfile() is not a panacea. In particular, to minimize the number of network operations, the sendfile() syscall should be used together with the TCP option called TCP_CORK. Our next article will discuss the applications that benefit from this option and the related issues.
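For readers who want to experiment before then, here is a minimal, Linux-specific sketch (not from the original text) of how the option is typically toggled around a sendfile()-based response; sock_fd is assumed to be a connected TCP socket:

#include <sys/socket.h>   /* setsockopt() */
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_CORK */

int on = 1, off = 0;
/* "Cork" the socket: headers written with write() and the body sent with
   sendfile() are accumulated into full-sized TCP segments. */
setsockopt(sock_fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
/* ... write() the HTTP headers, then sendfile() the file body ... */
setsockopt(sock_fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)); /* uncork: flush */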