In a previous article, we explained how you can use the sendfile() syscall to reduce the overhead of data transfer from a disk to a network. Now, we're going to cover another aspect of network connection control that can help maximize sendfile() capabilities in real-life situations: setting TCP/IP options to control socket behavior.
TCP/IP data transfer
The data transfer in a TCP/IP network is usually block-based. From a programmer’s point of view, sending data means issuing a series of “send data block” requests. On a system level, sending an individual block of data could be performed by a write() or sendfile() syscall. At the network level, you will see more data blocks, usually called frames, which are ordered sets of bytes with headers traveling across the wires. What is inside the frame and its header is defined by several protocol layers, from the physical to the application layer of the OSI model.
The length and sequence of network packets are under the control of the programmer, because the programmer chooses the most appropriate application protocol to be used in a network connection. Equally important, the programmer must select the way this protocol is implemented in software. The TCP/IP protocol itself has many interoperable implementations, so when two parties are communicating, each could have its own low-level behavior. This is another fact the programmer should be aware of.
Normally, the programmer need not worry about tinkering with the way that the underlying operating system and network stack send and receive network data. The built-in algorithms define the low-level data organization and transmission; however, there are some ways to influence the behavior of these algorithms and gain more control over network connections. For example, if an application protocol uses timeouts and retransmission, the programmer might want to set or obtain the timeout parameters. The programmer might also need to increase the size of the send and receive buffers to ensure uninterrupted information flow in the network. The general way to change the behavior of the TCP/IP stack is through so-called TCP/IP options. Let's take a look at how you can use them to optimize data transmission.
There are many options that alter the behavior of the TCP/IP stack as a whole. Because these system-wide options can adversely affect other applications running on the same computer, they are normally unavailable to ordinary users (anyone other than root). We will concentrate on options that change the behavior of an individual connection, or socket in TCP/IP terms.
The ioctl-style getsockopt() and setsockopt() system calls provide the means to control socket behavior. For example, to set the TCP_NODELAY option in Linux, you would write code like that shown in Listing A.
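As a minimal sketch of the kind of call described above, the following C fragment sets TCP_NODELAY on a TCP socket and reads the value back. The helper names set_tcp_nodelay() and get_tcp_nodelay() are illustrative assumptions and do not come from any library; only setsockopt(), getsockopt(), IPPROTO_TCP, and TCP_NODELAY are standard.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable (on = 1) or disable (on = 0) TCP_NODELAY.
 * Returns 0 on success, or -1 with errno set by setsockopt(). */
int set_tcp_nodelay(int sockfd, int on)
{
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Hypothetical helper: read the option back.
 * Returns 1 if set, 0 if clear, -1 on error. */
int get_tcp_nodelay(int sockfd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

A caller would typically invoke set_tcp_nodelay(fd, 1) right after socket() or accept() and check the return value before sending any data.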
Although there are many TCP options to manipulate, we'll focus on just two of them here, TCP_NODELAY and TCP_CORK, both of which significantly influence the behavior of a network connection. TCP_NODELAY is implemented on many UNIX systems, but TCP_CORK is Linux-specific and relatively new; the tcp(7) manual page lists it as available since Linux 2.2, and it is present in the 2.4 kernels. Other UNIX flavors may have functionally similar options, notably TCP_NOPUSH on BSD-derived systems, which is actually part of the T/TCP implementation.
TCP_NODELAY and TCP_CORK both control packet “Nagling,” the automatic concatenation of small packets into bigger frames performed by the Nagle algorithm. John Nagle, after whom the process is named, first implemented it in 1984 as a way to fight network congestion at Ford Aerospace. (See IETF RFC 896 for more details.) The problem he solved, which the RFC calls the small-packet problem, was congestion that occurred simply because widespread terminal applications sent keystrokes one per packet, typically one byte of payload and 40 bytes of header, thus causing 4,000 percent overhead. Nagling became standard and was widely implemented across the Internet. It is now considered a default, but as we'll see, there are situations when turning it off is desirable.
Let's say an application has just issued a request to send a small block of data. We could either send the data immediately or wait for more data. Some interactive and client-server applications benefit greatly if we send the data right away. For example, when we are sending a short request and awaiting a large response, the relative overhead is low compared to the total amount of data transferred, and the response time could be much better if the request is sent immediately. This is achieved by setting the TCP_NODELAY option on the socket, which disables the Nagle algorithm.
The other case involves waiting until we have the maximum amount of data the network can send at once, which benefits large data transfers such as those performed by file servers. The Nagle algorithm tries to accommodate such cases automatically. But if you're sending a large amount of data, you can set the TCP_CORK option, which overrides Nagling in a way that's opposite to how TCP_NODELAY does it: partial frames are held back instead of being pushed out at once. (TCP_CORK and TCP_NODELAY were originally mutually exclusive; according to the tcp(7) manual page, they can be combined since Linux 2.5.71.) Let's take a closer look at how this works.
Imagine that an application using sendfile() transfers bulk data. Application protocols usually require sending some information that helps interpret the data first, known as a header. Suppose the header is small and TCP_NODELAY is set on the socket. The packet with the header will be transmitted immediately and, in some cases (depending on internal packet counters), it could even trigger a request for acknowledgement that this packet was successfully received by the other side. Thus, the transfer of bulk data will be delayed and unnecessary network traffic will be exchanged.
But if we set the TCP_CORK option on the socket, our header will be coalesced with the bulk data, and all the data will go out automatically in full-sized packets. When finished with the bulk data transfer, it is advisable to “uncork” the connection by unsetting the TCP_CORK option so that any partial frames that are left can go out. This is just as important as “corking.”
To sum it up, we recommend setting the TCP_CORK option when you're sure that you will be sending multiple pieces of data together (such as the header and body of an HTTP response), with no delays between them. This can greatly benefit the performance of WWW, FTP, and file servers, as well as simplifying your life. Listing B provides an example.
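The cork, send, uncork sequence described above can be sketched in C roughly as follows. This is an illustration for Linux only, under stated assumptions: send_header_and_file(), set_tcp_cork(), and write_all() are hypothetical helper names, the file is assumed to be a regular file that does not change while it is being sent, and error handling is deliberately minimal (for example, EINTR is not retried).

```c
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Set (on = 1) or clear (on = 0) the Linux-specific TCP_CORK option. */
static int set_tcp_cork(int sockfd, int on)
{
    return setsockopt(sockfd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Write the whole buffer, looping over short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: send `header`, then the contents of the file at
 * `path`, while the socket is corked. Returns 0 on success, -1 on error. */
int send_header_and_file(int sockfd, const char *header, size_t header_len,
                         const char *path)
{
    struct stat st;
    off_t offset = 0;
    int rc = -1;
    int file_fd = open(path, O_RDONLY);

    if (file_fd < 0)
        return -1;
    if (fstat(file_fd, &st) < 0 || set_tcp_cork(sockfd, 1) < 0) {
        close(file_fd);
        return -1;
    }

    /* Header and body are queued behind the cork, so the header can
     * share a packet with the first bytes of the file. */
    if (write_all(sockfd, header, header_len) == 0) {
        rc = 0;
        while (offset < st.st_size) {
            /* sendfile() advances `offset` by the number of bytes sent. */
            ssize_t n = sendfile(sockfd, file_fd, &offset,
                                 (size_t)(st.st_size - offset));
            if (n <= 0) {   /* error, or the file shrank underneath us */
                rc = -1;
                break;
            }
        }
    }

    /* Uncork even after a failure so queued data is not held back. */
    if (set_tcp_cork(sockfd, 0) < 0)
        rc = -1;
    close(file_fd);
    return rc;
}
```

A server might call send_header_and_file(client_fd, hdr, hdr_len, "index.html") once per response, where client_fd, hdr, and hdr_len are the caller's own connected socket and prepared response header.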
Unfortunately, many popular programs do not take these considerations into account. For example, Eric Allman’s sendmail does not set any options on its sockets, although its performance is quite low anyway, so there may be little to gain from optimizing them.
Apache HTTPD, the most popular Web server on the Internet, has the TCP_NODELAY option set on all its sockets, and its performance is regarded as satisfactory by most users. Why? The answer lies in implementation differences. BSD-derived TCP/IP stacks (notably FreeBSD) operate differently in this situation. When a large number of small data blocks is submitted for transmission in TCP_NODELAY mode, a large number of packets will be sent, one per write() call. However, the probability of introducing delays is much lower because the counters that are responsible for requesting acknowledgements of delivery are byte-oriented rather than packet-oriented (as they are in Linux). Thus, only the total size matters. Whereas Linux asks for an acknowledgement after the first packet, FreeBSD will wait for hundreds of packets before doing the same.
In Linux, the effect of TCP_NODELAY could be quite different from what is expected by a developer who is used to BSD-derived TCP/IP stacks, and Apache on Linux performs worse than it could. The same is true for many other applications actively using TCP_NODELAY on Linux.
Get the best of both
Your data transmission needs won't always conform neatly to one option or the other. In that case, you may want to take advantage of a more flexible approach for controlling a network connection: Set TCP_CORK before sending a series of data that should be considered as a single message and set TCP_NODELAY before sending short messages that should be sent immediately.
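One way to picture this mixed approach is the sketch below, written in C for Linux. The names begin_message(), end_message(), and send_immediate() are hypothetical helpers introduced only for illustration. The sketch assumes a kernel on which TCP_CORK and TCP_NODELAY may both be set on the same socket, which the tcp(7) manual page says is possible since Linux 2.5.71; on such kernels TCP_CORK takes precedence while it is set.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set or clear a boolean TCP-level socket option. */
static int set_tcp_flag(int sockfd, int option, int on)
{
    return setsockopt(sockfd, IPPROTO_TCP, option, &on, sizeof(on));
}

/* Hypothetical helper: cork the socket before writing the parts of a
 * multi-part message, so partial frames are held back. */
int begin_message(int sockfd)
{
    return set_tcp_flag(sockfd, TCP_CORK, 1);
}

/* Hypothetical helper: uncork, letting any remaining partial frame go. */
int end_message(int sockfd)
{
    return set_tcp_flag(sockfd, TCP_CORK, 0);
}

/* Hypothetical helper: send a short message without Nagle delay.
 * TCP_NODELAY stays enabled on the socket afterwards. */
int send_immediate(int sockfd, const void *buf, size_t len)
{
    const char *p = buf;

    if (set_tcp_flag(sockfd, TCP_NODELAY, 1) < 0)
        return -1;
    while (len > 0) {
        ssize_t n = write(sockfd, p, len);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A bulk reply would be wrapped in begin_message() and end_message(), with the usual write() or sendfile() calls in between, while a short interactive reply would go through send_immediate() alone.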
Combined with a zero-copy approach and the sendfile() syscall (as covered in a previous article), this technique could significantly improve total system throughput and decrease CPU load. Our experience in using this combined approach to develop a name-based hosting subsystem for SWsoft’s Virtuozzo technology demonstrates that it is possible to achieve almost 9,000 HTTP requests per second on a 350-MHz Pentium II PC, a rate that was previously considered practically impossible.