TCP/IP options for high-performance data transmission
TCP/IP data transfer
The data transfer in a TCP/IP network is usually block-based. From a
programmer’s point of view, sending data means issuing a series of “send data
block” requests. On a system level, sending an individual block of data could be
performed by a write() or sendfile() syscall. At the network
level, you will see more data blocks, usually called frames, which are ordered
sets of bytes with headers traveling across the wires. What is inside the frame
and its header is defined by several protocol layers, from the physical to the
application layer of the OSI model.
The length and sequence of network
packets are under the control of the programmer because the programmer chooses
the most appropriate application protocol to be used in a network connection.
Equally important, the programmer must select the way this protocol is
implemented in software. The TCP/IP protocol itself has many interoperable
implementations, so when two parties are communicating, each could have its own
low-level behavior—another fact the programmer should be aware
of.
Normally, the programmer need not worry about tinkering with the way
that the underlying operating system and network stack sends and receives
network data. The built-in algorithms define the low-level data organization and
transmission; however, there are some ways to influence the behavior of these
algorithms and gain more control over network connections. For example, if an
application protocol uses timeouts and retransmission, the programmer might want
to set or obtain the timeout parameters. He or she might also need to increase
the size of send and receive buffers to ensure uninterrupted information flow in
the network. The general way to change the conduct of the TCP/IP stack is
through so-called TCP/IP options. Let's take a look at how you can use them to
optimize the data transmission.
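As a rough illustration of the buffer and timeout adjustments mentioned above, here is a minimal sketch using the standard SO_SNDBUF, SO_RCVBUF, and SO_RCVTIMEO socket options. The helper names tune_socket() and get_recv_timeout_sec() and the values passed to them are hypothetical, and a receive timeout is only one of several timeout parameters an application might care about.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: request larger socket buffers and a receive timeout.
 * Returns 0 on success, -1 if any setsockopt() call fails. */
int tune_socket(int fd, int buf_bytes, int timeout_sec)
{
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    /* The kernel may adjust or cap the requested buffer sizes. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf_bytes, sizeof(buf_bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buf_bytes, sizeof(buf_bytes)) < 0)
        return -1;

    /* Blocking reads on fd give up after timeout_sec seconds. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        return -1;
    return 0;
}

/* Hypothetical helper: read back the receive timeout in whole seconds. */
long get_recv_timeout_sec(int fd)
{
    struct timeval tv;
    socklen_t len = sizeof(tv);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, &len) < 0)
        return -1;
    return (long)tv.tv_sec;
}
```

Because the kernel is free to round or limit buffer sizes, an application that depends on a particular size should read the value back with getsockopt() rather than assume the request was honored.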
TCP/IP options
There are many options that alter the behavior of the TCP/IP stack. Some of
them apply system-wide and can have adverse effects on other applications running
on the same computer, so they are normally available only to root.
We will concentrate on options that change the operations of an individual
connection or socket in TCP/IP terms.
The ioctl-style getsockopt()
and setsockopt() system calls provide the means to control socket
behavior. For example, to set the TCP_NODELAY option in Linux, you can write
code like that shown in Listing A.
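Listing A is not reproduced in this text. A minimal sketch of what such a call might look like follows; the helper names set_nodelay() and get_nodelay() are hypothetical and are not taken from the original listing.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable (on = 1) or disable (on = 0) TCP_NODELAY.
 * Returns 0 on success, -1 on error with errno set. */
int set_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Hypothetical helper: returns 1 if TCP_NODELAY is set, 0 if not,
 * and -1 if getsockopt() fails. */
int get_nodelay(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, &len) < 0)
        return -1;
    return on != 0;
}
```

Note that the option level is IPPROTO_TCP, not SOL_SOCKET, because TCP_NODELAY belongs to the TCP layer and not to the generic socket layer.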
Although
there are many TCP options to manipulate, we'll focus on just two of them here,
TCP_NODELAY and TCP_CORK, which both significantly influence the
behavior of a network connection. TCP_NODELAY is implemented on many UNIX
systems, but TCP_CORK is Linux-specific and relatively new; the Linux tcp(7)
manual page lists it as available since kernel 2.2. Other UNIX flavors could have
functionally similar options, notably the TCP_NOPUSH option on
BSD-derived systems, which is actually one part of the T/TCP
implementation.
TCP_NODELAY and TCP_CORK basically control
packet “Nagling,” or automatic concatenation of small packets into bigger frames
performed by the Nagle algorithm. John Nagle, after whom this process was named,
first implemented it in 1984 as a way to fight congestion on Ford Aerospace's
network. (See IETF RFC 896 for
more details.) The problem he solved, which RFC 896 calls the small-packet
problem, was congestion that occurred simply because widespread terminal
applications sent keystrokes one per packet, typically one byte of payload and
40 bytes of header, thus causing 4,000 percent overhead. Nagling became standard
and was widely implemented across the Internet. It is now considered a
default, but as we'll see, there are situations when turning it off is
desirable.
Another case involves waiting until we have the maximum amount of data the network can send at once, which benefits the performance of large data transfers, typically on file servers. The Nagle algorithm tries to accommodate these cases. But if you're sending a large amount of data, you can set the TCP_CORK option, which changes Nagling in the opposite way to TCP_NODELAY: instead of sending small packets immediately, the stack holds back partial frames until they fill up or the socket is uncorked. (On older kernels, TCP_CORK and TCP_NODELAY are mutually exclusive; Linux allows combining them only since 2.5.71.) Let's take a closer look at how this works.
Imagine an application that uses sendfile() to transfer bulk data. Application protocols usually require first sending some information that helps interpret the data, known as a header. Typically, the header is small. If TCP_NODELAY is set on the socket, the packet with the header is transmitted immediately and, in some cases (depending on internal packet counters), it can even trigger a request for acknowledgement that the packet was received by the other side. Thus, the transfer of the bulk data is delayed and unnecessary network traffic is exchanged.
But if we set the TCP_CORK option on the socket, the header is held back and combined with the bulk data that follows, and everything goes out in full-sized packets. When the bulk data transfer is finished, it is advisable to “uncork” the connection by unsetting the TCP_CORK option so that any partial frames that are left can go out. Uncorking is just as important as corking.
To sum it up, we recommend setting the TCP_CORK option when you're sure that you will be sending multiple data sets together (such as the header and body of an HTTP response), with no delays between them. This can greatly benefit the performance of WWW, FTP, and file servers, and it can simplify your code as well. Listing B provides an example.
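Listing B is not reproduced in this text. The following is a minimal, Linux-only sketch of the cork, send, uncork sequence described above, not the original listing. The helper names set_cork() and send_header_and_file() are hypothetical, the socket is assumed to be connected and blocking, and error handling is kept deliberately simple (for example, EINTR is not retried).

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set (on = 1) or clear (on = 0) TCP_CORK. */
int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Hypothetical helper: send a small header followed by file_len bytes of
 * file_fd as one corked burst. Returns 0 on success, -1 on error. */
int send_header_and_file(int sock, const void *hdr, size_t hdr_len,
                         int file_fd, size_t file_len)
{
    const char *p = hdr;
    off_t offset = 0;

    if (set_cork(sock, 1) < 0)
        return -1;

    /* While corked, a small header stays queued as a partial frame. */
    while (hdr_len > 0) {
        ssize_t n = write(sock, p, hdr_len);
        if (n < 0)
            goto fail;
        p += n;
        hdr_len -= (size_t)n;
    }

    /* sendfile() advances offset by the number of bytes it sent. */
    while ((size_t)offset < file_len) {
        ssize_t n = sendfile(sock, file_fd, &offset,
                             file_len - (size_t)offset);
        if (n <= 0)
            goto fail;
    }

    /* Uncork so that any remaining partial frame goes out. */
    return set_cork(sock, 0);

fail:
    set_cork(sock, 0);
    return -1;
}
```

A server might call send_header_and_file() once per response, passing the already formatted protocol header and an open descriptor for the file being served.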
Unfortunately, many popular programs do not take these considerations into account. For example, Eric Allman’s sendmail does not set any options on its sockets, although its performance is quite low anyway, so there may be nothing to optimize.
Apache HTTPD, the most popular Web server on the Internet, has the TCP_NODELAY option set on all its sockets, and its performance is regarded as satisfactory by most users. Why? The answer lies in implementation differences. BSD-derived TCP/IP stacks (notably FreeBSD) operate differently in this situation. When a large number of small data blocks is submitted for transmission in TCP_NODELAY mode, a large number of packets will be sent, one per write() call. However, the probability of introducing delays will be much lower because the counters that are responsible for requesting acknowledgements of delivery are byte-oriented and not packet-oriented (as they are in Linux). Thus, only the total size will matter. Whereas Linux asks for an acknowledgement after the first packet, FreeBSD will wait for hundreds of packets before doing the same.
In Linux, the effect of TCP_NODELAY could be quite different from what is expected by a developer who is used to BSD-derived TCP/IP stacks, and Apache on Linux performs worse than it could. The same is true for many other applications actively using TCP_NODELAY on Linux.
Get the best of both
Your data transmission needs won't always conform neatly to one option or the
other. In that case, you may want to take advantage of a more flexible approach
for controlling a network connection: Set TCP_CORK before sending a
series of data blocks that should be treated as a single message, and set
TCP_NODELAY before sending short messages that should be sent
immediately.
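One way to picture this switching is a pair of small helpers that move a socket between a "bulk" and an "interactive" mode. This is only a sketch for Linux: the names set_tcp_flag(), mode_bulk(), and mode_interactive() are hypothetical, and each helper clears one option before setting the other so that it does not rely on the two options being combinable.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: set one integer TCP-level option on sock. */
static int set_tcp_flag(int sock, int option, int on)
{
    return setsockopt(sock, IPPROTO_TCP, option, &on, sizeof(on));
}

/* Bulk mode: call before sending several blocks that form one message.
 * Partial frames are held back until the socket is uncorked. */
int mode_bulk(int sock)
{
    if (set_tcp_flag(sock, TCP_NODELAY, 0) < 0)
        return -1;
    return set_tcp_flag(sock, TCP_CORK, 1);
}

/* Interactive mode: call before sending short messages that must leave
 * immediately. Uncorking first pushes out anything still queued. */
int mode_interactive(int sock)
{
    if (set_tcp_flag(sock, TCP_CORK, 0) < 0)
        return -1;
    return set_tcp_flag(sock, TCP_NODELAY, 1);
}
```

A typical sequence would be mode_bulk(), several write() or sendfile() calls for the message, then mode_interactive() to flush the tail and return to low-latency behavior.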
Combined with a zero-copy approach and the sendfile()
syscall (as covered in a previous
article), this technique could significantly improve total system throughput
and decrease CPU load. Our experience in using this combined approach for
developing a name-based hosting subsystem for SWsoft’s Virtuozzo technology demonstrates it is
possible to achieve almost 9,000 HTTP requests per second on a 350-MHz Pentium
II PC, which was considered practically impossible before. The performance gain
is tremendous.