In a previous article, we explained how you can use the sendfile() syscall to
reduce the overhead of data transfer from a disk to a network. Now, we're going
to cover another aspect of network connection control that can help maximize
capabilities in real-life situations: setting TCP/IP options to
control socket behavior.
TCP/IP data transfer
The data transfer in a TCP/IP network is usually block-based. From a
programmer’s point of view, sending data means issuing a series of “send data
block” requests. On a system level, sending an individual block of data could be
performed by a write() or sendfile() syscall. At the network
level, you will see more data blocks, usually called frames, which are ordered
sets of bytes with headers traveling across the wires. What is inside the frame
and its header is defined by several protocol layers, from the physical to the
application layer of the OSI model.
The length and sequence of network
packets is under the control of the programmer because the programmer chooses
the most appropriate application protocol to be used in a network connection.
Equally important, the programmer must select the way this protocol is
implemented in software. The TCP/IP protocol itself has many interoperable
implementations, so when two parties are communicating, each could have its own
low-level behavior, which is another fact the programmer should be aware of.
Normally, the programmer need not worry about tinkering with the way
that the underlying operating system and network stack sends and receives
network data. The built-in algorithms define the low-level data organization and
transmission; however, there are some ways to influence the behavior of these
algorithms and provide more control over network connections. For example, if an
application protocol uses timeouts and retransmission, the programmer might want
to set or obtain the timeout parameters. He or she might also need to increase
the size of send and receive buffers to ensure uninterrupted information flow in
the network. The general way to change the conduct of the TCP/IP stack is
through so-called TCP/IP options. Let's take a look at how you can use them to
optimize the data transmission.
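As a rough illustration of the buffer and timeout adjustments mentioned above, the following sketch uses the standard SO_SNDBUF, SO_RCVBUF, and SO_RCVTIMEO socket-level options. The helper names (set_buffer_sizes, set_recv_timeout, get_recv_timeout) are hypothetical and chosen only for this example, and the receive timeout is just one of several timeout-related settings an application might care about.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Ask the kernel for larger send and receive buffers on one socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 * The kernel may round or cap the requested sizes. */
int set_buffer_sizes(int fd, int send_bytes, int recv_bytes)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &send_bytes, sizeof(send_bytes)) < 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                      &recv_bytes, sizeof(recv_bytes));
}

/* Set how long a blocking receive may wait, in whole seconds. */
int set_recv_timeout(int fd, long seconds)
{
    struct timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Read the receive timeout back, in seconds, or -1 on error. */
long get_recv_timeout(int fd)
{
    struct timeval tv;
    socklen_t len = sizeof(tv);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, &len) < 0)
        return -1;
    return (long)tv.tv_sec;
}
```

Reading a value back with getsockopt() after setting it is a cheap way to see what the kernel actually applied, since the stack is free to adjust the numbers you request.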
There are many options that alter the behavior of the TCP/IP stack. Using these
options can have adverse effects on other applications running on the same
computer, so they are normally unavailable for ordinary users (other than root).
We will concentrate on options that change the operations of an individual
connection or socket in TCP/IP terms.
The ioctl-style getsockopt()
and setsockopt() system calls provide the means to control socket
behavior. For example, to set the TCP_NODELAY option in Linux, it is
necessary to code as shown in Listing A.
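A minimal sketch of such a call, assuming a Linux or other POSIX system, might look like the following. The helper names set_nodelay() and get_nodelay() are illustrative only and are not taken from Listing A.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn TCP_NODELAY on (1) or off (0) for a single TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_nodelay(int fd, int enabled)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                      &enabled, sizeof(enabled));
}

/* Query the current state: 1 if set, 0 if clear, -1 on error. */
int get_nodelay(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

The option applies only to the socket it is set on, so other connections on the same machine keep their default behavior.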
Although there are many TCP options to manipulate, we'll focus on just two of them here,
TCP_NODELAY and TCP_CORK, which both significantly influence the
behavior of a network connection. TCP_NODELAY is implemented on many UNIX
systems, but TCP_CORK is Linux-specific and relatively new; it was first
implemented in the kernel version 2.4. Other UNIX flavors could have
functionally similar options, notably the TCP_NOPUSH option on
BSD-derived systems, which is actually one part of T/TCP.
TCP_NODELAY and TCP_CORK basically control
packet “Nagling,” or automatic concatenation of small packets into bigger frames
performed by the Nagle algorithm. John Nagle, after whom this process was named,
first implemented this as a way to fight Ford’s network congestion in 1984. (See
IETF RFC 896 for
more details.) The problem he solved was the so-called silly window
syndrome, where congestion occurred simply because widespread terminal
applications sent keystrokes one per packet, typically one byte of payload and
40 bytes of header, thus causing 4,000 percent overhead. Nagling became standard
and was aggressively implemented over the Internet. It is now considered a
default, but as we'll see, there are situations when turning it off is desirable.
Let's say an application just issued a request to send a small block of data.
Now, we could either send the data immediately or wait for more data. Some
interactive and client-server applications will benefit greatly if we send the
data right away. For example, when we are sending a short request and awaiting a
large response, the relative overhead is low compared to the total amount of
data transferred, and the response time could be much better if the request is
sent immediately. This is achieved by setting the TCP_NODELAY option on
the socket, which disables the Nagle algorithm.
Another case involves
waiting until we have the maximum amount of data the network can send at once,
benefiting the performance of large data transfers, typically those of file
servers. The Nagle algorithm looks to accommodate these cases. But if you're
sending a large amount of data, you could set the TCP_CORK option to
disable Nagling in a way that's opposite to how TCP_NODELAY does it. (TCP_CORK
and TCP_NODELAY are mutually exclusive.) Let's take a closer
look at how this works.
Imagine that the application using sendfile()
transfers bulk data. Application protocols usually require
sending some information that helps interpret the data first, known as a header.
Typically, the header is small, and the TCP_NODELAY option
is set on the socket.
The packet with the header will be transmitted immediately and, in some cases
(depending on internal packet counters), it could even cause a request of
acknowledgement that this packet was successfully received by the other side.
Thus, the transfer of bulk data will be delayed and unnecessary network traffic
will be generated.
But if we set the TCP_CORK option on the socket, our
header packet will be padded with the bulk data and all the data will be
transferred automatically in packets filled to their maximum size. When finished with
the bulk data transfer, it is advisable to “uncork” the connection by unsetting
the TCP_CORK option so that any partial frames that are left can go out.
This is just as important as “corking.”
To sum it up, we recommend setting the TCP_CORK option when you're sure that
you will be sending multiple data sets together (such as the header and body of
an HTTP response), with no delays between them. This can greatly benefit the
performance of WWW, FTP, and file servers, as well as simplifying your life.
Listing B provides an example.
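The cork, send, and uncork sequence can be sketched roughly as follows for a connected TCP socket on Linux. This is an illustration under stated assumptions and not a reproduction of Listing B: the function names set_cork(), write_all(), and send_corked() are hypothetical, and plain write() stands in for whatever call (such as sendfile()) the application really uses for the body.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux-specific: set (1) or clear (0) TCP_CORK on a TCP socket. */
int set_cork(int fd, int enabled)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK,
                      &enabled, sizeof(enabled));
}

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send a small header followed by a body as one corked burst. */
int send_corked(int fd, const char *header, size_t header_len,
                const char *body, size_t body_len)
{
    int rc = 0;

    if (set_cork(fd, 1) < 0)
        return -1;
    if (write_all(fd, header, header_len) < 0 ||
        write_all(fd, body, body_len) < 0)
        rc = -1;
    /* Always uncork, even after an error, so that any partial
     * frame still queued in the kernel is pushed out. */
    if (set_cork(fd, 0) < 0)
        rc = -1;
    return rc;
}
```

Uncorking in every exit path is the design choice that matters here: a socket left corked can hold back the final partial frame.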
Many popular programs do not take these considerations into account. For
example, Eric Allman’s sendmail
does not set any options on its sockets, although its performance is quite low
anyway, so there may be nothing to optimize.
Apache HTTPD, the most popular Web server on the Internet, has the TCP_NODELAY
option set on all its sockets, and its performance is regarded as satisfactory
by most users. Why?
The answer lies in implementation differences. BSD-derived TCP/IP stacks
(notably FreeBSD) operate differently in this situation. When submitting a large
number of small data blocks for transmission in TCP_NODELAY mode, a large
number of packets will be sent, one per write() call. However,
the probability of introducing delays will be much lower because the counters
that are responsible for requesting acknowledgements of delivery are
byte-oriented and not packet-oriented (as in Linux). Thus, only total size will
matter. Whereas Linux asks for acknowledgement after the first packet, FreeBSD
will wait for hundreds of packets before doing the same.
In Linux, the effect of TCP_NODELAY could be quite different from what is
expected by a developer who is used to BSD-derived TCP/IP stacks, and Apache on
Linux performs worse than it could. The same is true for many other applications
actively using TCP_NODELAY.
Get the best of both
Your data transmission needs won't always conform neatly to one option or the
other. In that case, you may want to take advantage of a more flexible approach
for controlling a network connection: Set TCP_CORK before sending a
series of data that should be considered as a single message and set
TCP_NODELAY before sending short messages that should be sent immediately.
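One way to picture this switching is a small helper that moves a socket between a "bulk" mode and an "interactive" mode. The sketch below is an assumption-laden illustration for Linux: the enum and the names set_tcp_flag() and set_send_mode() are invented for this example. It clears one option before setting the other so that the two are never enabled together; the Linux tcp(7) man page notes that the two options can be combined only since kernel 2.5.71.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical modes for this example only. */
enum send_mode { MODE_BULK, MODE_INTERACTIVE };

static int set_tcp_flag(int fd, int option, int value)
{
    return setsockopt(fd, IPPROTO_TCP, option, &value, sizeof(value));
}

/* MODE_BULK:        TCP_NODELAY off, then TCP_CORK on, for a series of
 *                   writes that form one logical message.
 * MODE_INTERACTIVE: TCP_CORK off (which lets queued data go out),
 *                   then TCP_NODELAY on, for short messages.
 * Returns 0 on success, -1 on error. */
int set_send_mode(int fd, enum send_mode mode)
{
    if (mode == MODE_BULK) {
        if (set_tcp_flag(fd, TCP_NODELAY, 0) < 0)
            return -1;
        return set_tcp_flag(fd, TCP_CORK, 1);
    }
    if (set_tcp_flag(fd, TCP_CORK, 0) < 0)
        return -1;
    return set_tcp_flag(fd, TCP_NODELAY, 1);
}
```

A server might call set_send_mode(fd, MODE_BULK) before writing a header and file body, then set_send_mode(fd, MODE_INTERACTIVE) before a short reply that should not wait.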
Combined with a zero-copy approach and the sendfile() syscall (as covered in a
previous article), this technique could significantly improve total system throughput
and decrease CPU load. Our experience in using this combined approach for
developing a name-based hosting subsystem for SWsoft’s Virtuozzo technology demonstrates it is
possible to achieve almost 9,000 HTTP requests per second on a 350-MHz Pentium
II PC, which was considered practically impossible before. The performance gain