TCP/IP data transfer
Data transfer in a TCP/IP network is usually block-based. From a
programmer’s point of view, sending data means issuing a series of “send data
block” requests. At the system level, sending an individual block of data could be
performed by a write() or sendfile() syscall. At the network
level, you will see more data blocks, usually called frames, which are ordered
sets of bytes with headers traveling across the wires. What is inside the frame
and its header is defined by several protocol layers, from the physical to the
application layer of the OSI model.
The length and sequence of network
packets are largely under the programmer's control, because the programmer chooses
the most appropriate application protocol for a network connection.
Equally important, the programmer must select the way this protocol is
implemented in software. TCP/IP itself has many interoperable
implementations, so when two parties communicate, each could have its own
low-level behavior, which is another fact the programmer should be aware
of.
Normally, the programmer need not worry about tinkering with the way
that the underlying operating system and network stack send and receive
network data. The built-in algorithms define the low-level data organization and
transmission; however, there are some ways to influence the behavior of these
algorithms and gain more control over network connections. For example, if an
application protocol uses timeouts and retransmission, the programmer might want
to set or obtain the timeout parameters. He or she might also need to increase
the size of send and receive buffers to ensure uninterrupted information flow in
the network. The general way to change the behavior of the TCP/IP stack is
through so-called TCP/IP options. Let's take a look at how you can use them to
optimize data transmission.
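As an illustration of the buffer and timeout tuning mentioned above, here is a minimal sketch in C. The helper names tune_socket() and get_send_buffer() are hypothetical, and the timeout shown is the socket-level receive timeout (SO_RCVTIMEO), which is only one assumed reading of "timeout parameters." The kernel may adjust or cap the buffer size you request, so the sketch reads the value back instead of trusting it.

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: request larger send/receive buffers and a
 * receive timeout on a socket. Returns 0 on success, -1 on the first
 * setsockopt() call that fails (errno is left set by that call). */
int tune_socket(int fd, int buf_bytes, int timeout_sec)
{
    struct timeval tv;
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &buf_bytes, sizeof(buf_bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buf_bytes, sizeof(buf_bytes)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
        return -1;
    return 0;
}

/* Hypothetical helper: read back the send buffer size the kernel
 * actually applied, or -1 on error. */
int get_send_buffer(int fd)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len) < 0)
        return -1;
    return size;
}
```

A caller would typically invoke tune_socket(fd, 65536, 5) right after creating the socket and then log the result of get_send_buffer(fd) to see what the system really granted.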
TCP/IP options
There are many options that alter the behavior of the TCP/IP stack. Some of these
options can have adverse effects on other applications running on the same
computer, so they are normally available only to root, not to ordinary users.
We will concentrate on options that change the operation of an individual
connection, or socket in TCP/IP terms.
The ioctl-style getsockopt()
and setsockopt() system calls provide the means to control socket
behavior. For example, to set the TCP_NODELAY option in Linux, you
call setsockopt() as shown in Listing A.
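Listing A itself is not reproduced in this text. As a stand-in, the following is a minimal sketch of what setting and reading TCP_NODELAY usually looks like; it is not the author's original listing, and the helper names set_nodelay() and get_nodelay() are hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn TCP_NODELAY on (enable = 1) or off (enable = 0) for a TCP socket.
 * Returns 0 on success, -1 on error, like setsockopt() itself. */
int set_nodelay(int fd, int enable)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &enable, sizeof(enable));
}

/* Return 1 if TCP_NODELAY is set, 0 if it is not, -1 on error. */
int get_nodelay(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

Note that the option level is IPPROTO_TCP rather than SOL_SOCKET, because TCP_NODELAY belongs to the TCP layer and not to the generic socket layer.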
Although
there are many TCP options to manipulate, we'll focus on just two of them here,
TCP_NODELAY and TCP_CORK, both of which significantly influence the
behavior of a network connection. TCP_NODELAY is implemented on many UNIX
systems, but TCP_CORK is Linux-specific and relatively new; the Linux tcp(7)
man page lists it as available since kernel version 2.2. Other UNIX flavors could have
functionally similar options, notably the TCP_NOPUSH option on
BSD-derived systems, which is actually part of the T/TCP
implementation.
TCP_NODELAY and TCP_CORK basically control
packet “Nagling,” the automatic concatenation of small packets into bigger frames
performed by the Nagle algorithm. John Nagle, after whom this process was named,
first implemented it in 1984 as a way to fight network congestion at Ford Aerospace. (See
IETF RFC 896 for more details.) The problem he solved, which RFC 896 calls the
small-packet problem and which is related to the so-called silly window
syndrome, was congestion that occurred simply because widespread terminal
applications sent keystrokes one per packet, typically one byte of payload and
40 bytes of header, thus causing 4,000 percent overhead. Nagling became standard
and was widely implemented across the Internet. It is now considered the
default, but as we'll see, there are situations when turning it off is
desirable.
Get the best of both
Your data transmission needs won't always conform neatly to one option or the
other. In that case, you may want to take advantage of a more flexible approach
for controlling a network connection: Set TCP_CORK before sending a
series of data blocks that should be considered a single message, clear it
once the message is complete, and set
TCP_NODELAY before sending short messages that should be sent
immediately.
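The combined approach described above can be sketched as follows. This is a Linux-only illustration under stated assumptions, not a definitive implementation: the helper names set_tcp_flag(), write_all(), send_corked(), and send_immediately() are hypothetical, and error handling is kept minimal (for example, interrupted writes are not retried).

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set or clear one TCP-level boolean option such as TCP_CORK. */
static int set_tcp_flag(int fd, int option, int on)
{
    return setsockopt(fd, IPPROTO_TCP, option, &on, sizeof(on));
}

/* Write the whole buffer, looping over partial writes.
 * Simplification: an interrupted write (EINTR) is treated as an error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send a header and a body as one logical message: cork the socket,
 * queue both parts, then uncork so the kernel flushes what is pending. */
int send_corked(int fd, const void *hdr, size_t hdr_len,
                const void *body, size_t body_len)
{
    int rc = 0;

    if (set_tcp_flag(fd, TCP_CORK, 1) < 0)
        return -1;
    if (write_all(fd, hdr, hdr_len) < 0 || write_all(fd, body, body_len) < 0)
        rc = -1;
    /* Always remove the cork, even if a write failed. */
    if (set_tcp_flag(fd, TCP_CORK, 0) < 0)
        rc = -1;
    return rc;
}

/* Send a short message that should not wait for coalescing. */
int send_immediately(int fd, const void *msg, size_t len)
{
    if (set_tcp_flag(fd, TCP_NODELAY, 1) < 0)
        return -1;
    return write_all(fd, msg, len);
}
```

On a connected socket, a server might call send_corked() for a response made of a header plus a payload, and send_immediately() for a short acknowledgment that the peer is waiting on.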
Combined with a zero-copy approach and the sendfile()
syscall (as covered in a previous
article), this technique could significantly improve total system throughput
and decrease CPU load. Our experience in using this combined approach for
developing a name-based hosting subsystem for SWsoft’s Virtuozzo technology demonstrates that it is
possible to achieve almost 9,000 HTTP requests per second on a 350-MHz Pentium
II PC, which was considered practically impossible before. The performance gain
is tremendous.
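To make the combination of TCP_CORK and sendfile() concrete, here is a minimal Linux-only sketch. It is not the code used in the hosting subsystem mentioned above; the helper name send_file_corked() is hypothetical, and it assumes file_fd refers to a regular file whose size fstat() can report.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send a header from user space followed by the
 * contents of file_fd, all under TCP_CORK so that the header and the
 * start of the file can share frames. Returns 0 on success, -1 on error. */
int send_file_corked(int sock, const char *hdr, size_t hdr_len, int file_fd)
{
    struct stat st;
    off_t offset = 0;
    int cork_on = 1, cork_off = 0, rc = 0;

    if (fstat(file_fd, &st) < 0)
        return -1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_CORK, &cork_on, sizeof(cork_on)) < 0)
        return -1;

    /* The header is copied from user space with ordinary write() calls. */
    while (hdr_len > 0) {
        ssize_t n = write(sock, hdr, hdr_len);
        if (n < 0) {
            rc = -1;
            break;
        }
        hdr += n;
        hdr_len -= (size_t)n;
    }

    /* The file body goes through sendfile(), which moves data inside the
     * kernel and advances offset by the number of bytes sent. */
    while (rc == 0 && offset < st.st_size) {
        ssize_t n = sendfile(sock, file_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n <= 0) {
            rc = -1;
            break;
        }
    }

    /* Remove the cork so any remaining partial frame is transmitted. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_CORK, &cork_off, sizeof(cork_off)) < 0)
        rc = -1;
    return rc;
}
```

The design choice here is to uncork on every exit path after the cork has been set, so a failed write does not leave data held back on the connection.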