Enhancing TCP Performance Through the Large Window and SACK Options
Software Engineer
Novell, Inc.
banumita@novell.com
Special thanks to Srinivas Athreya and Rajendra Kumar Gupta of Novell for their assistance with this AppNote.
01 Dec 2002
This AppNote describes two TCP extensions designed to enhance the performance of TCP over high bandwidth and long delay networks: Large Window and Selective Acknowledgement (SACK). It describes the nature of these networks and explains why the original TCP design could not scale to these types of networks. It then looks at two TCP extensions that enhance TCP performance over these types of networks. It discusses how applications can be enhanced using these TCP features, and covers the solutions that have been provided with NetWare 6 to enable TCP to scale reliably in these networks.
- Introduction
- Extending TCP for High Performance
- How Applications Can Benefit from These TCP Options
- Support for These TCP Options on NetWare
- Conclusion
Topics: Large Window, Selective Acknowledgement (SACK), TCP/IP, network protocols, performance tuning and optimization
Products: NetWare 6 TCP/IP stack
Audience: network administrators, integrators, developers
Level: intermediate
Prerequisite Skills: familiarity with TCP/IP
Operating System: NetWare 6.x
Tools: none
Sample Code: no
Introduction
Transmission Control Protocol (TCP) is the reliable, connection-oriented, transport layer service used by numerous applications today. The services it provides range from reliable file transfer to Web-based transaction services to remote backup, and so on. Fundamental to TCP is its ability to provide sequenced and ordered delivery of data in the face of varying path characteristics such as transmission rates, delays and data loss.
The design of TCP was flexible enough to cater to varying transmission rates in the range of 300 bps to 800 Mbps. However, with the arrival of high bandwidth technologies such as fiber optic networks, TCP was pushed to its design limits and needed to be enhanced to optimally utilize the bandwidth available on these networks. This enhancement came in the form of a standard set of TCP extensions, provided by the Internet Engineering Task Force (IETF).
At the outset, you need to understand the nature of these high bandwidth/long delay networks. A "network" here refers to a collection of connected networks, such as the Internet, through which a TCP connection traces a path. It is the characteristics of the path as a whole that define the throughput of the TCP connection, rather than the characteristics of each of the networks lying in the path. When the path has high bandwidth and long delay, the maximum number of bytes that can be in transit on this TCP connection at any time is given by the product of the bandwidth and the delay. The path is also known as a pipe, and keeping (bandwidth * delay) bytes of data in transit at any given time is referred to as "filling the pipe." Networks with very high (bandwidth * delay) products are also known as LFNs (long fat networks; "LFN" is pronounced something like "elephant").
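As a rough illustration of the (bandwidth * delay) product, the following Python sketch computes how many bytes are needed to fill a pipe. The two example paths are hypothetical figures chosen only for illustration, and the delay is assumed here to be the round-trip time of the path.

```python
ORIGINAL_TCP_MAX_WINDOW = 65535  # largest value a 16-bit window field can hold


def pipe_size_bytes(bandwidth_bps: int, round_trip_ms: int) -> int:
    """Return the (bandwidth * delay) product of a path, in bytes."""
    # bits per second * milliseconds / 1000 gives bits; / 8 gives bytes
    return bandwidth_bps * round_trip_ms // 8000


# Hypothetical paths, not measurements from the AppNote.
lan_pipe = pipe_size_bytes(10_000_000, 1)      # 10 Mbps, 1 ms round trip
wan_pipe = pipe_size_bytes(45_000_000, 500)    # 45 Mbps, 500 ms round trip

for name, size in (("LAN", lan_pipe), ("WAN", wan_pipe)):
    verdict = "fits in" if size <= ORIGINAL_TCP_MAX_WINDOW else "exceeds"
    print(f"{name}: pipe of {size} bytes {verdict} the original 64 KB window")
```

With these assumed figures, the LAN pipe is far smaller than 64 KB, while the WAN pipe is several megabytes and cannot be filled by an unscaled window.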
The original design of TCP allows a maximum of only 64 KB of data to be sent at any point in time for a given connection. This amount is not sufficient to fill large pipes of size greater than 64 KB. In order to work around this problem and provide greater performance, TCP was extended to support transfer of data sizes greater than 64 KB. This extension is called the Large Window (LW) option, or "window scale" option.
Another issue with the original design of TCP was that the loss of multiple (two or more) segments caused entire pipes of data to be drained and transmission to resume at a far lower rate than before the loss was detected. This grossly underutilized the fat pipes and degraded performance. To work around this problem, TCP was extended to support Selective Acknowledgements (SACK) of transmitted data.
In the TCP protocol, along with the basic control information that is sent in the TCP header of a TCP segment, optional information can be sent as add-ons to the basic header. These add-ons are termed TCP "options." LW and SACK are provided as TCP options.
Extending TCP for High Performance
As discussed above, TCP options are provided as add-ons to the basic TCP header. This section briefly describes both the TCP header as well as the format for TCP options in general. For details, refer to RFC 793, available online at http://www.ietf.org/rfc.html.
The following table represents the TCP Header Schematic (each row represents 32 bits).
Source Port | Destination Port
Sequence Number
Acknowledgement Number
Data Offset | Reserved | Flags | Window
Checksum | Urgent Pointer
TCP Options . . . | Padding
Data . . .
The TCP Option format is as follows:
Type (1 byte) | Length (1 byte) | Value (Length - 2 bytes)
These options are negotiated during the initial three-way handshake when the SYN, SYN-ACK, and ACK segments are exchanged. If either end point establishing the connection does not support a TCP option, the supporting end point disables that option for that particular connection.
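The type, length, and value layout described above can be walked with a short loop. The Python sketch below is illustrative only: the function name and sample bytes are made up for this example. It relies on two details from RFC 793 that the table does not show: option type 0 (end of option list) and type 1 (no-operation) are single bytes with no length field.

```python
def parse_tcp_options(raw: bytes) -> list[tuple[int, bytes]]:
    """Split the options area of a TCP header into (type, value) pairs."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:          # End of Option List: single byte, stop parsing
            break
        if kind == 1:          # No-Operation: single padding byte, skip it
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("option truncated before its length byte")
        length = raw[i + 1]    # counts the type and length bytes as well
        if length < 2 or i + length > len(raw):
            raise ValueError("invalid option length")
        options.append((kind, raw[i + 2 : i + length]))
        i += length
    return options


# Hypothetical SYN options: MSS 1460 (type 2), NOP, window scale 2 (type 3),
# NOP, NOP, SACK-permitted (type 4).
sample = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 2, 1, 1, 4, 2])
parsed = parse_tcp_options(sample)
print(parsed)
```

Each returned value has Length - 2 bytes, so the SACK-permitted option (Length = 2) yields an empty value.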
The Window Scale Option
This section presents a broad overview of how the Large Window option works.
A TCP receiver maintains a variable called the receive window (rcv_wnd), which indicates how much data the receiver can currently buffer. A TCP sender adjusts its send window (snd_wnd), the amount of data it can send, according to the advertised receive window and other congestion-related variables. Since the TCP header allocates only 16 bits for the window field (seg_wnd), the receiver can advertise at most 2^16 bytes, or 64 KB, as its receive window.
By using the window scale option, a TCP receiver can advertise a window of up to 1 GB in size. This is achieved by widening the window variables kept by TCP to 32 bits and by maintaining a scale factor, expressed as a power of 2, which is used to scale down or scale up the 16-bit window field sent or received in a TCP segment (seg_wnd).
Thus, if the scale factor is 1 and a TCP receiver wishes to advertise a window size of 100,000 bytes, it advertises a seg_wnd of 100000 / (2^1). The result is 50,000, which fits into the 16-bit window field in the TCP header. When the TCP sender reads seg_wnd from the received segment, it recovers the real receive window value of 100,000 by computing 50000 * (2^1).
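The arithmetic in this example amounts to a right shift on the advertising side and a left shift on the reading side. The following Python sketch reproduces it; the function names are hypothetical and are not taken from any TCP implementation.

```python
MAX_WINDOW_FIELD = 0xFFFF  # the window field in the TCP header is 16 bits wide


def window_field_to_send(actual_window: int, scale_factor: int) -> int:
    """Scale the receiver's real window down to fit the 16-bit header field."""
    return min(actual_window >> scale_factor, MAX_WINDOW_FIELD)


def window_from_field(window_field: int, scale_factor: int) -> int:
    """Scale a received 16-bit window field back up to the real window."""
    return window_field << scale_factor


field = window_field_to_send(100_000, 1)
restored = window_from_field(field, 1)
print(field, restored)
```

Because the low-order bits are shifted away, a window that is not a multiple of 2^(scale factor) is rounded down when it is advertised.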
There are a couple of points to note here:
- The scale factor is static and cannot change over the period of the TCP connection.
- Even though a 32-bit space is now allowed for the actual window size, the maximum window is not 4 GB (2^32) but 1 GB (2^30). Because sequence numbers themselves occupy a 32-bit space, this limit lets a TCP receiver uniquely identify an incoming data segment as a new segment: old and new segments with the same sequence number cannot arrive out of turn and confuse the receiver.
- In the event of even larger pipes coming into existence in the future, the window size cannot be increased beyond 1 GB as long as sequence numbers are 32 bits in length. Thus, the window scale technique lets TCP scale up to 1 GB, but still fatter pipes would once again be under-utilized. Such fat pipes would surely require a drastic overhaul of TCP.
Finally, let's look at how the scale factor is transmitted for the TCP connection. The window scale option conforms to the type, length, and value fields as discussed earlier. The values for the Large Window option are given in the following table:
Type = 3 | Length = 3 | Value = Scale Factor
These three bytes are transmitted as a TCP option during the initial three-way handshake done to establish the connection. Thus, both the ends of the TCP connection inform the other what would be the receive window scale factor for the period of the TCP connection. The TCP traces shown in Figures 1 and 2 show a scale factor of 0 being negotiated in the three-way handshake.
Figure 1: Window scale negotiation in the TCP SYN.
Figure 2: Window scale negotiation in the TCP SYN-ACK.
Thus, this option may be sent in the initial SYN segment. For backward compatibility reasons, it may also be sent in a SYN-ACK segment, but only if a window scale option was received in the initial SYN segment.
If the side of a TCP connection that is doing the passive open does not support the LW option, it will ignore the received TCP option and will not send an LW option in reply. In that case, the sender of the first SYN segment should disable use of the LW option for its own receive window.
An LW option found in a non-SYN segment should be ignored. Also, the window field in the SYN segments should not be scaled.
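The negotiation rules above can be summarized in a few lines. The Python sketch below is a simplified model, not NetWare's implementation: the function names are hypothetical, and the cap of 14 on the scale factor comes from RFC 1323 rather than from this AppNote.

```python
from typing import Optional

WINDOW_SCALE_TYPE = 3
WINDOW_SCALE_LENGTH = 3
MAX_SCALE_FACTOR = 14  # upper limit defined in RFC 1323


def encode_window_scale(scale_factor: int) -> bytes:
    """Build the three-byte window scale option carried in a SYN or SYN-ACK."""
    if not 0 <= scale_factor <= MAX_SCALE_FACTOR:
        raise ValueError("scale factor must be between 0 and 14")
    return bytes([WINDOW_SCALE_TYPE, WINDOW_SCALE_LENGTH, scale_factor])


def negotiate(local_scale: int, peer_scale: Optional[int]) -> tuple[int, int]:
    """Return (receive scale, send scale) to use for the connection.

    peer_scale is None when the peer's SYN or SYN-ACK carried no window
    scale option, in which case scaling is disabled in both directions.
    """
    if peer_scale is None:
        return (0, 0)
    return (local_scale, min(peer_scale, MAX_SCALE_FACTOR))


print(encode_window_scale(0).hex())
print(negotiate(2, None), negotiate(2, 7))
```

Each side applies its own scale factor to the windows it advertises and the peer's scale factor to the windows it receives, so the two directions can use different values.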
The Selective Acknowledgement Option
TCP implements reliability by sending an acknowledgment for data segments received. This is a cumulative acknowledgement scheme where only in-sequence data segments are acknowledged. In case of loss, subsequent segments received successfully after the segment loss are not acknowledged in this scheme. Hence, the sender has no way of knowing which segments were successfully received after a loss. Therefore, the sender is forced to retransmit all segments after the segment loss is detected by a retransmission timeout.
This leads to retransmission of segments which were actually successfully received by the receiver, which is a waste of network bandwidth. Also, the retransmission timeout causes the congestion window to fall drastically and future transmissions are made at a slower rate than before.
Figure 3 describes a TCP trace where in-order bytes up to the sequence number of 2028597920 are received correctly, at which point there is a segment loss. Unaware of the loss, the sender continues to send data up to 2028605220, at which point it retransmits the lost segment and the entire pipe of data up to 2028605220 again. This results in retransmission of five packets which were actually successfully received.
Figure 3: Typical retransmission and draining of pipe.
By using the Selective Acknowledgement scheme, a receiver can selectively acknowledge segments that were received after the loss. The sender then needs only to retransmit the lost segments. These lost segments or packets are also referred to as "holes" in the data stream.
SACK uses two TCP options. The first is an enabling option, "SACK-permitted", which may be sent in a SYN segment to indicate that the SACK option can be used once the connection is established. The other is the SACK option itself, which may be sent over an established connection only when SACK is permitted by both ends of the connection during connection establishment.
The SACK option is to be included in a segment sent from a TCP node that is receiving data to the TCP node that is sending that data.
The 2-byte SACK-permitted option has the following format:
Type = 4 | Length = 2
Figures 4 and 5 show the SACK-permitted option being negotiated during the initial three-way handshake.
Figure 4: SACK-permitted option in the TCP SYN.
Figure 5: SACK-permitted option in the TCP SYN-ACK.
The SACK option has the following format:
Pad byte or other option byte | Pad byte or other option byte | Type = 5 | Length
Left edge of first block
Right edge of first block
. . .
Left edge of nth block
Right edge of nth block
To explain a little about the SACK option: each block in the option describes one contiguous range of data bytes that has been received and queued out of order, beyond the point covered by the cumulative acknowledgement. A block is first reported when non-contiguous data arrives after a loss. The gap between the acknowledged data and a block, or between two consecutive blocks, indicates a "hole" or data loss.
The left edge of a block is the 32-bit sequence number of the first data byte of this block. The right edge of the block is the 32-bit sequence number of the data byte immediately following the last data byte in this block.
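To make the layout concrete, the Python sketch below packs a list of (left edge, right edge) pairs into the option format shown above. It is an illustration under stated assumptions: the function name is hypothetical, and the two leading pad bytes are assumed to be no-operation options (type 1), which keeps the 32-bit edges aligned.

```python
import struct

SACK_TYPE = 5
NOP = 1  # no-operation option, assumed here as the pad byte


def encode_sack_option(blocks: list[tuple[int, int]]) -> bytes:
    """Pack SACK blocks as two pad bytes, type, length, then 32-bit edges."""
    if not 1 <= len(blocks) <= 4:
        raise ValueError("a SACK option carries between one and four blocks")
    # Each edge is a 32-bit sequence number in network byte order.
    body = b"".join(struct.pack("!II", left, right) for left, right in blocks)
    length = 2 + len(body)  # type byte + length byte + 8 bytes per block
    return bytes([NOP, NOP, SACK_TYPE, length]) + body


one_block = encode_sack_option([(300, 500)])
print(one_block.hex())
```

Note that the right edge is the sequence number just past the last received byte, so the block (300, 500) covers bytes 300 through 499.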
A SACK option with n blocks would require 2 + 8n bytes. Since TCP options can be up to 40 bytes in length, a maximum of four SACK blocks can be sent in one TCP segment. If the timestamp option is present, only three SACK blocks can be fit into one TCP segment.
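The block limits quoted above follow directly from the 2 + 8n formula. A minimal Python check, assuming the timestamp option occupies 10 bytes (its length in RFC 1323), or 12 bytes when it is padded with two no-operation bytes:

```python
MAX_TCP_OPTION_BYTES = 40


def max_sack_blocks(other_option_bytes: int = 0) -> int:
    """Largest n such that 2 + 8n fits in the remaining option space."""
    available = MAX_TCP_OPTION_BYTES - other_option_bytes
    return max((available - 2) // 8, 0)


print(max_sack_blocks())      # SACK is the only option in the segment
print(max_sack_blocks(10))    # timestamp option present
print(max_sack_blocks(12))    # timestamp option padded to 12 bytes
```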
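The table below illustrates an example scenario where data segments of 100 bytes each are being sent. A lost segment is marked "Lost" in the ACK column, since it triggers no acknowledgement. The SACK blocks carried in each acknowledgement are described in the corresponding "First Block" and "Second Block" columns. The table lists the blocks in ascending sequence order; on the wire, RFC 2018 places the block containing the most recently received segment first, as in Figure 7.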
Columns two through six describe the acknowledgement segment.
Triggering Data Segment | ACK | First Block Left Edge | First Block Right Edge | Second Block Left Edge | Second Block Right Edge
100 | 200 | - - | - - | - - | - -
200 | Lost | - - | - - | - - | - -
300 | 200 | 300 | 400 | - - | - -
400 | 200 | 300 | 500 | - - | - -
500 | Lost | - - | - - | - - | - -
600 | 200 | 300 | 500 | 600 | 700
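The scenario in the table can be reproduced with a small simulation. The Python sketch below is a simplified model rather than a real TCP receiver: the function names are hypothetical, segments are identified only by their starting sequence number, and blocks are kept in ascending sequence order as in the table (not in the most-recent-first order used on the wire). The second function shows how a sender could derive the holes from an acknowledgement.

```python
def receive(arrivals, segment_size=100, first_expected=100):
    """Return (segment, ack, blocks) for each segment that arrives."""
    ack = first_expected   # next in-order byte the receiver expects
    blocks = []            # disjoint (left, right) ranges queued beyond ack
    results = []
    for seq in arrivals:
        blocks.append((seq, seq + segment_size))
        blocks.sort()
        merged = []
        for left, right in blocks:
            if merged and left <= merged[-1][1]:
                # Adjacent or overlapping ranges form one contiguous block.
                merged[-1] = (merged[-1][0], max(merged[-1][1], right))
            else:
                merged.append((left, right))
        # Data contiguous with the acknowledged stream advances the ACK.
        while merged and merged[0][0] <= ack:
            ack = max(ack, merged[0][1])
            merged.pop(0)
        blocks = merged
        results.append((seq, ack, list(blocks)))
    return results


def find_holes(ack, blocks):
    """Return the (start, end) ranges the sender still has to retransmit."""
    holes = []
    start = ack
    for left, right in sorted(blocks):
        if left > start:
            holes.append((start, left))
        start = max(start, right)
    return holes


# Segments 200 and 500 are lost, so they never arrive.
trace = receive([100, 300, 400, 600])
for seq, ack, blocks in trace:
    print(seq, ack, blocks, find_holes(ack, blocks))
```

For the last acknowledgement, the sender learns that only the ranges 200 to 300 and 500 to 600 are missing, so it can retransmit two segments instead of everything from 200 onward.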
The next few figures show the actual SACK option format, in the cases of single and multiple segment losses.
Figure 6 denotes an ACK segment that indicates a loss of bytes between sequence numbers 438514567 and 438517487.
Figure 6: SACK option with one block.
Figure 7 denotes two holes and hence two SACK blocks. The most recent SACK block appears before the older SACK block.
Figure 7: SACK option with two blocks (continued in Figure 8).
Figure 8 shows the right edge of the most recent SACK block in Figure 7 increasing as more non-contiguous data is received by the receiver and is acknowledged.
Figure 8: SACK option with two blocks (continued from Figure 7).
Figures 9 and 10 show three and four SACK blocks occurring in the TCP ACK segments respectively.
Figure 9: SACK option with three blocks.
Figure 10: SACK option with four blocks.
How Applications Can Benefit from These TCP Options
In the TCP LW option, the window scale factor is calculated based on the receive window set by the application. If the receive window is less than 64 KB, the scale factor will be 0. In that case, even though an LW option is negotiated for the application's connection, no improvement in performance will be experienced.
Hence, applications would need to select a receive buffer size larger than 64 KB to avail themselves of the TCP LW option feature. On the other hand, increasing the buffer size to a value greater than the (bandwidth * delay) product would lead to congestion and cause performance degradation. Therefore, when using the LW option, the (bandwidth * delay) product of the target network has to be kept in consideration. If this value is not known, a less aggressive window size (or receive buffer size) is preferred over a highly aggressive window size.
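The relationship between the receive buffer size and the scale factor can be sketched as follows. This Python example is one plausible calculation, not necessarily the one NetWare uses: it picks the smallest scale factor that lets the buffer size fit in the 16-bit window field, up to the limit of 14 set by RFC 1323. The function name is hypothetical.

```python
MAX_WINDOW_FIELD = 0xFFFF   # 16-bit window field
MAX_SCALE_FACTOR = 14       # upper limit defined in RFC 1323


def scale_factor_for(receive_buffer_bytes: int) -> int:
    """Smallest scale factor whose scaled window fits in 16 bits."""
    scale = 0
    while scale < MAX_SCALE_FACTOR and (receive_buffer_bytes >> scale) > MAX_WINDOW_FIELD:
        scale += 1
    return scale


for size in (32 * 1024, 100_000, 1024 * 1024):
    print(size, scale_factor_for(size))
```

A 32 KB buffer yields a scale factor of 0, which matches the observation above that such a connection gains nothing from the LW option.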
The SACK option requires additional TCP processing for the receiver and more so for the data sender. Hence, on networks with lower (bandwidth * delay) products, significant performance improvement cannot be expected even in case of data loss. On fatter pipes, SACK shows a significant improvement in performance over non-SACK TCP, but only in the event of data loss.
Not all applications can benefit from both these TCP options. Obviously, there is no performance benefit in merely increasing the window size when there is not enough data to fill that window. Thus, request-response types of applications such as remote Telnet will not experience any benefit. However, bulk data transfer applications such as FTP (file transfer protocol), remote backup, video broadcasting, proxy servers, and Web servers will see a considerable improvement in performance from both LW and SACK.
Support for These TCP Options on NetWare
These two TCP options are supported in NetWare 6 and in all future versions of NetWare. They can be set using the SET command on NetWare. The SET parameters are:
Set TCP Large Window Option = ON | OFF
Set TCP Sack Option = ON | OFF
Both parameters are ON by default.
Note: Keeping these options ON does not cause performance degradation in NetWare-based LAN environments. Significant performance benefit is expected on high bandwidth WAN type of environments with bulk data transfer applications.
NetWare application developers can easily take advantage of these options. SACK is automatically triggered by TCP and requires no change from the application end. To make use of the Large Window option, application developers can specify a receive buffer size greater than 64 KB. However, the guidelines given in the previous section must be considered when designing the upper limit on the maximum receive buffer size.
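As an illustration of requesting a larger receive buffer, the sketch below uses the generic BSD-style socket interface as exposed by Python. This is an assumption made for the example: the AppNote does not show the NetWare programming interface, and the 256 KB figure is a hypothetical value that should in practice be derived from the (bandwidth * delay) product of the target path.

```python
import socket

# Hypothetical size; choose it from the (bandwidth * delay) product of the path.
REQUESTED_RECEIVE_BUFFER = 256 * 1024

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Request the buffer before connect() or listen(): the scale factor is fixed
# during the three-way handshake and cannot change afterwards.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, REQUESTED_RECEIVE_BUFFER)

# The operating system may round, double, or cap the requested value.
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"requested {REQUESTED_RECEIVE_BUFFER} bytes, granted {granted} bytes")

sock.close()
```

Reading the value back is worthwhile because a buffer that the system silently caps at or below 64 KB would leave the scale factor at 0.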
Conclusion
NetWare 6 and future versions provide support for two performance extensions to the TCP protocol and conform to the standards specified by the IETF. These options, Large Window and Selective Acknowledgement, provide benefit in LFNs. The applications that benefit most from these options are TCP bulk data transfer applications.
* Originally published in Novell AppNotes