Subj : Enabling High Performance Data Transfers on Hosts
-------------------------------------------------------------------------------
Enabling High Performance Data Transfers on Hosts
Introduction
In order to take advantage of today's high speed networks, hosts must
support and utilize extensions to basic TCP/IP. There are four main
steps required for both the data sender and data receiver:
1. The host systems must use Path MTU Discovery (RFC1191). This
allows systems to use the largest possible packet size, rather
than the default of 512 bytes. On most systems, this feature must
be explicitly enabled by the system administrator. If Path MTU
Discovery is unavailable or undesired, it is sometimes possible to
trick the system into using large packets, but this may have
undesirable side effects.
2. The host systems must support RFC1323 "Large Windows" extensions
to TCP. These extensions enable new features in the TCP/IP
protocols needed for high speed transfers. On some systems,
RFC1323 extensions are included but may require the system
administrator to explicitly turn them on.
3. The host system must support large enough socket buffers for
reading and writing data to the network. Typical Unix systems
include a default maximum value for the socket buffer size between
128 kB and 1 MB. For many paths, this is not enough, and must be
increased. (Without RFC1323 "Large Windows", TCP/IP does not allow
applications to buffer more than 64 kB in the network, which is
inadequate for almost all high speed paths.)
4. The application must set its send and receive socket buffer sizes
(at both ends) to at least the bandwidth*delay product of the
link. (See [1]computing bandwidth*delay products below). Some user
applications support options for the user to set the socket buffer
size (for example, Cray UNICOS FTP); many do not. There are
several modified versions of applications available which support
large socket buffer sizes.
+ Retrieve [2]an example of a user tunable version of WU-FTP
and NCFTP from NCAR.
+ NLANR/NCNE maintains a [3]tool repository which includes
application enhancements for several versions of FTP and
rsh. Also included on this site is the nettune library for
performing such enhancements yourself.
Alternatively, the system-wide default socket buffer size can be
raised, causing all applications to utilize large socket buffers.
This is not generally recommended, as many network applications
then consume system memory which they do not require.
New: The best solution would be for the operating system to
automatically tune socket buffers to the appropriate size. Jeff
Semke at PSC has developed an experimental [4]Autotuning
Implementation for NetBSD which does exactly this. In the future,
we hope to see such automatic tuning as a part of all TCP
implementations, making this entire website obsolete.
For socket applications, the programmer can choose the socket
buffer sizes using a setsockopt() system call. A [5]Detailed Users
Guide describing how to set socket buffer sizes within socket
based applications has been put together by Von Welch at NCSA.
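As a minimal sketch of the setsockopt() call mentioned above, the
following Python fragment requests send and receive buffers of a given
size on a TCP socket and reads back what the operating system actually
granted. The function name open_tuned_socket is an illustrative
assumption and does not come from any of the guides referenced here. The
granted size may be smaller than the request if it exceeds the system's
maximum socket buffer size, and some systems reject an oversized request
with an error instead.

```python
import socket

def open_tuned_socket(buffer_bytes):
    """Create a TCP socket and request send/receive buffers of buffer_bytes.

    Returns the socket together with the sizes the OS reports back, which
    may differ from the request (clamped to the system maximum, or rounded).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set the sizes before connect() or listen(), so that the window scale
    # factor negotiated at connection setup can reflect them.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    granted_snd = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    granted_rcv = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted_snd, granted_rcv
```

Both ends of the connection would call something like this with a value
of at least the bandwidth*delay product of the path, and compare the
granted sizes against the request to see whether the system maximum
needs to be raised.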
In addition to these four steps, TCP Selective Acknowledgments (SACK)
are in the process of being standardized (RFC2018). SACKs allow for
further improvements to the efficiency of TCP, both for high bandwidth
networking needs, and in cases of heavy congestion. For completeness,
SACK information is included in the table below. Further information
on commercial and experimental implementations of SACK is available at
[6]http://www.psc.edu/networking/all_sack.html.
_________________________________________________________________
Support for these features under various operating systems
Operating System (Alphabetical) (Click for additional info) RFC1191
Path MTU Discovery RFC1323 Support Default maximum socket buffer size
Default TCP socket buffer size Default UDP socket buffer size
Applications (if any) which are user tunable RFC2018 SACK Support
[7]More info
[8]BSD/OS 2.0 No Yes 256kB 8kB 9216 snd 41600 rcv None [9]Hari
Balakrishnan's BSD/OS 2.1 implementation
[10]BSD/OS 3.0 Yes Yes 256kB 8kB 9216 snd 41600 rcv None
ConvexOS 11.0 Yes 2400kB
[11]CRI Unicos 8.0 Yes Yes FTP
[12](Compaq) Digital Unix 3.2 Yes Winscale, No Timestamps 128kB 32kB
None
[13](Compaq) Digital Unix 4.0 Yes Yes Winscale, No Timestamps 128kB
32kB 9216 snd 41600 rcv None [14]PSC Research version
[15]FreeBSD 2.1.5 Yes Yes 256kB 16kB 40kB None [16]Luigi Rizzo's
FreeBSD2.1R version
Also Eliot Yan of UCLA has one
[17]FTP Software OnNet Kernel 4.0 for Win95/98 Yes Yes 963.75 MB 8K
[146K for Satellite tuning] 8K send 48K recv FTP server Yes
[18]HPUX 9.X No 9.05 and 9.07 provide patches for RFC1323 1 MB (?)
8kB 9216 FTP (with patches)
[19]HPUX 10.{00,01,10,20,30} Yes Yes 256kB 32kB 9216 FTP
[20]HPUX 11 Yes Yes >31MB? 32kB 65535 FTP
[21]IBM AIX 3.2 & 4.1 No Yes 64kB 16kB 41600 Bytes receive/9216 Bytes
send None
[22]IBM MVS TCP stack by Interlink, v2.0 or greater No Yes 1MB
Linux 2.0.x Yes No (under development for 2.1.x) 32kB 32kB 32kB
None Available from Theodoros Assimakopoulos (thass@ee.tu-berlin.de)
[23]Linux 2.1.90 or later, including Linux 2.2. Yes Yes 64kB 32kB
(see [24]notes) 32kB(?) None SACK (and FACK?) are now part of the 2.1
distribution
[25]MacOS (Open Transport) Yes No
[26]Microsoft Windows NT 3.5/4.0 Yes No 64kB max(~8kB, min(4*MSS,
64kB)) No
Microsoft Windows NT 5.0 Beta Yes Yes
[27]Microsoft Win95 [28]Patch is available with many improvements to
networking support. I have not tried out this patch, but I imagine the
tuning instructions for Win98 will be helpful if you use it.
[29]Microsoft Win98 Yes 1GB(?!) 8kB Yes (on by default)
[30]NetBSD 1.1/1.2 No Yes 256kB 16kB None [31]PSC Research version
Novell Netware5 Yes No 64kB 31kB None
[32]SGI IRIX 5.3 Yes Yes 512kB 60kB None
[33]SGI IRIX 6.1 Yes Yes 1MB 60kB None
[34]SGI IRIX 6.2 Yes Yes Unlimited 60kB None
[35]SGI IRIX 6.5 Yes Yes Unlimited 60kB 60kB None Yes, as of 6.5.7.
It is on by default.
[36]SunOS 4.1.4 No No. However, can be purchased as a Sun Consulting
Special. 52kB 4kB 9000 bytes Send, 18032 bytes Receive None
[37]Sun Solaris 2.5 Yes No. However, can be purchased as a Sun
Consulting Special, and will be in Solaris 2.6 256kB 8kB 8kB None
[38]Sun Solaris 2.6 Yes Yes 1MB TCP, 256kB UDP 8kB 8kB None Yes,
[39]experimental patch from Sun
[40]Sun Solaris 7 Yes Yes 1MB TCP, 256kB UDP 8kB 8kB None Yes; default
is "passive". (See [41]below)
_________________________________________________________________
Computing Bandwidth*Delay Products
The peak bandwidth of the link is typically
expressed in Mbit/s, and for the vBNS network is approximately 120 Mbit/s. The
round-trip delay for a link can be measured with traceroute, and for high-speed
WAN links is typically between 10 msec and 100 msec. For a 60 msec, 120 Mbit/s
path, the bandwidth*delay product would be 7200 kbit, or 900 kByte.
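The arithmetic above can be sketched in a few lines of Python. The
function name and its units (Mbit/s and msec, with 1 Mbit taken as 10**6
bits and 1 kByte as 1000 bytes) are assumptions chosen to match the
example figures; this is only an illustration of the calculation.

```python
def bandwidth_delay_product(bandwidth_mbit_per_s, rtt_ms):
    """Return the bandwidth*delay product in bytes.

    bandwidth_mbit_per_s: peak path bandwidth in Mbit/s (1 Mbit = 10**6 bits)
    rtt_ms: round-trip time in milliseconds
    """
    # Mbit/s * msec = 10**6 * 10**-3 bits = 1000 bits
    bits = bandwidth_mbit_per_s * 1000 * rtt_ms
    return bits / 8  # 8 bits per byte

bdp = bandwidth_delay_product(120, 60)
print(f"{bdp * 8 / 1000:.0f} kbit = {bdp / 1000:.0f} kByte")
# prints: 7200 kbit = 900 kByte
```

The result is the minimum socket buffer size (at both ends) needed to
keep such a path full.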
_________________________________________________________________
Additional detailed procedures for system tuning under various operating
systems
Procedure for raising network limits under BSD/OS 2.1 and 3.0 (BSDi)
MTU discovery is now supported in BSD/OS 3.0. RFC1323 is also
supported, and the procedure for setting the relevant kernel variable
uses the "sysctl" interface described for [42]FreeBSD. See sysctl(1)
and sysctl(3) for more information.
_________________________________________________________________
Procedure for raising network limits on CRI systems under Unicos 8.0
System configuration parameters are tunable via the command
"/etc/netvar". Running "/etc/netvar" with no arguments shows all
configurable variables:
% /etc/netvar
Network configuration variables
tcp send space is 32678
tcp recv space is 32678
tcp time to live is 60
tcp keepalive delay is 14400
udp send space is 65536
udp recv space is 68096
udp time to live is 60
ipforwarding is on
ipsendredirects is on
subnetsarelocal is on
dynamic MTU discovery is on
adminstrator mtu override is on
maximum number of allocated sockets is 3750
maximum socket buffer space is 409600
operator message delay interval is 5
per-session sockbuf space limit is 0
The following variables can be set:
* dynamic MTU discovery: This is "off" by default and should be
changed to "on".
* maximum socket buffer space: This should be set to the desired
maximum socket buffer size (in bytes).
* tcp send space, tcp recv space: These are the default buffer sizes
used by applications. These should be changed with caution.
Once variables have been changed by /etc/netvar, they take effect
immediately for new processes. Processes which are already running
with open sockets are not modified.
_________________________________________________________________
Procedure for raising network limits on (Compaq) DEC Alpha systems under
Digital Unix 3.2c
* By default, the maximum allowable socket buffer size on this
operating system is 128kB.
* In order to raise this maximum, you must increase the kernel
variable sb_max. In order to do this, run the following commands
as root:
# dbx -k /vmunix
(dbx) assign sb_max = (u_long) 524288
(dbx) patch sb_max = (u_long) 524288
In this example, sb_max is increased to 512kB. The first command
changes the variable for the running system, and the second
command patches the kernel so it will continue to use the new
value, even after rebooting the system. Note, however, that
reinstalling (overwriting) the kernel will undo this change.
* The Digital Unix manuals also recommend increasing mbclusters to
at least 832.
* Standard applications do not have a mechanism for setting the
socket buffer size to anything but the default. However, you can
change the kernel default by modifying the kernel variables
(tcp_sendspace, tcp_recvspace)
# /sbin/sysconfig -r inet tcp_sendspace 65536
# /sbin/sysconfig -r inet tcp_recvspace 65536
* Specific advice for tuning (Compaq) Digital UNIX systems (for both
V4.0 releases and many of the V3.2x releases) may be found at
[43]http://www.unix.digital.com/internet/tuning.htm
This document contains information on other important parameters
(not just the ones directly associated with the socket, IP, and
TCP layers) and gives instructions on how to modify things. It
also includes important patch information, and is updated every
few months.
_________________________________________________________________
Procedure for raising network limits under FreeBSD 2.1.5
MTU discovery is on by default in FreeBSD past 2.1.0-RELEASE. If you
wish to disable MTU discovery, the only way that we know is to lock an
interface's MTU, which disables MTU discovery on that interface.
You can't modify the maximum socket buffer size in FreeBSD
2.1.0-RELEASE, but in 2.2-CURRENT you can use
sysctl -w kern.maxsockbuf=524288
to make it 512kB (for example). You can also set the TCP and UDP
default buffer sizes using the variables
net.inet.tcp.sendspace
net.inet.tcp.recvspace
net.inet.udp.recvspace
_________________________________________________________________
Procedure for raising network limits under FTP Software OnNet 4.0 for
Win95/98
OnNet Kernel has a check box "Enable Satellite tuning" which was
intended for and tested on a 2 Mbit/s satellite link with 600 ms delay.
This sets the TCP window to 146K.
Many default settings, all of the above and more, may be overridden
with registry entries. We plan to make tuning guidelines available at
"some future time". The default TCP window may also be set with the
Statistics app, which is installed with OnNet Kernel.
The product "readme" discusses changing TCP window size and Initial
slow start threshold with the Windows registry.
Statistics also has interesting graphs of TCP/UDP/IP/ICMP traffic.
The IPtrace app is also shipped with OnNet Kernel to view unicast /
multicast / broadcast traffic (no unicast traffic for other hosts, since
it does not run in promiscuous mode).
_________________________________________________________________
Procedure for raising network limits under HPUX 9.X
HP-UX 9.X does not support Path MTU discovery.
There are patches for 9.05 and 9.07 that provide 1323 support. To
enable it, one must poke the kernel variables tcp_dont_tsecho and
tcp_dont_winscale to 0 with adb (the patch includes a script, but I
don't recall the patch number).
Without the 9.05/9.07 patch, the maximum socket buffer size is
somewhere around 58254 bytes. With the patch it is somewhere around
1MB (there is a small chance it is as much as 4MB).
The FTP provided with the up to date patches should offer an option
to change the socket buffer size. The default socket buffer size for
this could be 32KB or 56KB.
There is no support for SACK in 9.X.
Procedure for raising network limits under HPUX 10.X
HP-UX 10.00, 10.01, 10.10, 10.20, and 10.30 support Path MTU
discovery. It is on by default for TCP, and off by default for UDP.
On/Off can be toggled with nettune.
Up through 10.20, RFC 1323 support is like the 9.05 patch, except the
maximum socket buffer size is somewhere between 240 and 256KB. In
other words, you need to do the same adb "pokes" as described above.
10.30 does not require adb "pokes" to enable RFC1323. 10.30 also
replaces nettune with ndd. The 10.X default TCP socket buffer size is
32768, the default UDP remains unchanged from 9.X. Both can be tweaked
with nettune.
FTP should be as it is in patched 9.X.
There is no support for SACK in 10.X up through 10.20.
Procedure for raising network limits under HPUX 11
HP-UX 11 supports PMTU discovery and enables it by default. This is
controlled through the ndd setting ip_pmtu_strategy.
RFC 1323 support is enabled automagically in HP-UX 11. If an
application requests a window/socket buffer size greater than 64 KB,
window scaling and timestamps will be used automatically.
The default TCP window size in HP-UX 11 remains 32768 bytes and can be
altered through ndd using the settings:
tcp_recv_hiwater_def
tcp_recv_hiwater_lfp
tcp_recv_hiwater_lnp
tcp_xmit_hiwater_def
tcp_xmit_hiwater_lfp
tcp_xmit_hiwater_lnp
FTP in HP-UX 11 uses the new sendfile() system call. This allows data
to be sent directly from the filesystem buffer cache through the
network without intervening data copies.
Support for SACK in HP-UX 11 is currently (2/26/99) under
investigation.
Here is some ndd -h parm output for a few of the settings mentioned
above. For those not mentioned, use ndd -h on an HP-UX 11 system, or
consult the online manuals at [44]http://docs.hp.com/
# ndd -h ip_pmtu_strategy
ip_pmtu_strategy:
Set the Path MTU Discovery strategy: 0 disables Path MTU
Discovery; 1 enables Strategy 1; 2 enables Strategy 2.
Because of problems encountered with some firewalls, hosts,
and low-end routers, IP provides for selection of either
of two discovery strategies, or for completely disabling the
algorithm. The tunable parameter ip_pmtu_strategy controls
the selection.
Strategy 1: All outbound datagrams have the "Don't Fragment"
bit set. This should result in notification from any intervening
gateway that needs to forward a datagram down a path that would
require additional fragmentation. When the ICMP "Fragmentation
Needed" message is received, IP updates its MTU for the remote
host. If the responding gateway implements the recommendations
for gateways in RFC 1191, then the next hop MTU will be included
in the "Fragmentation Needed" message, and IP will use it.
If the gateway does not provide next hop information, then IP
will reduce the MTU to the next lower value taken from a table
of "popular" media MTUs.
Strategy 2: When a new routing table entry is created for a
destination on a locally connected subnet, the "Don't Fragment"
bit is never turned on. When a new routing table entry for a
non-local destination is created, the "Don't Fragment" bit is
not immediately turned on. Instead,
o An ICMP "Echo Request" of full MTU size is generated and
sent out with the "Don't Fragment" bit on.
o The datagram that initiated creation of the routing table
entry is sent out immediately, without the "Don't Fragment"
bit. Traffic is not held up waiting for a response to the
"Echo Request".
o If no response to the "Echo Request" is received, the
"Don't Fragment" bit is never turned on for that route;
IP won't time-out or retry the ping. If an ICMP "Fragmentation
Needed" message is received in response to the "Echo Request",
the Path MTU is reduced accordingly, and a new "Echo Request"
is sent out using the updated Path MTU. This step repeats as
needed.
o If a response to the "Echo Request" is received, the
"Don't Fragment" bit is turned on for all further packets
for the destination, and Path MTU discovery proceeds as for
Strategy 1.
Assuming that all routers properly implement Path MTU Discovery,
Strategy 1 is generally better - there is no extra overhead for the
ICMP "Echo Request" and response. Strategy 2 is available
only because some routers, or firewalls, or end hosts have been
observed simply to drop packets that have the DF bit on without
issuing the "Fragmentation Needed" message. Strategy 2 is more
conservative in that IP will never fail to communicate when using
it. [0,2] Default: Strategy 2
# ndd -h tcp_recv_hiwater_def | more
tcp_recv_hiwater_def:
The maximum size for the receive window. [4096,-]
Default: 32768 bytes
# ndd -h tcp_xmit_hiwater_def
tcp_xmit_hiwater_def:
The amount of unsent data that triggers write-side flow control.
[4096,-] Default: 32768 bytes
HP has detailed networking performance information online, including
information about the "netperf" tool and a large database of system
performance results obtained with netperf:
Procedure for raising network limits on IBM RS/6000 systems under AIX 3.2 or
AIX 4.1
RFC1323 options and defaults are tunable via the "no" command.
See the "no" man page for options; additional information is available
in the IBM manual AIX Versions 3.2 and 4.1 Performance Tuning Guide,
which is available on AIX machines through the InfoExplorer hypertext
interface.
_________________________________________________________________
Procedure for raising network limits on IBM MVS systems under the Interlink
TCP stack
The default send and receive buffer sizes are specified at startup,
through a configuration file. The range is from 4K to 1MByte. The
syntax is as follows:
* TCP SCALE(4) - specifies to support window scaling of 4 bits.
Range is 0 (suppress both window scaling and timestamps) to 14
bits.
If SCALE is not zero, and the user bufferspace is > 65535,
negotiating window scaling and timestamps will be attempted.
If SCALE is not zero, and the remote user negotiates window
scaling or timestamps, we will accept those requests.
* FTP IBUF(4 20480) - would specify a receive bufferspace of 81920
bytes, and thus eligible for window scaling and timestamps.
FTP and user programs can be configured to use Window Scaling and
Timestamps. This is done through the use of SITE commands:
* QUOTE SITE IBUF(num size) - specifies the input bufferspace for
file transfers. When the product is larger than 65535, negotiating
window scaling and timestamps will be attempted (if SCALE is not
zero).
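The eligibility rule described above (SCALE non-zero and a bufferspace
larger than 65535 bytes) can be checked with a short Python sketch. The
function name is hypothetical, and the code only restates the rule as
written here; it does not talk to the Interlink stack.

```python
def ibuf_negotiates_scaling(num, size, scale):
    """Return (bufferspace, attempt) for an IBUF(num size) setting.

    bufferspace is num * size in bytes; attempt is True when window scaling
    and timestamps would be attempted under the rule quoted above:
    SCALE is non-zero and the bufferspace exceeds 65535 bytes.
    """
    bufferspace = num * size
    return bufferspace, (scale != 0 and bufferspace > 65535)

print(ibuf_negotiates_scaling(4, 20480, scale=4))  # (81920, True)
print(ibuf_negotiates_scaling(4, 8192, scale=4))   # (32768, False)
print(ibuf_negotiates_scaling(4, 20480, scale=0))  # (81920, False)
```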
_________________________________________________________________
Procedure for raising network limits on Linux systems for 2.1.100 or greater.
Note: Linux only allows you to use 15 bits of the TCP window field.
The net effect of this is that you need to multiply everything by 2,
or recompile the kernel without this limitation. See "Tuning at
compile time" below.
Tuning a running system
There is no sysctl application for changing values, but you can change
the values very easily with an editor like vi. Simply edit the files
listed below; writing to them changes the values in the running kernel.
Tuning the default and maximum window size:
/proc/sys/net/core/rmem_default - default receive window
/proc/sys/net/core/rmem_max - maximum receive window
/proc/sys/net/core/wmem_default - default send window
/proc/sys/net/core/wmem_max - maximum send window
In /proc/sys/net/ipv4/ you will find some other possibilities to tune
TCP:
tcp_timestamps
tcp_window_scaling
tcp_sack
...
You will find a short description
in /Documentation/networking/ip-sysctl.txt
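Instead of an editor, the same files can be read and written from a
script. The following Python sketch raises rmem_max and wmem_max if they
are below a desired value. The helper names and the base parameter are
assumptions added for illustration (base lets you try the code on a
scratch directory); writing under /proc requires root.

```python
import os

CORE_DIR = "/proc/sys/net/core"

def read_limit(name, base=CORE_DIR):
    """Return the integer currently stored in base/name."""
    with open(os.path.join(base, name)) as f:
        return int(f.read().strip())

def write_limit(name, value, base=CORE_DIR):
    """Write value to base/name (needs root when base is under /proc)."""
    with open(os.path.join(base, name), "w") as f:
        f.write(f"{value}\n")

def raise_limits(max_bytes, base=CORE_DIR):
    """Raise rmem_max and wmem_max to max_bytes if they are lower."""
    for name in ("rmem_max", "wmem_max"):
        if read_limit(name, base) < max_bytes:
            write_limit(name, max_bytes, base)
```

For example, raise_limits(1048576) run as root would allow applications
to request socket buffers of up to 1 MB. Values set this way do not
survive a reboot.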
Tuning at compile time
The defaults for all of the above values are set by a header file in the Linux
kernel source directory:
/LINUX-SOURCE-DIR/include/linux/skbuff.h
/* These are just the default values. This is run time configurable.
* FIXME: Probably the config option should go away. - erics
*/
#ifdef CONFIG_SKB_LARGE
#define SK_WMEM_MAX 65535
#define SK_RMEM_MAX 65535
#else
#define SK_WMEM_MAX 32767
#define SK_RMEM_MAX 32767
#endif
Also in the Linux kernel source directory:
/LINUX-SOURCE-DIR/include/net/tcp.h
you can change the MAX_WINDOW value
/*
* Never offer a window over 32767 without using window scaling. Some
* poor stacks do signed 16bit maths!
*/
#define MAX_WINDOW 32767
#define MIN_WINDOW 2048
This last item is what limits you to using only 15 bits of the window
field in the TCP packet header. Suppose you wish to use a window of 40
kB. If you simply set the rmem_default to 40 kB, the stack will
recognize that this is less than 64 kB and therefore will not
negotiate a winshift. However, because of this second check, you will
only get 32 kB. Therefore, you need to set the rmem_default to
something larger than 64 kB in order to force a winshift=1, which then
lets you express the desired 40 kB in only 15 bits (and in fact you'll
probably then end up with 64 kB whether you want it or not).
I imagine that a better idea is to simply change this value for
MAX_WINDOW to 65535 if you need windows larger than 32 kB. I haven't
tested this out to see how well it works. Alas, this part of the code
is somewhat hard to follow. I'd appreciate any comments on how well
this works.
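The interaction between the buffer size, the window shift, and the
MAX_WINDOW clamp can be modelled with a few lines of Python. This is a
deliberately simplified model of the behaviour described above, not a
transcription of the kernel code; the function names are hypothetical,
and the shift rule is simply the smallest RFC1323 shift that makes the
buffer fit in a 16 bit field.

```python
def window_shift(buffer_bytes):
    """Smallest RFC1323 shift (0..14) that fits buffer_bytes in 16 bits."""
    shift = 0
    while (buffer_bytes >> shift) > 65535 and shift < 14:
        shift += 1
    return shift

def effective_window(buffer_bytes, max_window=32767):
    """Largest window (bytes) that can be advertised for this buffer.

    max_window models the MAX_WINDOW clamp on the 16-bit window field.
    """
    shift = window_shift(buffer_bytes)
    field = min(buffer_bytes >> shift, max_window)
    return field << shift

for size in (40 * 1024, 80 * 1024):
    print(size, window_shift(size), effective_window(size))
# 40960 0 32767
# 81920 1 65534
print(effective_window(40 * 1024, max_window=65535))  # 40960
```

In this model a 40 kB buffer is clamped to 32 kB, a buffer above 64 kB
forces a shift of 1 and yields roughly 64 kB, and raising MAX_WINDOW to
65535 lets the 40 kB buffer through unchanged.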
User testimonial: With the tuned TCP stacks it was possible to get a
maximum throughput between 1.5 - 1.8 Mbit/s via a 2Mbit/s satellite
link, measured with netperf.
_________________________________________________________________
Information about tuning for MacOS
I don't have detailed information, however, someone pointed me to a
good website with useful information. The URL is
[46]http://www.sustworks.com/products/product_otat.html. I don't
endorse the product they are selling (since I've never tried it).
However, it is available for a free trial, and they appear to do an
excellent job of describing perf-tune issues for Macs.
_________________________________________________________________
Procedure for raising network limits under Microsoft Windows 98
New: Some folks at NLANR/MOAT in SDSC have written a tool to guide
you through some of this stuff. It can be found at
[47]http://moat.nlanr.net/Software/TCPtune/.
Even newer: I've updated some sending window information which was
inaccurate. See [48]below.
Several folks have recently helped me to figure out how to accomplish
the necessary tuning under Windows98, and the features do appear to
exist and work. Thanks to everyone for the assistance! The new
description below should be useful to even the complete Windows novice
(such as me :-).
Windows98 includes implementation of RFC1323 and RFC2018. Both are on
by default. (However, with a default buffer size of only about 8kB,
window scaling doesn't do much).
Windows stores the tuning parameters in the Windows Registry. In the
registry are settings to toggle on/off Large Windows, Timestamps, and
SACK. In addition, default socket buffer sizes can be specified in the
registry.
In order to modify registry variables, do the following steps:
1. Click on Start -> Run and then type in "regedit". This will fire
up the Registry Editor.
2. In the Registry Editor, double click on the appropriate folders to
walk the tree to the parameter you wish to modify. For the
parameters below, this means clicking on HKEY_LOCAL_MACHINE ->
System -> CurrentControlSet -> Services -> VxD -> MSTCP.
3. Once there, you should see a list of parameters in the right half
of your screen, and MSTCP should be highlighted in the left half.
The parameters you wish to modify will probably not appear in the
right half of your screen; this is OK.
4. In the menu bar, Click on "Edit -> New -> String Value". It is
important to create the parameter with the correct type. All of
the parameters listed below are strings.
5. A box will appear with "New Value #1"; change the name to the name
listed below, exactly as shown. Hit return.
6. On the menu, click on "Edit -> Modify" (your new entry should
still be selected). Then type in the value you wish to assign to
the parameter.
7. Exit the registry editor, and reboot windows. (The rebooting is
important, *sigh*.)
8. When your system comes back up, you should have access to the
features you have just turned on. The only real way to verify this
is through packet traces (or by noticing a significant performance
improvement).
TCP/IP Stack Variables
Support for TCP Large Windows (TCPLW)
Win98 TCP/IP supports TCP large windows as documented in RFC 1323. TCP
large windows can be used for networks that have large bandwidth delay
products such as high-speed trans-continental connections or satellite
links. Large windows support is controlled by a registry key value in:
HKLM\system\currentcontrolset\services\VXD\MSTCP
The registry key Tcp1323Opts is a string value type. The values for
Tcp1323Opts are
Value Meaning
0 No Windowscaling and Timestamp Options
1 Window scaling but no Timestamp options
3 Window scaling and Time stamp options
The default value for Tcp1323Opts is 3: Window Scaling and Time stamp
options. Large window support is enabled if an application requests a
Winsock socket to use buffer sizes greater than 64K. The current
default value for TCP receive window size in Memphis TCP is 8196
bytes. In previous implementations the TCP window size was limited to
64K, this limit is raised to 2**30 through the use of TCP large window
support.
Support for Selective Acknowledgements (SACK)
Win98 TCP supports Selective Acknowledgements as documented in RFC
2018. Selective acknowledgements allow TCP to recover from IP packet
loss without resending packets that were already received by the
receiver. Selective Acknowledgements is most useful when employed with
TCP large windows. SACK support is controlled by a registry key value
in:
HKLM\system\currentcontrolset\services\VXD\MSTCP
The registry key SackOpts is a string value type. The values for
SackOpts are
Value Meaning
0 No Sack options
1 Sack Option enabled
Support for Fast Retransmission and Fast Recovery
Win98 TCP/IP supports Fast Retransmission and Fast Recovery of TCP
connections that are encountering IP packet loss in the network. These
mechanisms allow a TCP sender to quickly infer a single packet loss by
reception of duplicate acknowledgements for a previously sent and
acknowledged TCP/IP packet. This mechanism is useful when the network
is intermittently congested. The reception of 3 (default value)
successive duplicate acknowledgements indicates to the TCP sender that
it can resend the last unacknowledged TCP/IP packet (fast retransmit)
and not go into TCP slow start due to a single packet loss (fast
recovery). Fast Retransmission and Recovery support is controlled by a
registry key value in:
The registry key MaxDupAcks is DWORD taking integer values from 2 to
N. If MaxDupAcks is not defined, the default value is 3.
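The duplicate-acknowledgement rule can be illustrated with a small
Python sketch. It is a simplified model written for this page, not the
Win98 implementation: it looks only at a list of cumulative ACK numbers
and ignores window updates, SACK blocks, and timers. The function name
and the max_dup_acks parameter (standing in for MaxDupAcks) are
assumptions.

```python
def fast_retransmits(acks, max_dup_acks=3):
    """Return the ACK numbers for which a fast retransmit would fire.

    acks is the sequence of cumulative ACK numbers seen by the sender.
    A retransmit fires when the same ACK number has been repeated
    max_dup_acks times after its first arrival.
    """
    fired = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == max_dup_acks:
                fired.append(ack)  # resend the segment starting at this ACK
        else:
            last_ack = ack
            dup_count = 0
    return fired

print(fast_retransmits([1000, 2000, 2000, 2000, 2000, 6000]))  # [2000]
print(fast_retransmits([1000, 2000, 2000, 3000]))              # []
```

In the first trace the segment starting at 2000 was lost, so the
receiver keeps acknowledging 2000 until the third duplicate triggers the
resend; in the second, a single duplicate (for example from reordering)
does not.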
Update: If you wish to set the default receiver window for
applications, you should set the following key:
DefaultRcvWindow
HKLM\system\currentcontrolset\services\VXD\MSTCP
DefaultRcvWindow is a string type and the value describes the default
receive windowsize for the TCP stack. Otherwise the windowsize has to
be programmed in apps with setsockopt.
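If you would rather not click through the Registry Editor by hand, the
string values described above can be collected into a text file and
imported. The Python sketch below only generates such a file; it assumes
the REGEDIT4 text format accepted by the Registry Editor's import
function, and the function name and the 256 kB window are illustrative
assumptions. I have not tested the result on a Win98 machine.

```python
MSTCP_KEY = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP"

def render_reg_file(values, key=MSTCP_KEY):
    """Render string-typed registry values as REGEDIT4 import text."""
    lines = ["REGEDIT4", "", f"[{key}]"]
    for name, value in values.items():
        # All values here are written as strings, matching the text above.
        lines.append(f'"{name}"="{value}"')
    return "\r\n".join(lines) + "\r\n"

text = render_reg_file({
    "Tcp1323Opts": 3,            # window scaling and timestamps
    "SackOpts": 1,               # SACK enabled
    "DefaultRcvWindow": 262144,  # hypothetical 256 kB default receive window
})
print(text)
```

As with manual edits, a reboot is needed before the new values take
effect.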
For a long time, I had the following sentence on this page:
* I presume that there is also a DefaultSndWindow which you would
want to use on servers sending data to get higher performance. I
have not yet verified this, however.
It turns out that there is not in fact such a variable. My limited
experience has shown that, in some cases, it is possible to see very
large send windows from Microsoft boxes. However, recent reports on
the tcpsat mailing list have also stated that a number of applications
under Windows severely limit the sending window. These applications
appear to include FTP and possibly also the CIFS protocol which is
used for file sharing. With these applications, it appears to be
impossible to exceed the performance limit dictated by this sending
window.
If anyone has any further information on these specific applications
under Windows, I would be happy to include it here.
_________________________________________________________________
Misc Info about Windows NT
Editor's note: See Windows 98 above for a detailed description of how
this all works. In NT land, the Registry Editor is called regedt32.
Any Registry Values listed appear in:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Receive Window
maximum value = 64kB, since window scaling is not supported
default value = min( max( 4 x MSS,
8kB rounded up to nearest multiple of MSS),
64kB)
Registry Value:
TcpWindowSize
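The default value formula above is easier to follow with numbers plugged
in. This Python sketch evaluates it for a few MSS values; it assumes that
"kB" means 1024 bytes, and the function name is hypothetical.

```python
def default_receive_window(mss):
    """Default receive window in bytes for a given MSS, per the rule above."""
    eight_kb = 8 * 1024
    # 8 kB rounded up to the nearest multiple of the MSS
    rounded = -(-eight_kb // mss) * mss
    return min(max(4 * mss, rounded), 64 * 1024)

print(default_receive_window(1460))  # Ethernet-sized MSS -> 8760
print(default_receive_window(536))   # -> 8576
print(default_receive_window(4312))  # FDDI-sized MSS -> 17248
```

For common MSS values the default therefore stays far below typical
bandwidth*delay products, which is why TcpWindowSize usually needs to be
raised on high speed paths.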
Path MTU Discovery Variables:
EnablePMTUDiscovery (default = enabled)
turn on/off path MTU discovery
EnablePMTUBHDetect (default = disabled)
turn on/off Black Hole detection
Using Path MTU Discovery:
EnablePMTUDiscovery REG_DWORD
Range: 0 (false) or 1 (true)
Default: 1
Determines whether TCP uses a fixed, default maximum transmission unit
(MTU) or attempts to find the actual MTU. If the value of this entry is
0, TCP uses an MTU of 576 bytes for all connections to computers outside
of the local subnet. If the value of this entry is 1, TCP attempts to
discover the MTU (largest packet size) over the path to a remote host.
Using Path MTU Discovery's "Blackhole Detection" algorithm:
EnablePMTUBHDetect REG_DWORD
Range: 0 (false) or 1 (true)
Default: 0
If the value of this entry is 1, TCP tries to detect black hole routers
while doing Path MTU Discovery. TCP will try to send segments without
the Don't Fragment bit set if several retransmissions of a segment go
unacknowledged. If the segment is acknowledged as a result, the MSS will
be decreased and the Don't Fragment bit will be set in future packets on
the connection.
I received the following additional notes about the Windows TCP
implementation.
PMTU Discovery. If PMTU is turned on, NT 3.1 cannot cope with routers
that have the BSD 4.2 bug (see RFC 1191, section 5). It loops
resending the same packet. Only confirmed on NT 3.1.
_________________________________________________________________
Misc Info about Windows 95
Editor's note: See Windows 98 above for more detailed descriptions of
how this all works. I haven't personally tested the Win95 info below.
New: A Patch is available for Win95 at the following URL:
[49]http://support.microsoft.com/support/kb/articles/q182/1/08.asp.
This patch includes support for TCPLW and SACK. I haven't tried it
out, but I assume that the info above on tuning Win98 will be useful.
Any Registry Values listed appear in:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\VxD\MSTCP
Receive Window
maximum value = 64kB, since window scaling is not supported
default value = min( max( 4 x MSS,
8kB rounded up to nearest multiple of MSS),
64kB)
(See NT for more info on using PMTU discovery and black hole
detection).
I received the following additional notes about the Windows TCP
implementation.
TCP retries. Not strictly performance related but a common cause of
TN3270 emulators dropping their sessions if the mainframe is busy for
a second or two. Instead of retrying up to 240 seconds (RFC 1122,
section 4.2.3.1), Windows 3.11 and 95 default to 5 retries without a
time limit. Even with RTO doubling, on a fast link 5 retries gives up
after less than a second of no response.
To change this, under
Hkey_Local_Machine\System\CurrentControlSet\Services\VxD\MSTCP, add the
variable MaxDataRetries. I normally set it to 64.
_________________________________________________________________
Procedure for raising network limits under NetBSD
RFC1323 is on by default in NetBSD 1.1 and above. Under NetBSD 1.2, it
can be verified to be on by typing:
sysctl net.inet.tcp.rfc1323
The maximum socket buffer size can be modified by changing SB_MAX in
/usr/src/sys/sys/socketvar.h.
The default socket buffer sizes can be modified by changing
TCP_SENDSPACE and TCP_RECVSPACE in /usr/src/sys/netinet/tcp_usrreq.c.
It may also be necessary to increase the number of mbuf clusters by
changing NMBCLUSTERS in /usr/src/sys/arch/*/include/param.h.
Update: It is also possible to set these parameters in the kernel
configuration file.
options SB_MAX=1048576 # maximum socket buffer size
options TCP_SENDSPACE=65536 # default send socket buffer size
options TCP_RECVSPACE=65536 # default recv socket buffer size
options NMBCLUSTERS=1024 # maximum number of mbuf clusters
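After rebuilding with a larger SB_MAX, one way to check the effect from an application is to request a buffer and read back what the kernel reports. This is a generic sockets sketch, not NetBSD-specific; the helper name is hypothetical, and how an oversized request is handled (refused or silently adjusted) varies between systems.

```python
import socket


def request_send_buffer(sock, size):
    """Ask for a send buffer of `size` bytes; return what the kernel reports.

    Returns None if the kernel rejects the request outright.
    """
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    except OSError:
        return None  # request refused, e.g. above the system maximum
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
```

Calling request_send_buffer(s, 1048576) on a fresh TCP socket s shows whether a 1 MB buffer is accepted. The reported value may differ from the requested one on some systems.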
_________________________________________________________________
Procedure for raising network limits under SGI systems under IRIX 5.3 or 6.1
All of the necessary kernel variables are included in the file:
/var/sysgen/master.d/bsd
The following variables are available to control high speed
transfers:
* tcp_mtudisc: To enable MTU Discovery, set the variable tcp_mtudisc
= 1. [Note: I have received reports of poor performance with MTU
discovery on the Iconwest/Phobos G100 fast ethernet card. I would
like to hear more about this problem if anyone else is
experiencing it.]
* tcp_sendspace, tcp_recvspace: To increase the default socket
buffer size for TCP, set the variables tcp_sendspace and
tcp_recvspace to the desired value (in bytes). Under IRIX 5.3, the
maximum socket buffer size allowed is 512 kB. Under IRIX 6.x this
limit has been increased to 1 MB, and under future releases it is
rumored that there will be (effectively) no limit on these TCP
socket buffer sizes. [Note: Under IRIX 6.2, the comment in the bsd
file says there is still a limit of 512 kB. We are running 1 MB on
some of our systems here, and I am searching out the answer on
what the real max is.]
* tcp_winscale controls the use of RFC1323 winshift. It is turned on
by default and need not be modified.
Once you have edited this file, you must configure a new kernel
(using /etc/autoconfig) and reboot the system with it.
Only slightly related to this page, SGI [50]Hippi performance info.
_________________________________________________________________
Procedure for raising network limits under SGI systems under IRIX 6.5
Under this version, there are two locations where configuration is
done. The BSD values are now stored in /var/sysgen/mtune/bsd.
For instance from the file:
* name default minimum maximum
*
* TCP window sizes/socket space reservation; limited to 1Gbyte by RFC 1323
*
tcp_sendspace 61440 2048 1073741824
tcp_recvspace 61440 2048 1073741824
These variables are used similarly to earlier IRIX 5 and 6 versions.
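Before editing, a desired value can be checked against the minimum and maximum columns shown in the excerpt. The helpers below are a hypothetical sketch that only understands the four-column "name default minimum maximum" layout quoted above; they are not an IRIX tool.

```python
def parse_mtune_entry(line):
    """Parse one 'name default minimum maximum' line from the excerpt."""
    name, default, minimum, maximum = line.split()
    return name, int(default), int(minimum), int(maximum)


def value_allowed(line, value):
    """True if `value` lies within the entry's minimum..maximum range."""
    _, _, minimum, maximum = parse_mtune_entry(line)
    return minimum <= value <= maximum
```

For the tcp_sendspace line above, a 1 MB buffer (1048576) is within range, while 1024 is below the 2048 minimum.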
There is also a systune command, /usr/sbin/systune, which is like
sysctl for BSD. It is what you use to view and change tuneable
values, including other networking variables.
Finally, the tcp_sendspace and tcp_recvspace can be tuned on a
per-interface basis using the rspace and sspace options to ifconfig.
Editor's note: I haven't personally used an IRIX 6.5 system, but
looking at this information, I suppose that you'd want to edit the BSD
file for a permanent kernel change which will last across reboots. For
a less permanent change, you should probably use the systune command.
I guess another way to make a permanent change would be to add
something to one of the rc files which run at boot time.
SACK: As of 6.5.7, SACK is included in the IRIX operating system and
is on by default.
_________________________________________________________________
Procedure for raising network limits under SunOS 4.1.4
The default socket buffer sizes are set in the file
/sys/netinet/in_proto.c. Edit the file and then rebuild the kernel
for changes to take effect.
_________________________________________________________________
Procedure for raising network limits under Solaris 2.5
The ndd variable tcp_xmit_hiwat is used to determine the default
SO_SNDBUF size.
The ndd variable tcp_recv_hiwat is used to determine the default
SO_RCVBUF size.
The ndd variable tcp_max_buf specifies the maximum socket buffer size.
(Note: I believe these values should be specified in bytes)
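As a rough illustration of sizing these three variables, the sketch below computes a bandwidth*delay product and builds (but does not run) the matching ndd commands. The helper names and the 100 Mbit/s, 70 ms example path are hypothetical, and setting tcp_max_buf before the two hiwat values is an assumption about ordering.

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Bandwidth*delay product in bytes."""
    return bandwidth_bits_per_sec * rtt_ms // 8000


def ndd_commands(buffer_bytes):
    """Build (not run) ndd commands for the three variables above."""
    settings = [
        ("tcp_max_buf", buffer_bytes),     # raise the ceiling first (assumed order)
        ("tcp_xmit_hiwat", buffer_bytes),  # default SO_SNDBUF
        ("tcp_recv_hiwat", buffer_bytes),  # default SO_RCVBUF
    ]
    return ["ndd -set /dev/tcp %s %d" % (name, value) for name, value in settings]
```

For a 100 Mbit/s path with a 70 ms round trip time the product is 875000 bytes. Note that changing the hiwat defaults affects every TCP connection on the host, so applications that set their own socket buffer sizes may only need tcp_max_buf raised.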
The ndd variable ip_path_mtu_discovery controls the use of path MTU
discovery. The default value is 1, which means on.
Note that ndd can also be used to increase the number of pending TCP
connection requests a machine will queue.
ndd -set /dev/tcp tcp_conn_req_max N
(where N is greater than 32 (the default) and less than or equal to
1024). This may help if your network traffic consists of many small
streams rather than just a few large streams.
In Solaris 2.6, and in 2.5 and 2.5.1 with newer TCP patches, the
tcp_conn_req_max ndd setting has been removed and split into two new
settings:
tcp_conn_req_max_q
default value = 128
number of connections in ESTABLISHED state
(3-way handshake completed; not yet accepted)
tcp_conn_req_max_q0
default value = 1024
number of connections in SYN_RCVD state
SACK is now available in an experimental release for Solaris 2.6. To
obtain it, see [51]ftp://playground.sun.com/pub/sack/tcp.sack.tar.Z
Additional Info about recent versions of solaris can be found at
[52]http://www.rvs.uni-hannover.de/people/voeckler/tune/EN/tune.html#thp
_________________________________________________________________
Details about SACK under Solaris 7
Solaris 7 includes SACK, which is on in "passive" mode by default.
That means it is enabled only if the other side sends sackok in the
initial SYN. To make it active, set tcp_sack_permitted to 2. The
default is 1. To completely disable SACK, set tcp_sack_permitted to 0.
The tcp_sack_permitted variable can be set using the ndd command as
described above. Other kernel variables remain the same under Solaris
7 as they were in 2.5.
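One consequence of the passive default is easier to see when modelled. The sketch below is a simplified model with a hypothetical function name: it assumes that only an active (2) host offers sackok in its initial SYN, and that a responder only agrees to SACK when the SYN carried the offer.

```python
def sack_negotiated(initiator_setting, responder_setting):
    """True if SACK ends up in use on a connection between two hosts.

    Settings follow tcp_sack_permitted: 0 = off, 1 = passive, 2 = active.
    """
    # Only an active (2) initiator offers sackok in its initial SYN.
    initiator_offers = initiator_setting == 2
    # Passive (1) and active (2) responders accept an offer; 0 never does.
    responder_accepts = responder_setting in (1, 2)
    return initiator_offers and responder_accepts
```

Under this model, two hosts that both keep the default of 1 never use SACK with each other; at least the side opening the connection must be set to 2.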
_________________________________________________________________