A lesson in TCP routing

In my work as a Linux and network administrator, my boss and I recently had occasion to learn that the routing of TCP traffic over the Internet in 2014 works very differently from what we learnt one or two decades earlier.

TCP/IP stands for Transmission Control Protocol / Internet Protocol and is the standard by which client-server communication happens over the Internet. A client - a workstation or a web browser on a workstation - initiates a connection to a server - an Apache or IIS application on a server computer across the Internet. They exchange streams of bytes, each end acting both as source and destination of such a stream. The source chops up its stream of bytes into packets and each packet finds its way across the Internet to the destination, where the packets are collected and put back in sequence, and the stream of bytes is extracted.

"The Internet" is an internet, and an internet is a combination of one or more local networks (usually Ethernets). The local networks are interconnected by "routers". The source of traffic sends the traffic to the nearest router, which sends it onward on a different network that leads closer to the destination. After a number of such hops from one local network onto another, the packets eventually arrive at the destination.

The problem

The problem isn't really important. But the subtlety of the cause and the magnitude of the effect kicked off an investigation that overturned a few long-held ideas we had about how packets are routed between networks.

The ingredients to observe the problem are:

A Windows client workstation running a web browser. The MTU in its network interface is set correctly to 1500 and, as in all modern operating systems, path MTU discovery is on. The MTU is the size limit for the payload of an Ethernet frame on the local network. This limits the size of an Internet packet or packet fragment. [1]
A web server.
A router running Debian Linux with kernel 3.2.60 as announced in Debian Security Advisory 2972 [2]. The router has at least two network interfaces. By default, the network interfaces have TCP segmentation offload turned on. The term offload refers to the fact that the operating system lets the network hardware do the work; it offloads the work to the hardware [3].
Two Ethernets, one connecting the client to the router on one interface and one connecting the server to the router on the other interface. Both Ethernets use standard frame sizes so that MTU settings of 1500 are correct everywhere in the experimental setup. [1]

The client sends a small volume of traffic to ask for a web page. The server then responds with a large volume of traffic to deliver the web page to the client. However, the router refuses to forward almost all traffic from the server, alleging that the packets are too big to be sent on via the Ethernet that leads to the client. The server cannot figure out how to react to the error messages from the router and hardly any of its traffic makes it through to the client.

The problem is quite peculiar and not really important here. There are many measures one could take, each of which avoids the problem on its own:

Use a Linux client. Note that it is something the client says to the server that causes the server's traffic to fail; this is perplexing.
Use a smaller MTU on the Windows client. We discovered this by accident, because one client had a Cisco VPN client installed, which had changed the MTU to 1300. Again, note the client settings affect the server's traffic success rate.
Turn path MTU discovery off on the Windows client. This has two effects. First, the maximum segment size announced by the client is reduced to 536, which corresponds to an MTU of 576. Second, the packets sent by the client no longer have the DF (do not fragment) flag set. [4]
Even though the web server encounters the problem and receives the error messages from the router, the details of the server do not matter. It can be Windows or Linux, Apache or IIS.
More understandable is that changes on the router eliminate the problem: Going back to kernel 3.2.57 avoids the problem. Turning off all offloading to the network cards avoids the problem.

Lessons learnt

TCP handshake and flow control

At the start of a TCP connection there is a three-way handshake. The client sends an empty packet with the SYN (synchronise sequence numbers) flag set. The server replies with an empty packet with SYN and ACK (acknowledge receipt) set. The client responds with a packet with ACK set. [5]

What we had not fully realised is how much other information the client and server exchange about each other in the handshake.

The handshake probably contains information, sent from the client to the server, that triggers the problem when the client runs Windows.

Noteworthy is the MSS parameter that is communicated in the initial handshake. This is the "maximum segment size", i.e. the largest amount of TCP data in a single packet that the sender of the parameter is willing to receive. We find this typically set to 1460, which is the sender's local MTU (1500) minus the combined length of the IP and TCP headers (20 each, 40 in total). [6]
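
The relation between MTU and MSS is simple arithmetic. The following sketch assumes IPv4 and TCP headers without options (20 bytes each); the function name is our own and purely illustrative.

```python
IP_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20  # bytes, TCP header without options

def mss_for_mtu(mtu):
    """Largest TCP payload that fits into one packet of the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER

print(mss_for_mtu(1500))  # 1460, standard Ethernet
print(mss_for_mtu(1300))  # 1260, the MTU set by the Cisco VPN client
```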

Also noteworthy is the win parameter. This is communicated in all packets throughout the connection and changes during the connection. It tends to start out fairly large (14600 for a Linux sender, 8192 for a Windows sender) but reduces later to something around the 500 mark. The main purpose of this parameter seems to be for the recipient of the bulk data to signal to the sender that transmission should be stopped for a while to allow the receiver to process the data received so far. [7]

Further, there is a wscale parameter, communicated in the handshake only. win characterises how much data the receiver can receive next, and is used to inhibit transmission while the receiver processes. wscale addresses the problem of using the bandwidth efficiently when the connection is long distance and high bandwidth. The sender should then send more traffic before expecting an acknowledgement, which requires a window larger than the 65535 bytes that the 16-bit win field can express. wscale is the number of bits by which the win value is shifted to the left, i.e. win is multiplied by 2 to the power of wscale.

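In our experiment we saw win=14600,wscale=6 from a Linux client and win=8192,wscale=8 from a Windows server. Strictly speaking, the win value in the handshake packets is not itself scaled; the scale factor applies to the win values of later packets. Applying it to the handshake values nevertheless gives an idea of the larger window sizes that become possible [8]: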

W_L = 14600 · 2⁶ = 934400
W_W = 8192 · 2⁸ = 2097152
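
The same arithmetic as a small sketch; the function name is ours. The last line applies the Linux scale factor to a win value of 500, on the assumption that the values "around the 500 mark" seen later in the connection are scaled ones.

```python
def scaled_window(win, wscale):
    """Window size in bytes: the win value shifted left by wscale bits."""
    return win << wscale

print(scaled_window(14600, 6))  # 934400
print(scaled_window(8192, 8))   # 2097152
print(scaled_window(500, 6))    # 32000
```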

MTU variations and fragmentation

Traditionally, the sender of packets would use its local MTU as the maximum size of the packets it sends. As the packet hops from router to router, it may happen that a router cannot pass it on because the MTU on the next network is too small. What would then normally happen is that the router splits the packet into fragments. Each fragment carries its own IP header, which identifies the original packet and the fragment's position in it. Only the first fragment contains the TCP header. [9]

Fragments would then be reassembled into packets at the destination, not on a router. [9]
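
As an illustration of the splitting, here is a minimal sketch for IPv4 with a 20-byte header without options. Fragment offsets are counted in units of 8 bytes, so every fragment except the last carries a multiple of 8 bytes of data. The function name and the tuple layout are our own.

```python
IP_HEADER = 20  # bytes, IPv4 header without options

def fragment(payload_len, mtu):
    """Return (offset in 8-byte units, data length, more-fragments flag) per fragment."""
    max_data = (mtu - IP_HEADER) // 8 * 8  # data per fragment, a multiple of 8
    fragments = []
    offset = 0
    while offset < payload_len:
        length = min(max_data, payload_len - offset)
        more = offset + length < payload_len
        fragments.append((offset // 8, length, more))
        offset += length
    return fragments

# A 1500-byte packet (1480 bytes after the IP header) meets a network with MTU 1300.
print(fragment(1480, 1300))  # [(0, 1280, True), (160, 200, False)]
```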

It has always been possible for the sender to flag a packet as "DF" or "do not fragment". If a router is asked to forward such a packet, but cannot, it will send a fragmentation-required error message to the source of the packet, telling it what the MTU value is that causes the dilemma. The source can then interpret the message and resend smaller packets. [9]
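
The decision a router faces can be sketched as follows. This is a simplified model for illustration, not the code of any real router; the class, the function and the returned strings are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    length: int  # total length of the IP packet in bytes
    df: bool     # True if the "do not fragment" flag is set

def forward(packet, next_hop_mtu):
    """Decide what to do with a packet headed for a network with the given MTU."""
    if packet.length <= next_hop_mtu:
        return "forward"
    if packet.df:
        # The error message tells the source which MTU was too small.
        return f"drop, send fragmentation-required (MTU {next_hop_mtu})"
    return "fragment and forward"

print(forward(Packet(1500, df=True), 1500))   # forward
print(forward(Packet(1500, df=False), 1300))  # fragment and forward
print(forward(Packet(1500, df=True), 1300))   # drop, send fragmentation-required (MTU 1300)
```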

Path MTU discovery and DF flag

Several changes in typical behaviour have occurred over time:

All modern operating systems now try to discover the "path MTU". They would like to learn the smallest MTU on the networks that need to be traversed from source to destination. The sender can then compose small enough packets to begin with and the complications and inefficiencies of fragmentation can be avoided. To perform path MTU discovery, the sender sends all packets with the DF flag set. No router can then fragment a packet; a router that cannot forward a packet as it is must send back an error message. The sender uses the error messages iteratively to learn the smallest MTU along the path to the destination. It will then make packets small enough to slip through without fragmentation. [4]
Badly configured firewalls may block all ICMP traffic. The fragmentation-required message is ICMP traffic. Hence such firewalls break path MTU discovery. [4]
Firewalls re-assemble packets from fragments in order to carry out their rule checks. Firewalls are routers and traditionally would not have done this. [9]

These changes are at odds with each other. In the absence of fragmentation-required messages, path MTU discovery can be carried out differently, by probing with increasing packet sizes and detecting the resulting throughput problems. [4]
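
The classic, error-message-driven form of path MTU discovery can be sketched as a loop. This is a toy model under the assumption that every router reports its MTU faithfully and no firewall drops the messages; the function name and the list of hop MTUs are invented for the example.

```python
def discover_path_mtu(local_mtu, hop_mtus):
    """Return (path MTU, number of packets rejected on the way to learning it)."""
    size = local_mtu
    rejected = 0
    while True:
        for mtu in hop_mtus:
            if size > mtu:
                # This router sends fragmentation-required with its MTU;
                # the sender shrinks its packets and tries again.
                size = mtu
                rejected += 1
                break
        else:
            # The packet passed every hop without complaint.
            return size, rejected

print(discover_path_mtu(1500, [1500, 1400, 1300, 1500]))  # (1300, 2)
```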

tcpdump and offload

In diagnosing our problem our main tool was to record and inspect traffic as it enters and exits the router, using the tcpdump utility. To our surprise, packets received and sent were too large to fit the MTU, typically carrying small integer multiples of 1460 bytes (MTU minus IP and TCP headers).

tcpdump inspects packets as they are handed from the network hardware to the operating system. It turns out that the large packets seen by the OS kernel are an artefact of "offload". With offload on, the network hardware does not exchange traffic frame by frame with the operating system. Rather, it combines multiple received frames and splits outgoing traffic into multiple sent frames. [10,11,12]

For a while we surmised that the network card was carrying out defragmentation. In this mental image, the sender of the traffic would have sent a packet far exceeding the network MTU, its sending network interface would have split the packet into fragments, and the receiving network interface would have reassembled the fragments into the original packet. But this did not make sense, as all traffic was marked DF - do not fragment. [9]

Using the additional ethtool utility, we found the network interfaces configured as follows:

ethtool -k eth1
  Features for eth1:
  rx-checksumming: on
  tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-unneeded: off [fixed]
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: on
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
  scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
  tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp6-segmentation: on
  udp-fragmentation-offload: off [fixed]
  generic-segmentation-offload: on
  generic-receive-offload: on
  large-receive-offload: off [fixed]
  rx-vlan-offload: on
  tx-vlan-offload: on
  ntuple-filters: off [fixed]
  receive-hashing: on
  highdma: on [fixed]
  rx-vlan-filter: off [fixed]
  vlan-challenged: off [fixed]
  tx-lockless: off [fixed]
  netns-local: off [fixed]
  tx-gso-robust: off [fixed]
  tx-fcoe-segmentation: off [fixed]
  fcoe-mtu: off [fixed]
  tx-nocache-copy: on
  loopback: off [fixed]
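
A listing like this is easy to scan by program. The sketch below assumes the "name: on|off [fixed]" layout shown above and works on a shortened sample; the function name is our own.

```python
def parse_features(text):
    """Map each feature name to a pair (is_on, is_fixed)."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # skips the "Features for eth1:" heading and blank lines
        words = value.split()
        features[name] = (words[0] == "on", "[fixed]" in words)
    return features

sample = """\
Features for eth1:
tcp-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
"""

# Features that are on and that we are allowed to switch off.
changeable = [name for name, (on, fixed) in parse_features(sample).items()
              if on and not fixed]
print(changeable)  # ['tcp-segmentation-offload', 'generic-receive-offload']
```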

It turns out that between the upper layers of the IP stack in the operating system and the network hardware, each large packet of the upper layers corresponds to multiple packets (not fragments) at the lower layer. The smaller packets all fit the network MTU, while the larger packet at the higher levels is more efficient to process. [10,11,12]
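
To relate the large packets seen in tcpdump to what travels on the wire, the following sketch computes the lengths of the MTU-sized packets that carry a given amount of TCP data. It assumes 20-byte IP and TCP headers without options, and the function name is ours.

```python
IP_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20  # bytes, TCP header without options

def wire_packet_lengths(payload_len, mtu=1500):
    """Lengths of the IP packets needed to carry payload_len bytes of TCP data."""
    mss = mtu - IP_HEADER - TCP_HEADER
    lengths = []
    remaining = payload_len
    while remaining > 0:
        chunk = min(mss, remaining)
        lengths.append(chunk + IP_HEADER + TCP_HEADER)
        remaining -= chunk
    return lengths

print(wire_packet_lengths(5840))  # [1500, 1500, 1500, 1500]
print(wire_packet_lengths(4000))  # [1500, 1500, 1120]
```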

The offloading may merely shift the work of splitting and reassembling large packets from the operating system to the network card. But this also shifts the work such that tcpdump in one case sees the small MTU-matching packets and in the other it sees the large efficiently processable packets.

Since they are all packets, the fact that they all have the DF flag formally does not matter. The small units are full IP packets and not fragments of packets, so the DF flag is formally obeyed. Whether its intention is bypassed is debatable. One has to recall that the intention of the DF flag has changed; in the era of ubiquitous path MTU discovery the DF flag is almost meaningless (set on virtually all packets). [4]

[12] seems to indicate that the upper level big packet in the receiving OS may correspond to the big packet in the sending OS, but we have doubts about this. The intention (efficiency in the OS) and placement of offloading in the whole process of IP seem to hint that the big packets are just ephemeral entities in the sending or receiving OS, with no intended correspondence of these entities between sending and receiving OS. However, de facto, the flow of packets and the use of PSH flags in TCP packets might approximately achieve such a correspondence: one big packet at the sending end turns into a burst of small packets on the network, and it may then be likely that those very packets are reassembled into the same large packet in the receiver.

TSO: TCP segmentation offload

This applies to outbound TCP traffic. Segmentation means the splitting of the large packet at the higher level of the IP stack into MTU-sized packets before they go out on the network. [13]

UFO: UDP fragmentation offload

Apparently it is uncommon for network cards to handle segmentation of outgoing non-TCP traffic. Indeed our network interface is shown as UFO fixed off. [13]

GSO: generic segmentation offload

GSO is apparently a generalisation of TSO to traffic that can be TCP or not, with the splitting deferred by the kernel until just before the traffic is handed to the network driver. For the TCP traffic in our experiment, GSO should be equivalent to TSO. [13]

LRO: large receive offload

This applies to inbound traffic. Multiple incoming packets are combined into a bigger packet for the higher levels of the IP stack. Oddly, this is off and fixed to be off. [14]

GRO: generic receive offload

It is unclear what this is, but we can turn it on or off. Clearly it applies to inbound traffic. Also, our experiments show that incoming MTU-sized TCP packets are turned into large TCP packets. Since this is the only receive offload setting that is on, this must be the setting responsible. [14]

Should we offload or not?

The objective of offloading is efficiency and throughput. This may or may not happen, depending on the implementation in the networking hardware and on the power of the CPU and operating system. In the Linux community, enthusiasm is limited also because the offloading code is proprietary and cannot be fixed for security problems it may have. [3]

In our case, we have to turn off offloading at least in part. This is to allow the use of the 3.2.60 kernel without stifling traffic from servers to Windows clients. If we want to try a partial switching off, we need to recall the problem. After the router kernel sees the large packet assembled by the receiving network interface, it seems to refuse to pass it for transmission.

Initially, we should turn off TSO and GSO, but expect that that will not fix the problem. Then we should turn off GRO instead and expect that to fix the problem. We have already established that turning off all three does fix the problem.

In the end we should probably go with the open-source argument [3] and turn off all three offloads - TSO, GSO, GRO - on all network interfaces.

How to turn it off

The ethtool utility, in setting features on or off, uses short feature names that differ from the long feature names displayed above. In particular [10]:

ethtool -K eth1 gso off
ethtool -K eth1 tso off
ethtool -K eth1 gro off

ethtool -K eth1 lro off
  Cannot change large-receive-offload
ethtool -K eth1 ufo off
  Cannot change udp-fragmentation-offload
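
The correspondence between the long names in the listing and the short names used for setting can be kept in a small table. The sketch below only builds the command lines as text and does not run them; the table covers just the five offload features discussed here, and the function name is our own.

```python
# Long name as displayed by "ethtool -k", short name as accepted by "ethtool -K".
SHORT_NAMES = {
    "tcp-segmentation-offload": "tso",
    "udp-fragmentation-offload": "ufo",
    "generic-segmentation-offload": "gso",
    "generic-receive-offload": "gro",
    "large-receive-offload": "lro",
}

def disable_commands(interface, features):
    """Build the ethtool command lines that switch the given features off."""
    return [f"ethtool -K {interface} {SHORT_NAMES[name]} off" for name in features]

for command in disable_commands("eth1", ["generic-segmentation-offload",
                                         "tcp-segmentation-offload",
                                         "generic-receive-offload"]):
    print(command)
```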

Finally, we have the extra complication that on one network interface the router does VLAN tagging [15]. In addition to eth3 itself we have several interfaces eth3.N where N is the VLAN number. If we turn the features off only for eth3, then the eth3.N interfaces show the feature as off but requested on. This is sufficient for the routing problem to go away.

The feature manipulation does not persist across a reboot, so the obvious place to make these settings is as a pre-up or post-up command in /etc/network/interfaces for each interface as and when it is brought up.

auto eth3.2
iface eth3.2 inet static
  address 192.168.131.254
  netmask 255.255.254.0
  network 192.168.130.0
  broadcast 192.168.131.255
  post-up ethtool -K eth3 gso off
  post-up ethtool -K eth3 tso off
  post-up ethtool -K eth3 gro off

This changes eth3 itself and not eth3.N, but that is sufficient.

The fix in practice

We returned the router to its faulty state - kernel 3.2.60 and offload on - and then turned off offload features one by one and interface by interface.

Turning off GSO and TSO only seems to have no positive effect. The kernel will still receive large packets and still fails to segment them. It is necessary to turn GRO off.
Turning off GRO only seems to fix the problem. The kernel will receive only MTU-sized IP packets and has no trouble sending them out again.
Turning off offload on all eth3.N is not sufficient; it has to be turned off on eth3 itself.
Turning off offload on eth3 only also effectively turns it off for all eth3.N.

In general, on non-routing Linux systems, we leave TCP offload unchanged. But when there is a big problem like this, we turn it off altogether - all three features on all ethN interfaces.

References

[1] "Maximum transmission unit (MTU)". Wikipedia.
[2] Debian (2014). "DSA-2972-1 linux -- security update". Debian Security Advisories.
[3] "TCP offload engine (TOE)". Wikipedia.
[4] "Path MTU Discovery (PMTUD)". Wikipedia.
[5] Charles M. Kozierok (2005). The TCP/IP guide.
[6] "Maximum segment size (MSS)". Wikipedia.
[7] "Transmission Control Protocol (TCP)". Wikipedia.
[8] "TCP window scale option". Wikipedia.
[9] "IP fragmentation". Wikipedia.
[10] Jeff Morriss (2012). "Re: wireshark sees jumbo TCP packets in linux". Wireshark-users mailing list.
[11] "Capture setup - Offloading". Wireshark wiki.
[12] Dan Siemon (2013). "Queueing in the Linux network stack". http://www.coverfire.com
[13] "Large segment offload (LSO)". Wikipedia.
[14] "Large receive offload (LRO)". Wikipedia.
[15] "Virtual LAN (VLAN)". Wikipedia.