A lesson in TCP routing

In my work as a Linux and network administrator, my boss and I recently had occasion to learn that the routing of TCP traffic over the Internet in 2014 works very differently from what we had learnt one or two decades earlier.

TCP/IP stands for Transmission Control Protocol / Internet Protocol and is the standard by which client server communication happens over the Internet. A client - workstation or web browser on a workstation - initiates a connection to a server - an Apache or IIS application on a server computer across the Internet. They exchange streams of bytes, each end acting both as source and destination of such a stream. The source chops up its stream of bytes into packets, each packet finds its way across the Internet to the destination, where the packets are collected, put back in sequence and the stream of bytes extracted.

"The Internet" is an internet, and an internet is a combination of one or more local networks (usually Ethernets). The local networks are interconnected by "routers". The source of traffic sends the traffic to the nearest router, which sends it onward on a different network that leads closer to the destination. After a number of such hops from one local network onto another, the packets eventually arrive at the destination.

The problem

The problem isn't really important. But the subtlety of the cause and the magnitude of the effect kicked off an investigation that overturned a few long-held ideas we had about how packets are routed between networks.

The ingredients to observe the problem are:

The client sends a small volume of traffic to ask for a web page. The server then responds with a large volume of traffic to deliver the web page to the client. However, the router refuses to forward almost all traffic from the server, alleging that the packets are too big to be sent on via the Ethernet that leads to the client. The server cannot figure out how to react to the error messages from the router and hardly any of its traffic makes it through to the client.

The problem is quite peculiar and not really important here. There are many measures one could take, each of which avoids the problem on its own:

Lessons learnt

TCP handshake and flow control

At the start of a TCP connection there is a three-way handshake. The client sends an empty packet with the SYN (synchronise parameters) flag set. The server replies with an empty packet with SYN and ACK (acknowledge receipt) set. The client responds with a packet with ACK set. [5]

What we had not really realised is how much other information the client and server are exchanging about each other in the handshake.

The handshake probably carries information from the client to the server that, in the case of a Windows client, triggers the problem.

Noteworthy is the MSS parameter that is communicated in the initial handshake. This is the "maximum segment size", i.e. the largest amount of TCP data per packet that the sender is willing to receive. We find this typically set to 1460, which is the sender's local MTU (1500) minus the length of the IP and TCP headers (20 bytes each, 40 in total). [6]
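The subtraction behind the 1460 figure can be sketched in shell. This is only an illustration: the function name mss_for_mtu is made up here, and it assumes IPv4 and TCP headers without options, 20 bytes each.

```shell
#!/bin/sh
# Sketch: derive the MSS a host would typically advertise from its local MTU.
# Assumption: IPv4 and TCP headers carry no options, so each is 20 bytes.
mss_for_mtu() {
    mtu=$1
    ip_header=20    # IPv4 header without options (assumption)
    tcp_header=20   # TCP header without options (assumption)
    echo $(( mtu - ip_header - tcp_header ))
}

mss_for_mtu 1500   # prints 1460
```

With header options in use, the usable segment size would be correspondingly smaller.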

Also noteworthy is the win parameter. This is communicated in all packets throughout the connection and changes during the connection. It tends to start out fairly large (14600 for a Linux sender, 8192 for a Windows sender) but reduces later to something around the 500 mark. The main purpose of this parameter seems to be for the recipient of the bulk data to signal to the sender that transmission should be stopped for a while to allow the receiver to process the data received so far. [7]

Further, there is a wscale parameter, communicated in the handshake only. win characterises how much data the receiver can receive next, and is used to inhibit transmission while the receiver processes. wscale addresses the problem of using the bandwidth efficiently when the connection is long distance and high bandwidth. The sender should then send more traffic before expecting an acknowledgement. To do so, sender and receiver keep a larger buffer for traffic that is still in transit on the network; wscale is the power of two by which win is multiplied to give the size of this larger window.

In our experiment we saw win=14600,wscale=6 from a Linux client and win=8192,wscale=8 from a Windows server. The scaled window sizes are therefore [8]:

W_L = 14600 · 2^6 = 934400
W_W = 8192 · 2^8 = 2097152
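The two products above can be reproduced with shell arithmetic, treating wscale as a left-shift count applied to win. The function name scaled_window is hypothetical, and the sketch simply takes the observed win and wscale values at face value.

```shell
#!/bin/sh
# Sketch: scaled window = win * 2^wscale, computed as a left shift.
scaled_window() {
    win=$1
    wscale=$2
    echo $(( win << wscale ))
}

scaled_window 14600 6   # Linux client: prints 934400
scaled_window 8192 8    # Windows server: prints 2097152
```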

MTU variations and fragmentation

Traditionally, the sender of packets would use its local MTU as the maximum size of the packets it sends. As the packet hops from router to router, it may happen that a router cannot pass it on because the MTU on the next network is too small. What would normally happen then was that the router would split the packet into fragments. Each fragment would carry its own IP header, with fields that identify the original packet and the fragment's position in the sequence of fragments; only the first fragment would contain the TCP header. [9]

Fragments would then be reassembled into packets at the destination, not on a router. [9]

It has always been possible for the sender to flag a packet as "DF" or "do not fragment". If a router is asked to forward such a packet, but cannot, it will send a fragmentation-required error message to the source of the packet, telling it what the MTU value is that causes the dilemma. The source can then interpret the message and resend smaller packets. [9]
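One way to provoke this behaviour by hand is to send a DF-flagged ping whose size exactly fills a given MTU. The sketch below only prints the command rather than running it. It assumes the Linux iputils ping (whose -M do option sets DF), IPv4, and an 8-byte ICMP header; the function name pmtu_probe_cmd and the address 192.0.2.1 are placeholders.

```shell
#!/bin/sh
# Sketch: print a ping command that sends one DF-flagged packet of exactly
# MTU bytes. Assumptions: Linux iputils ping, IPv4 header of 20 bytes,
# ICMP echo header of 8 bytes.
pmtu_probe_cmd() {
    host=$1
    mtu=$2
    payload=$(( mtu - 20 - 8 ))   # ICMP payload that fills the MTU exactly
    echo "ping -M do -c 1 -s $payload $host"
}

pmtu_probe_cmd 192.0.2.1 1500   # prints: ping -M do -c 1 -s 1472 192.0.2.1
```

If a router on the path has a smaller MTU, the printed command should draw a fragmentation-required reply instead of an echo reply, provided those error messages are not filtered on the way back.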

Path MTU discovery and DF flag

Several changes in typical behaviour have occurred over time:

These changes are at odds with each other. In the absence of fragmentation-required messages, path MTU discovery can be carried out differently with increasing packet sizes and by detecting resulting throughput problems. [4]

tcpdump and offload

In diagnosing our problem our main tool was to record and inspect traffic as it enters and exits the router, using the tcpdump utility. To our surprise, packets received and sent were too large to fit the MTU, typically small integer multiples of 1460 (the MSS, i.e. MTU minus IP and TCP headers).

tcpdump inspects packets as they are handed from the network hardware to the operating system. It turns out that the large packets seen by the OS kernel are an artefact of "offload". With offload on, the network hardware does not exchange traffic frame by frame with the operating system. Rather, it combines multiple received frames and splits outgoing traffic into multiple sent frames. [10,11,12]
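To relate a length reported by tcpdump to frames on the wire, one can compute how many MSS-sized segments it corresponds to. This is a rough sketch: the function name frames_for_segment and the example lengths are made up, and the MSS defaults to the 1460 observed above.

```shell
#!/bin/sh
# Sketch: number of MSS-sized frames needed to carry a large segment as seen
# by tcpdump with offload on. Rounds up, so a partial last frame counts as one.
# A capture could be taken with, for example: tcpdump -n -i eth1 tcp
frames_for_segment() {
    len=$1
    mss=${2:-1460}   # default MSS assumed from the 1500-byte MTU case
    echo $(( (len + mss - 1) / mss ))
}

frames_for_segment 4380   # prints 3
frames_for_segment 5000   # prints 4
```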

For a while we surmised that the network card was carrying out defragmentation. In this mental image, the sender of the traffic would have sent a packet far exceeding the network MTU, its sending network interface would have split the packet into fragments, and the receiving network interface would have reassembled the fragments into the original packet. But this did not make sense, as all traffic was marked DF - do not fragment. [9]

Using the additional ethtool utility, we found the network interfaces configured thusly:

ethtool -k eth1
  Features for eth1:
  rx-checksumming: on
  tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-unneeded: off [fixed]
        tx-checksum-ip-generic: off [fixed]
        tx-checksum-ipv6: on
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
  scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: off [fixed]
  tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp6-segmentation: on
  udp-fragmentation-offload: off [fixed]
  generic-segmentation-offload: on
  generic-receive-offload: on
  large-receive-offload: off [fixed]
  rx-vlan-offload: on
  tx-vlan-offload: on
  ntuple-filters: off [fixed]
  receive-hashing: on
  highdma: on [fixed]
  rx-vlan-filter: off [fixed]
  vlan-challenged: off [fixed]
  tx-lockless: off [fixed]
  netns-local: off [fixed]
  tx-gso-robust: off [fixed]
  tx-fcoe-segmentation: off [fixed]
  fcoe-mtu: off [fixed]
  tx-nocache-copy: on
  loopback: off [fixed]
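Only a handful of lines in this listing concern segmentation and receive offload. A small filter, sketched below under the assumption that the output format matches the listing above, picks out the five top-level offload lines; the function name offload_summary is hypothetical.

```shell
#!/bin/sh
# Sketch: reduce "ethtool -k" output on stdin to the five offload lines
# discussed here. Sub-features such as tx-tcp-segmentation are not matched.
# Intended use (needs ethtool and a real interface):
#   ethtool -k eth1 | offload_summary
offload_summary() {
    grep -E '^[[:space:]]*(tcp-segmentation-offload|udp-fragmentation-offload|generic-segmentation-offload|generic-receive-offload|large-receive-offload):'
}
```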

It turns out that between the upper layers of the IP stack in the operating system and the network hardware, each large packet of the upper layers corresponds to multiple packets (not fragments) at the lower layer. The smaller packets all fit the network MTU, while the larger packet at the higher levels is more efficient to process. [10,11,12]

The offloading may merely shift the work of splitting and reassembling large packets from the operating system to the network card. But this also shifts the work such that tcpdump in one case sees the small MTU-matching packets and in the other it sees the large efficiently processable packets.

Since they are all packets, the fact that they all have the DF flag formally does not matter. The small units are full IP packets and not fragments of packets, so the DF flag is formally obeyed. Whether its intention is bypassed is debatable. One has to recall that the intention of the DF flag has changed; in the era of ubiquitous path MTU discovery the DF flag is almost meaningless (set on virtually all packets). [4]

[12] seems to indicate that the upper level big packet in the receiving OS may correspond to the big packet in the sending OS, but we have doubts about this. The intention (efficiency in the OS) and placement of offloading in the whole process of IP seem to hint that the big packets are just ephemeral entities in the sending or receiving OS, with no intended correspondence of these entities between sending and receiving OS. However, de facto, the flow of packets and the use of PSH flags in TCP packets might approximately achieve such a correspondence: one big packet at the sending end turns into a burst of small packets on the network, and it may then be likely that those very packets are reassembled into the same large packet in the receiver.

TSO: TCP segmentation offload

This applies to outbound TCP traffic. Segmentation means the splitting of the large packet at the higher level of the IP stack into MTU-sized packets before they go out on the network. [13]

UFO: UDP fragmentation offload

Apparently it is uncommon for network cards to handle segmentation of outgoing non-TCP traffic. Indeed our network interface is shown as UFO fixed off. [13]

GSO: generic segmentation offload

GSO is apparently a generalisation of TSO to traffic that can be TCP or not. Given that non-TCP traffic can probably not be segmented, GSO should be equivalent to TSO. [13]

LRO: large receive offload

This applies to inbound traffic. Multiple incoming packets are combined into a bigger packet for the higher levels of the IP stack. Oddly, this is off and fixed to be off. [14]

GRO: generic receive offload

It is unclear what this is, but we can turn it on or off. Clearly it applies to inbound traffic. Also, our experiments show that incoming MTU-sized TCP packets are turned into large TCP packets. Since this is the only receive offload setting that is on, this must be the setting responsible. [14]

Should we offload or not?

The objective of offloading is efficiency and throughput. This may or may not happen, depending on the implementation in the networking hardware and on the power of the CPU and operating system. In the Linux community, enthusiasm is limited also because the offloading code is proprietary and cannot be fixed for security problems it may have. [3]

In our case, we have to turn off offloading at least in part. This is to allow the use of the 3.2.60 kernel without stifling traffic from servers to Windows clients. If we want to try a partial switching off, we need to recall the problem. After the router kernel sees the large packet assembled by the receiving network interface, it seems to refuse to pass it for transmission.

Initially, we should turn off TSO and GSO, but expect that this will not fix the problem. Then we should turn off GRO instead and expect that to fix the problem. We have already established that turning off all three does fix the problem.

In the end we should probably go with the open-source argument [3] and turn off all three offloads - TSO, GSO, GRO - on all network interfaces.

How to turn it off

The ethtool utility, in setting features on or off, uses feature names that have no resemblance to the feature names displayed above. In particular [10]:

ethtool -K eth1 gso off
ethtool -K eth1 tso off
ethtool -K eth1 gro off

ethtool -K eth1 lro off
  Cannot change large-receive-offload
ethtool -K eth1 ufo off
  Cannot change udp-fragmentation-offload

Finally, we have the extra complication that on one network interface the router does VLAN tagging [15]. In addition to eth3 itself we have several interfaces eth3.N where N is the VLAN number. If we turn the features off only for eth3, then the eth3.N interfaces show the feature as off but requested on. This is sufficient for the routing problem to go away.

The feature manipulation does not persist across a reboot, so the obvious place to make these settings is as a pre-up or post-up command in /etc/network/interfaces for each interface as and when it is brought up.

auto eth3.2
iface eth3.2 inet static
  address 192.168.131.254
  netmask 255.255.254.0
  network 192.168.130.0
  broadcast 192.168.131.255
  post-up ethtool -K eth3 gso off
  post-up ethtool -K eth3 tso off
  post-up ethtool -K eth3 gro off

This changes eth3 itself and not eth3.N, but that is sufficient.

The fix in practice

We returned the router to its faulty state - kernel 3.2.60 and offload on - and then turned off offload features one by one and interface by interface.

In general, on non-routing Linux systems, we leave TCP offload unchanged. But when there is a big problem like this, we turn it off altogether - all three features on all ethN interfaces.
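Turning off all three features on several interfaces can be scripted. The sketch below only prints the ethtool commands, so they can be reviewed before being run as root. The function name offload_off_cmds is hypothetical, and the interface names are passed in by the caller rather than discovered.

```shell
#!/bin/sh
# Sketch: print one ethtool command per interface that turns off GSO, TSO
# and GRO. Nothing is changed until the printed commands are executed.
offload_off_cmds() {
    for iface in "$@"; do
        echo "ethtool -K $iface gso off tso off gro off"
    done
}

offload_off_cmds eth1 eth3
# prints:
#   ethtool -K eth1 gso off tso off gro off
#   ethtool -K eth3 gso off tso off gro off
```

Piping the output into a root shell would apply the settings. As noted above, they do not persist across a reboot.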

References

  1. "Maximum transmission unit (MTU)". Wikipedia.
  2. Debian (2014). "DSA-2972-1 linux -- security update". Debian Security Advisories.
  3. "TCP offload engine (TOE)". Wikipedia.
  4. "Path MTU Discovery (PMTUD)". Wikipedia.
  5. Charles M. Kozierok (2005). The TCP/IP guide.
  6. "Maximum segment size (MSS)". Wikipedia.
  7. "Transmission Control Protocol (TCP)". Wikipedia.
  8. "TCP window scale option". Wikipedia.
  9. "IP fragmentation". Wikipedia.
  10. Jeff Morriss (2012). "Re: wireshark sees jumbo TCP packets in linux". Wireshark-users mailing list.
  11. "Capture setup - Offloading". Wireshark wiki.
  12. Dan Siemon (2013). "Queueing in the Linux network stack". http://www.coverfire.com
  13. "Large segment offload (LSO)". Wikipedia.
  14. "Large receive offload (LRO)". Wikipedia.
  15. "Virtual LAN (VLAN)". Wikipedia.