Response: Revenge of the TOE and TCP Offload problems

Very interesting story from the front lines here, where a lot of effort finally revealed that the TOE was causing a major problem:

This guy probably spent hundreds of hours testing and researching this problem. He finally admitted to a rather drastic solution, removing the TOE chip from the NICs in multiple servers. From my own research, the firmware on the card can be “problematic” and when the kernel driver is enabled (in Linux or VMware), odd behavior can sometimes be observed, including dropped packets, resets or suboptimal performance. But there’s lots of controversy surrounding this issue.

via Revenge of the TOE - Packet Pushers.

I’m hearing a lot of reports of problems with TOE drivers and hardware. A recent podcast with Jim Gettys about Bufferbloat also flagged offload engines as a problem:

NIC Offload engines generate bursts of line rate packet streams at multi-gigabit rates. These features are now “on” by default even in cheap consumer hardware including home routers, and certainly in data centers. Whether this is advisable (it is not…) is orthogonal to the reality of deployed hardware and current device drivers and default settings.

 The Internet is Broken, and How to Fix It

I’m beginning to think that TOE might be something to avoid. It’s also worth noting that the latest generation of Intel processors with DPDK makes TOE unnecessary, and probably CNAs for FibreChannel as well.

The times are changing.
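
For anyone who wants to test whether offload is behind a problem on a Linux host, the usual approach is to switch the features off with `ethtool -K` and see if the symptoms go away. The sketch below only builds the command line; it does not run it. The interface name `eth0` and the feature list are assumptions for illustration, and note that these are the stateless offloads (segmentation and receive coalescing), not a full TOE. Check `ethtool -k <interface>` to see what your driver actually exposes.

```python
import shlex

# Stateless offload features as named by `ethtool -K`.
# Assumed list for illustration; confirm with `ethtool -k <interface>`.
OFFLOAD_FEATURES = ("tso", "gso", "gro", "lro")


def build_disable_command(interface, features=OFFLOAD_FEATURES):
    """Return the argv list for an `ethtool -K` call that turns each feature off."""
    cmd = ["ethtool", "-K", interface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


if __name__ == "__main__":
    # "eth0" is a hypothetical interface name.
    print(shlex.join(build_disable_command("eth0")))
```

Keeping the command as a list makes it easy to hand to `subprocess.run` from a deployment script, which needs root privileges to change NIC settings.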

  • Ryan Malayter

    The best argument against TOE I’ve heard came from the Linux kernel mailing list. It went something like this:

    “Why would you take the world’s most hardened, reliable, and interoperable TCP/IP stack and replace it with a dumb closed-source version baked into silicon that can’t be easily changed, even if it has security vulnerabilities?”
    In our shop we disable all hardware TCP acceleration features via Windows group policy or Linux deployment scripts. VMware used to default it to off at the hypervisor layer, not sure about 5.1.
    The CPU overhead of TCP/IP is basically zero on modern hardware, and TOE is just a premature optimization with a lot of buggy implementations.

    • Will

      Also it makes your wireshark capture look ‘ugly’ with all the checksum errors.

      • Bill Karn

        Umm, that’s the least of your Wireshark problems when these features are enabled. When the packet lengths are over 1500 bytes (or 9k for jumbo frames) that should be a dead giveaway that you are not capturing the packets that are actually hitting the wire.
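
The oversized-packet giveaway mentioned above is easy to check mechanically. This is a minimal sketch, assuming you have already exported the captured frame lengths from your capture tool as a list of integers; the function name and the sample lengths are hypothetical. It assumes untagged Ethernet, so the largest frame that can appear on the wire is the MTU plus a 14-byte header (captures normally omit the FCS).

```python
ETH_HEADER_BYTES = 14  # destination MAC + source MAC + EtherType, no VLAN tag


def oversized_frames(frame_lengths, mtu=1500):
    """Return the captured lengths that are too large to have crossed the wire as one frame."""
    limit = mtu + ETH_HEADER_BYTES
    return [length for length in frame_lengths if length > limit]


if __name__ == "__main__":
    # Hypothetical lengths from a capture taken on the sending host.
    sample = [66, 1514, 2962, 65226]
    print(oversized_frames(sample))
    print(oversized_frames(sample, mtu=9000))
```

Any length this returns was captured before the NIC segmented it (or after it coalesced received segments), so the trace does not show what was actually on the wire.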

  • Michael Gonnason

    It reduces heat output of the CPU. The NIC is much more efficient at TCP than the host CPU. Most NICs have firmware that can be upgraded for bug fixes and patches.

    • Ryan Malayter

      TOE does almost nothing to reduce heat, since modern CPUs spend <<5% on TCP/IP-specific functions with the overwhelming majority of server workloads. You could argue that a CPU-based firewall, router, or IDS might need the extra help of a TOE with all the traffic, but in reality you need to do lots of intensive inspection of the TCP/IP protocol data on such devices, which has to be done in software anyway. So TOE is of little use.

      Firmware updates are operationally painful, and usually require extended downtime (often with a physical server touch). But firmware updates do little to address vulnerabilities, mis-features or bugs baked into the silicon. The drivers and firmware supplied by manufacturers have sucked uniformly since the introduction of TOEs, and have caused nothing but problems. Even on Intel NICs.

      • Michael Gonnason

        Hm I posted a link to a study done, but I think it got eaten…

        Network traffic is actually rather spendy CPU wise.

        General rule of thumb is that 1 bit per second requires 1 Hz to process, so 20 Gb/s of throughput requires approx. 20 GHz of processing power, or eight 2.5 GHz cores.
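
The arithmetic behind that rule of thumb can be written out as a small calculation. This is only a sketch of the commenter's estimate, not a measurement: the function name is hypothetical, and the `hz_per_bit` parameter is exposed because the 1 Hz per bit figure is exactly the assumption disputed earlier in the thread.

```python
import math


def cores_needed(throughput_gbps, core_ghz, hz_per_bit=1.0):
    """Estimate whole cores needed under the 'N Hz per bit per second' rule of thumb."""
    required_ghz = throughput_gbps * hz_per_bit
    return math.ceil(required_ghz / core_ghz)


if __name__ == "__main__":
    # The figures from the comment: 20 Gb/s on 2.5 GHz cores.
    print(cores_needed(20, 2.5))
```

With the comment's figures this gives 8 cores. Lowering `hz_per_bit` shows how quickly the estimate shrinks if the per-bit cost is smaller than the rule assumes.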