Infiniband over Ethernet is better than Ethernet, says VMware

There is a lot of talk about the value of Infiniband as a storage protocol to replace Fibre Channel, with several SSD vendors offering Infiniband options. Most likely this is partly about giving servers enough network bandwidth, but mostly it is about reducing latency and CPU consumption. Good Infiniband networks have latency measured in hundreds of nanoseconds and a much lower impact on system CPU because Infiniband uses RDMA to transfer data. RDMA (Remote Direct Memory Access) means that data is transferred from memory location to memory location, thus removing the encapsulation overhead of Ethernet and IP (that’s as short as I can make that description).
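
As a rough illustration (not taken from the VMware material), here is a minimal sketch in C against the libibverbs API. It registers an ordinary application buffer with the adapter; once registered, the NIC can DMA into and out of that buffer directly, which is what lets RDMA skip the per-packet trips through the kernel TCP/IP stack. The send/receive plumbing (queue pairs, completion queues, address exchange) is deliberately left out.

    /* Minimal sketch: assumes libibverbs is installed and an RDMA-capable
     * (or soft-RoCE) device is present. Link against libibverbs (-libverbs). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num_devices = 0;
        struct ibv_device **devs = ibv_get_device_list(&num_devices);
        if (!devs || num_devices == 0) {
            fprintf(stderr, "no RDMA devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;  /* protection domain */
        if (!ctx || !pd) {
            fprintf(stderr, "device/PD setup failed\n");
            return 1;
        }

        /* Register a plain application buffer with the adapter. After this,
         * the HCA can move data to/from it directly; no per-packet kernel copy. */
        size_t len = 4096;
        void *buf = malloc(len);
        memset(buf, 0, len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            fprintf(stderr, "ibv_reg_mr failed\n");
            return 1;
        }

        /* lkey/rkey are the handles a work request uses to name this memory. */
        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
               len, mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }

The registration step is the key point: once a buffer is pinned and keyed, a transfer is described to the adapter as a memory operation rather than being pushed through the socket and IP layers.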

Infiniband works especially well for server area networks because the scale is relatively small. A few hundred servers is a good size for an Infiniband switched network.

VMware has demonstrated some testing that proves this point. The presentation RDMA on vSphere: Update and Future Directions shows testing by VMware on the enormous performance benefits for vMotion when using RoCEE – RDMA over Converged Enhanced Ethernet. By using RDMA to reduce the protocol encoding overhead, performance can be dramatically improved.
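
To make “reducing the protocol encoding overhead” a little more concrete, here is a hedged sketch of how a bulk transfer is posted as a single RDMA WRITE work request instead of being streamed through TCP/IP. It again uses the libibverbs API; the connected queue pair and the out-of-band exchange of the peer’s buffer address and rkey are assumed to exist already and are not shown.

    /* Sketch only: qp is an already-connected reliable QP, local_mr is a
     * registered local buffer, and remote_addr/remote_rkey describe the
     * peer's registered buffer (exchanged out of band beforehand). */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                        uint64_t remote_addr, uint32_t remote_rkey,
                        uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_mr->addr,  /* local source buffer */
            .length = len,
            .lkey   = local_mr->lkey,
        };

        struct ibv_send_wr wr;
        struct ibv_send_wr *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;  /* memory-to-memory write */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;  /* raise a completion */
        wr.wr.rdma.remote_addr = remote_addr;        /* peer buffer address */
        wr.wr.rdma.rkey        = remote_rkey;        /* peer registration key */

        /* The adapter copies from the local buffer straight into the remote
         * buffer; neither host's TCP/IP stack touches the payload. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }

The CPU’s part is reduced to posting the descriptor and later reaping a completion; the adapter does the copy between registered buffers, which is broadly where the CPU savings in the vMotion tests would come from.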

I’ve taken these images from the testing report (linked above) performed by VMware’s CTO Office, and they show a striking performance improvement. If VMware adopts RoCEE into the core hypervisor, these tests suggest that we could see massive performance improvements, especially in CPU consumption, in the data centre without the hassle of using 40GbE and 10GbE Ethernet.

In my view, the reduction in CPU has serious implications for improving guest mobility in large-scale systems that have high GuestOS density. Imagine a server with fifty GuestOS instances using 80% less CPU during vMotion. Excited? You bet.

Reading this document has some serious impacts on how I look at Ethernet networks.

[Images: RDMA / Ethernet / Infiniband test result slides 1–5 from the VMware presentation]

Stunning results.

  • http://twitter.com/geoffarnold Geoff Arnold

    I’m curious why you say that this will yield benefits “without the hassle of using 40GbE and 10GbE Ethernet” when the test hardware was running 40GbE. (Of course a server with a 40G NIC is a rare and expensive beast; most regard 10GbE copper LOM as the price/performance sweet spot right now.)

    • http://etherealmind.com Etherealmind

      The hassle of software switching solutions is a real pain. When using RoCEE, the configuration of storage and data networks is much simpler since each network is separated into its own RQ channel. It is much more reliable, faster and easier to operate than conventional Ethernet.

  • Jason Costomiris

    While cool, for sure, in the HPC space, or even general compute pod space as we see apps sprawl across multiple servers (so-called east-west comms), this still seems like it will be a tough sell into a number of financials.  VMware’s blog post that showed the related video from the OpenFabrics con refers to this as “ULL”.  Of course, ULL is what all the cool kids in finance are after.

    Real-time financial data yields a small message size, and the graphs from the slide deck show substantial gaps between bare metal and VMs at smaller message size.  Up through 256 bytes, we’re talking 4us vs 32us. Giving up 28us is a huge deal breaker in that community.

    The later stuff on slide 12 is encouraging, but to approach bare-metal performance you need to do passthrough, dedicating NICs to specific VMs, and you wind up giving up many of the benefits that vSphere brings to the table. vMotion probably isn’t a huge deal, since these types of networks typically run a/b side live-live feeds with multiple servers. The question that comes to mind for me is: if I’m already enjoying ULL latencies on cheap 1RU servers using CEE NICs and doing RoCEE, why would I want to increase my failure domain by consolidating several servers into a larger system while maintaining the same network footprint?

  • http://www.londoncleaner.org/page-after-builders-cleaning.html Cleaner London

    VMware has demonstrated testing that shows this is beneficial and that performance is much improved. I also agree that with RoCEE the configuration of storage and data networks is much simpler, since each network is separated.

    • http://etherealmind.com Etherealmind

      The use of RQ is a much more effective way of configuring the network. Using FC for storage and Ethernet for data in dedicated channels is a much easier way of delivering networking.

  • Ricardo Oliveira

    You say: “RDMA (Remote Direct Memory Access) means that data is transferred from memory location to memory location, thus removing the encapsulation overhead of Ethernet and IP (that’s as short as I can make that description).” However, as soon as it is Infiniband over Ethernet it will get an Ethernet and IP header. Is it then not the same as, for example, FCoE?

    • http://etherealmind.com Etherealmind

      No, not at all. When the data stream is handled by the IP stack, the data is processed by a separate CPU process, and this requires many bus transactions.

      RDMA provides a memory-to-memory data transfer that was specifically designed to be as efficient as possible and removes that processing load and latency. It’s a completely different process compared to the IP protocol stack.

  • ldavis02

    Greg, the presentation you cite seems to be an exploration of options, and reaches a general conclusion that
    “Some sort of hypervisor-level RDMA would be highly beneficial for VMware”.

    It is not necessarily an indication of support for RoCEE as the direction to go.
    Also, the latency figures for RoCEE, while impressive, are still much higher than many HPC users would find acceptable.

    A summary of the options:
    Options A-C: Not viable in the long run.

    Option D: SoftRoCE / Paravirtual vNIC / 10GbE uplink. Tested. Latency is much lower than conventional Ethernet/TCP/IP, but still much higher than native IB. For ‘general purpose High Performance Computing’, the latencies may be acceptable (or they may not be).

    Option E: VM-VM SoftRoCE / Paravirtual vNIC. Tested. Latency is somewhat lower than conventional Ethernet/TCP/IP, but higher than Option D.

    Option F: Paravirtual RDMA HCA (vRDMA) offered to VM
    Not tested (doesn’t exist yet).  This is considered by the presenter to be the desired option.  Runs on an IB HCA (not on an Ethernet NIC).

    IB as a replacement for Fibre Channel – From an IB vendor’s perspective, this would be useful, as the alternative is to use an IB-to-FC translational gateway. The gateway approach doesn’t work well, and volume economies can never be achieved (how many gateways will the average customer buy, compared to the number of HCAs and switches they purchase?) Using RDMA over IB would mean that there would be less of a need to go ‘off fabric’ for storage.

    Using IB to supply RDMA instead of CEE would assume that SSDs were used for data storage, and that the primary use would be to support vMotion. It would also assume that volume pricing for HCAs would be competitive with NICs, and that server vendors would be willing to integrate an IB/RDMA chipset on to their system boards (‘LOM’).