Howard Marks from Deep Storage, and long-term curmudgeon, sent Ethan and me the following email:
As I continue to tilt at the VMware windmill I’m facing fanbois telling me that all you have to do is plug the EVO:RAIL in and turn it on. This of course leaves out the fact that the little sucker still needs to be connected to the network.
VSAN also uses L2 multicast to communicate between nodes. Back in the dark ages when I was a network guy multicast was an iffy thing.
VMware VSAN uses L2 Multicast to provide synchronisation and signalling between each instance. In specific terms this is a clever use of Ethernet Multicast to reduce network traffic. In reality, it will commonly cause problems.
L2 Multicast Is Simple To Configure
For most network engineers, L2 Multicast is simple enough. Enable IGMP Snooping and walk away. Ethan replied with the following:
More capable switches support IGMP snooping, meaning that the switch will observe IGMP messages and only forward multicast traffic for a particular group over ports where IGMP messages were seen from a host requesting to participate in that group. It is not necessary to configure IGMP snooping, but it does ensure that ports that don’t need to see the multicast traffic don’t see it. In my experience, some switches have IGMP snooping enabled by default. Some do not. Unless multicast traffic streams take up significant bandwidth, it’s not a big concern.
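For illustration, here is what one of those IGMP messages looks like on the wire: a minimal Python sketch that builds an IGMPv2 Membership Report, which is the packet a snooping switch watches for. The group address 239.1.2.3 is an arbitrary example, not anything VSAN-specific.

```python
import socket
import struct

def inet_checksum(data: bytes) -> int:
    """Standard Internet ones'-complement checksum over 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def igmpv2_report(group: str) -> bytes:
    """Build an IGMPv2 Membership Report (type 0x16) for the given group."""
    gaddr = socket.inet_aton(group)
    # Fields: type, max response time (0 for reports), checksum, group address.
    msg = struct.pack("!BBH4s", 0x16, 0, 0, gaddr)        # checksum placeholder
    return struct.pack("!BBH4s", 0x16, 0, inet_checksum(msg), gaddr)

pkt = igmpv2_report("239.1.2.3")
# A correctly checksummed IGMP message verifies to zero.
assert inet_checksum(pkt) == 0
print(pkt.hex())   # 8-byte IGMPv2 report
```

The checksum-verifies-to-zero property is also a quick sanity check when you are staring at these packets in a capture.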
Ethan is, of course, perfectly correct, but my experience of L2 Multicast is far more painful and brutal.
Packet Replication in Switches
Any type of Multicast (L2 or L3) requires that the internal silicon of the switch duplicate packets. Importantly, the switch must be able to perform this function at very low latency and at high volume in real time. I’m talking about nanosecond intervals and tight clocking on the internal architecture of the switch hardware.
The following diagram shows the four key steps inside the switch itself.
Why Multicast Is Unpredictably Untrustworthy
In cheap switches, the packet duplication function may be performed in software on the management CPU, which results in high latency and packet loss. In some Ethernet switches, the packet duplication function might have a limited capacity of tens or hundreds of megabits per second.
The best switches are those that implement packet duplication in the crossbar switching silicon and can perform duplication and switching of Multicast at wire speed. Very few switches are able to do this at all, and they all cost very serious money.
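To make the replication step concrete, here is a toy model in Python — my own illustration, not any vendor’s implementation — of the forwarding decision: the switch looks up the group in a table built by IGMP snooping and copies the frame to every member port except the one it arrived on.

```python
from dataclasses import dataclass, field

@dataclass
class SnoopingSwitch:
    """Toy model of the multicast forwarding table an IGMP-snooping switch builds."""
    all_ports: set = field(default_factory=set)
    groups: dict = field(default_factory=dict)   # group address -> member ports

    def igmp_report(self, group: str, port: int) -> None:
        # An IGMP Membership Report seen on a port adds it to the group.
        self.groups.setdefault(group, set()).add(port)

    def forward(self, group: str, ingress: int) -> set:
        members = self.groups.get(group)
        if members is None:
            # No snooping state for this group: flood to all other ports.
            return self.all_ports - {ingress}
        # Replicate one copy per member port, excluding the ingress port.
        return members - {ingress}

sw = SnoopingSwitch(all_ports={1, 2, 3, 4})
sw.igmp_report("239.1.2.3", 2)
sw.igmp_report("239.1.2.3", 4)
print(sw.forward("239.1.2.3", 2))   # {4} - only the other member port
print(sw.forward("239.9.9.9", 1))   # {2, 3, 4} - unknown group floods
```

The hard part, of course, is not the lookup — it is doing that per-port copy in silicon at wire speed, which is exactly where cheap hardware falls over.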
My general assumption is that any form of Multicast is always unreliable. Vendors’ implementations are universally poor quality and testing is almost non-existent, since very few customers actually use Multicast in any form. The ONLY way to be sure is to conduct your own testing (which costs about ten to twenty times what the switch itself costs).
The Internet Group Management Protocol (IGMP) also has limitations on its performance, but for a VSAN setup you are unlikely to have problems.
Can I See/Troubleshoot/Detect The Problem?
You could try capturing packets on all of the ports and detecting which ones are being lost under what conditions. You will need network taps, concurrent packet captures on multiple ports and some method of comparing capture files (don’t call me though, my life is too short for that sort of pointless behaviour). Switches simply do not have features that will show packet loss in the crossbar / switching fabric.
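If you do go down that road, the comparison step can at least be automated. A hypothetical sketch, assuming you have already exported per-port sequence numbers from your capture files (the port names and sequence data here are invented for illustration):

```python
def find_losses(sent_seqs, received_by_port):
    """For each capture point, report sequence numbers the sender emitted
    but that never appeared in that port's capture."""
    sent = set(sent_seqs)
    return {port: sorted(sent - set(seqs))
            for port, seqs in received_by_port.items()}

sent = range(1, 11)                               # sender emitted seq 1..10
captures = {
    "eth2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],      # clean delivery
    "eth3": [1, 2, 3, 5, 6, 8, 9, 10],            # switch dropped 4 and 7
}
print(find_losses(sent, captures))
# {'eth2': [], 'eth3': [4, 7]}
```

A port that shows losses the others do not is your smoking gun for replication failure inside the switch — which is about the only visibility you will get, given the fabric itself reports nothing.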
Don’t Forget Your NIC Driver
I often find that NIC drivers implement multicast badly. I’m sure that VMware think they have tested the ESX drivers, but I would still be dubious. Without “user feedback” (aka victims who found the bugs) it is unlikely that test coverage of multicast is comprehensive. Again, finding vendor bugs is not life enhancing, and I’m still waiting for a discount when I find them. After all, they saved money by not testing properly.
So Your VSAN Is Having Problems
Here are some signs that the network might be the cause of your VMware VSAN pain:
- Switches more than five years old are almost certain to be a problem.
- Cheap or low-cost switches may use the CPU instead of ASICs for IGMP and packet duplication. These will drop packets at moderate transfer rates and could cause VSAN corruption.
- You purchased a really expensive switch from a well-known vendor who promised that their L2 Multicast is “real good”. Expect to spend many weeks logging bug reports, because developer testing coverage is low or non-existent on features that customers do not use. You can be sure that bugs are delivered directly from the developer to you.
- Blame VMware for making a bad design decision to use Ethernet Multicast as a communication protocol. You would think they would have learned from the MAC-in-MAC foolishness of vCloud 1.5 ….. but apparently not. Raise a bug report that Ethernet Multicast is highly unreliable.
A modern switch is designed to route and switch IP packets at high speed. It makes much more sense to use a full mesh of IP sessions that are tolerant of loss and jitter. Importantly, this operational mode is widely tested and validated by vendors and in customer use.
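A sketch of what that looks like from the sender’s side — assuming a known peer list, and nothing to do with VSAN’s actual protocol — is simply one ordinary unicast datagram per peer, leaving the switch to do the unicast forwarding it is actually good at:

```python
import socket

def unicast_fanout(payload: bytes, peers) -> None:
    """Send the same payload to each peer as an ordinary unicast datagram,
    instead of relying on the switch to replicate one multicast frame."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for peer in peers:
            sock.sendto(payload, peer)   # one unicast send per peer
    finally:
        sock.close()

# Demo against two listeners on loopback (ports chosen arbitrarily).
listeners = []
for port in (15001, 15002):
    r = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    r.bind(("127.0.0.1", port))
    r.settimeout(2)
    listeners.append(r)

unicast_fanout(b"heartbeat", [("127.0.0.1", 15001), ("127.0.0.1", 15002)])
for r in listeners:
    data, _ = r.recvfrom(1024)
    assert data == b"heartbeat"
    r.close()
```

Yes, the sender transmits N copies instead of one — but that cost is predictable, and every hop handles it with the same well-tested unicast path that carries all the other traffic.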
VMware VSAN should use a reliable and simple networking design instead of choosing too-clever solutions that are unreliable. You would think VMware would learn from previous failures ….. but I was hoping for too much.
Related: My Book — White Box Networking in 2014
I have been writing a book that discusses the internal architecture of Ethernet switches. I’m still finishing the book, but you can buy a DRM-free copy today and receive regular updates as I publish them. Click on the widget:
Here are my final words to Howard on the challenge that Storage faces on Ethernet networks:
An Ethernet and IP network always assumes that packet loss is normal. Whether from oversubscription, contention, output buffer overruns or reconvergence – unless the network is carefully and explicitly designed from the hardware upwards, loss will happen. For example, Cisco FCoE works when every device is a director-class hardware product configured to specified guidelines, with storage traffic strictly controlled and managed like an FC network.
So no, VSAN will never be reliable while storage is based on SCSI, which does not understand lossy transport. What you need is a filesystem based on object storage that assumes variable loss, latency and outages in normal operations.
FibreChannel networking did a great but costly job of hiding this from storage professionals. Time to wake up and face reality.