In a blog post, "On optimizing traffic for network virtualization", Brad Hedlund, now working for Dell/Force10 (having recently handed in his Cisco fanboy card), attempts to make the point that virtual overlays and tunnels are not problems and that we should just get used to being shafted by dumb ideas.
Pop over and have a read. Yes, I’ll wait right here. Then pop back and read my response.
Point 1 – Lossless Networking
Brad seems to have, perhaps conveniently, forgotten that a Data Centre network should be lossless. And when using tunnels and overlays, it's highly unlikely that you will be able to build a lossless network. Traffic inside tunnels cannot easily be parsed and classified for QoS. Ethernet allows for only five useful QoS levels. For example, you might put a VXLAN tunnel in CoS 4 but FCoE in CoS 5 – this means that VXLAN will have frames discarded under congestion, probably regularly. Even a few lost frames will cause temporary performance drops of around 50%, because TCP halves its congestion window on every detected loss.
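To see why a few drops hurt that much, here's a minimal sketch of TCP's additive-increase/multiplicative-decrease behaviour (plain Python, illustrative numbers only – real TCP also has slow start, timeouts, SACK and so on):

```python
# Minimal AIMD sketch: TCP Reno-style congestion window behaviour.
# MSS and RTT values below are assumptions for illustration.

MSS = 1460          # bytes per segment (typical Ethernet MSS)
RTT = 0.001         # 1 ms round-trip time inside a data centre (assumed)

cwnd = 100.0        # congestion window, in segments
for rtt_round in range(1, 11):
    loss = rtt_round == 5              # a single drop in round 5
    if loss:
        cwnd = cwnd / 2                # multiplicative decrease: rate halves
    else:
        cwnd += 1                      # additive increase: slow linear recovery
    throughput_mbps = cwnd * MSS * 8 / RTT / 1e6
    print(f"round {rtt_round:2d}  cwnd={cwnd:6.1f}  ~{throughput_mbps:7.1f} Mbit/s"
          + ("  <- loss, rate halved" if loss else ""))
```

One discarded frame and the sender's rate is cut in half, then claws its way back one segment per round trip. Do that "probably regularly" and you have a permanently degraded tunnel.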
Unless, that is, you are planning to throttle performance at the server edge so that blocking cannot occur, or you plan to not allow over-subscription at all. Both of those ideas are ludicrous in a modern network – oversubscription is a way of life (although, for exactly this reason, it wasn't in the past).
The point isn't that tunnels are inherently bad; it's that they blind the network to content detection. The network loses visibility of the data. And tunnels can shift traffic flows without integrating with the network to adapt to that change.
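To make "loses visibility" concrete, here's a rough byte-level sketch (plain Python, all addresses, VNIs and markings invented) of what a transit switch actually gets to see once a frame is wrapped in VXLAN:

```python
import struct

def eth(payload):
    """Minimal untagged Ethernet frame (made-up MACs)."""
    return b"\x02" * 6 + b"\x04" * 6 + struct.pack("!H", 0x0800) + payload

def ipv4(dscp, proto, payload):
    """Minimal IPv4 header (no options, checksum zeroed for brevity)."""
    return struct.pack("!BBHHHBBH4s4s",
                       0x45, dscp << 2, 20 + len(payload), 0, 0,
                       64, proto, 0, b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02") + payload

def udp(sport, dport, payload):
    return struct.pack("!HHHH", sport, dport, 8 + len(payload), 0) + payload

def vxlan(vni, inner_frame):
    """8-byte VXLAN header: flags, reserved, 24-bit VNI, reserved."""
    return struct.pack("!BBHI", 0x08, 0, 0, vni << 8) + inner_frame

# The tenant VM marks its traffic EF (DSCP 46) ...
inner = eth(ipv4(46, 6, b"pretend this is a TCP segment"))
# ... but the hypervisor wraps it in a VXLAN tunnel marked best-effort (DSCP 0).
outer = eth(ipv4(0, 17, udp(49152, 4789, vxlan(5000, inner))))

def switch_dscp(frame):
    """All a transit switch can classify on: the OUTER IP header."""
    return frame[15] >> 2   # byte 15 = IP TOS byte, after the 14-byte Ethernet header

print("DSCP the network sees:", switch_dscp(outer))                  # -> 0
print("DSCP the tenant set:", switch_dscp(outer[14 + 20 + 8 + 8:]))  # -> 46, buried 50 bytes deep
```

Everything the tenant marked is 50 bytes of encapsulation away from the switch's classification engine, which sees one big UDP flow.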
Now, vendors might want to sell two or four times more hardware to solve that problem, but customers DO NOT want to operate or power that much hardware. And maybe that's Brad's viewpoint: there is a solution – build lots of isolated networks, one for each traffic type.
Point 2 – Troubleshooting
Coming back to the loss of visibility. Today, we can detect or match traffic flows in the network layer and redirect them to load balancers, IDS/IPS, or sniffers for troubleshooting. The use of VXLAN tunnels means that a second set of networking tools will be needed. Capturing packets in a VXLAN will require a device that becomes a member of the VXLAN group as a VTEP and derives a copy of every packet or, at least, of the packets that match given criteria.
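Here's a sketch of the extra decapsulation step every sniffer or matching engine suddenly needs before it can even look at the traffic it used to see natively (assuming untagged Ethernet and IPv4 throughout – real traffic is messier):

```python
import struct

VXLAN_OVERHEAD = 14 + 20 + 8 + 8   # outer Ethernet + IPv4 + UDP + VXLAN headers

def match_inner_tcp_port(frame: bytes, port: int) -> bool:
    """Sketch of what a VXLAN-aware capture tool must do: decapsulate first,
    then match on the INNER headers. Assumes untagged Ethernet, IPv4, and no
    IP options; a production tool has to handle every variation."""
    inner = frame[VXLAN_OVERHEAD:]            # strip the tunnel encapsulation
    ip_header_len = (inner[14] & 0x0F) * 4    # inner IHL, usually 20 bytes
    tcp = inner[14 + ip_header_len:]
    sport, dport = struct.unpack("!HH", tcp[:4])
    return port in (sport, dport)
```

A filter that used to be a one-line ACL now needs tunnel membership, decapsulation logic, and knowledge of the overlay's framing – for every tool in the chain.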
Most likely, this means a virtual appliance on either the source or destination host, probably operating in VMDirectPath so as not to damage VM performance too badly. At this point, there are no software tools that can do this, and it's not core business for VMware, which is busy developing Java and email tools and doesn't much care about infrastructure right now.
Loss of visibility is a serious concern when troubleshooting complicated problems, and there are very few answers for getting network visibility today. And I'm not seeing any coming in the future.
Point 3 – Indeterminate Performance
My current research on software switching suggests that forwarding performance is quite poor. A current-generation Intel server can forward about 4 gigabits per second of data across its internal bus to the network adapter, provided that the CPU is not otherwise busy. If the CPU is heavily loaded with other tasks, forwarding performance can, and will, be seriously impacted.
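Here's the back-of-the-envelope arithmetic behind that (the 1,500-byte frame size and 3 GHz core are my assumptions, the 4 Gbit/s figure is from the research above):

```python
# Per-packet CPU budget for software switching -- illustrative numbers.

link_gbps   = 4       # observed software forwarding rate
frame_bytes = 1500    # assumed average frame size
cpu_hz      = 3e9     # assumed single 3 GHz core doing the forwarding

pps = link_gbps * 1e9 / (frame_bytes * 8)   # ~333,000 packets/sec
cycles_per_packet = cpu_hz / pps            # ~9,000 cycles per packet

print(f"{pps:,.0f} pps, budget of {cycles_per_packet:,.0f} CPU cycles per packet")
print(f"If the hypervisor steals half the core: {cycles_per_packet / 2:,.0f} cycles per packet")
```

Nine thousand cycles per packet sounds generous until the VMs on the same box want those cycles too – and they will, because that's what the server was bought for.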
That lack of any guarantee is a serious problem in the large networks that cloud providers run.
Therefore, software switching isn't really working yet. I can accept that it might in the future, if CPU and bus performance continue to improve, but that is not guaranteed. Arguing that software switching could work in one particular way conveniently forgets all the other ways in which it's a problem.
The Risky Trombone
Let's clear up a misconception: the Traffic Trombone is not a major problem inside a Data Centre where the Ethernet fabric is coherent.
The Traffic Trombone is a problem between data centres. There is no way to build an Ethernet fabric that spans data centres. Yes, that's a blanket statement. If you can build such a fabric, then the data centres are not far enough apart to matter. Fifty kilometres of separation is not redundancy; it's just a nominal decrease in risk.
That said, there are data centres where too many trombones will cause problems. You can easily overrun the available bandwidth in a data centre if too many application servers are tromboning. And troubleshooting that condition is a real problem, one that has service impacts and big dollar signs attached to it when the network collapses.
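A quick worked example of what tromboning between data centres actually costs (every number below is an assumption for illustration):

```python
# Rough trombone economics for an inter-DC hairpin.

km            = 50      # distance between the data centres
us_per_km     = 5       # light in fibre travels ~200,000 km/s, so ~5 microseconds/km
flows         = 2000    # tromboning application flows (assumed)
mbps_per_flow = 10      # average rate per flow (assumed)
dci_gbps      = 10      # inter-DC link capacity (assumed)

rtt_us = 2 * km * us_per_km                        # each hairpin adds a full round trip
dci_load_gbps = flows * mbps_per_flow * 2 / 1000   # each tromboned flow crosses the DCI twice

print(f"Added latency per hairpin: {rtt_us} us round trip")
print(f"DCI load: {dci_load_gbps:.0f} Gbit/s on a {dci_gbps} Gbit/s link")
```

With those assumptions, the trombones alone want 40 Gbit/s out of a 10 Gbit/s inter-DC link – and every flow pays an extra half-millisecond round trip for the privilege. That's how the collapse happens.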
In an era of uptime and risk-free computing, this isn't a good answer. I'm pretty confident that Amazon and Google don't trombone traffic, for this very reason. Neither should you. But if you must, go right ahead and take that risk. It's your network.
The EtherealMind View
At this point in time, I believe the path to the networking future isn't going to be defined by hardware, fabrics, or debates over tunnels and overlays. Network professionals need software tools that give network visibility, reporting, and operational confidence.
I don't really care about VXLAN, or OpenSwitch, or whatever the latest "blah blah cloud" technology is this week. These are all good-enough solutions to certain problems. Frankly, VXLAN looks like DLSw for SNA to me, and I tend to think that we will regret tunnels and overlays just as we regretted DLSw in 2001, because they create state where it's not needed and where it's proven to work badly.
What I want is for vendors to deliver management tools that provide visibility and operations. And that doesn't include OpenView, or Tivoli, or BMC Patrol, or any of the tools that we have today. They are, all of them, well-proven failures. Give me that discussion in 2012. Don't patronise me with "buy more stuff".