◎ Introduction to How Overlay Networking and Tunnel Fabrics Work

In this blog post I’ll attempt to summarise Overlay Networking in a few paragraphs, as a reference for upcoming blog posts that discuss the nature of Tunnel Fabrics in physical network environments.

This article assumes that you have some exposure to networking, including topics like VXLAN, Ethernet Fabrics, Leaf/Spine (CLOS) and a number of technologies that are still early in market adoption but well accepted as the long-term future of networking. Your mileage may, of course, vary.

Hypervisors and Network Connections

Consider a number of hypervisors connected to the network1 as shown in the diagram here:

Hypervisors Connecting to the Physical Network

In current network best practice, the physical network sets the trust boundary for packets at the Top of Rack switch. Networks do not trust servers as a canonical source because the server is outside the administrative control of the network team.

Technologies such as QoS marking, traffic shaping, MPLS, traffic monitoring and NetFlow/sFlow are commonly deployed at the edge of the network. To some extent this reduces the complexity of the network by distributing the load and configuration into smaller chunks, thus reducing the service impact of poor code or overload [failure]2.

The virtual switch in today’s hypervisor is not a switch, or a network device of any sort. It’s more like a software-controlled patch panel that connects the network adapters in the virtual server (vNICs) to the physical NIC for network connectivity. These types of virtual switches are passive devices at best.
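As a rough sketch, that patch-panel behaviour amounts to little more than a lookup table. The class and names below (`PatchPanelVSwitch`, `eth0` and so on) are purely illustrative, not any real hypervisor API:

```python
# A minimal sketch of a "patch panel" vSwitch: it holds a static mapping
# from virtual NICs to the physical NIC and blindly passes frames along.

class PatchPanelVSwitch:
    def __init__(self, physical_nic):
        self.physical_nic = physical_nic
        self.ports = {}  # vNIC name -> the physical NIC it is patched to

    def attach(self, vnic):
        # "Patching" a vNIC is just recording the association.
        self.ports[vnic] = self.physical_nic

    def forward(self, vnic, frame):
        # No MAC learning, no QoS, no routing: the frame is simply
        # handed to whatever the vNIC is patched to.
        if vnic not in self.ports:
            raise KeyError(f"{vnic} is not attached")
        return (self.ports[vnic], frame)

vswitch = PatchPanelVSwitch("eth0")
vswitch.attach("vm1-vnic0")
print(vswitch.forward("vm1-vnic0", b"frame-bytes"))  # ('eth0', b'frame-bytes')
```

The point of the sketch is what is missing: no forwarding intelligence of any kind lives in this "switch".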

Upgrade the vSwitch to a Network Device

Let’s change the function of the “virtual switch/patch panel” to look more like a network device. Instead of simply connecting two software end points inside the hypervisor memory, let’s take the vSwitch and make it a complete network device with routing, switching, QoS, flow management and a full configuration interface.

At this point, a network agent is not very useful. There are only a couple of physical connection points to the network via the physical network adapters, so routing or switching has limited value. What we need is something more to make use of this concept. The solution is quite straightforward.

Consider the business impact here. The server is now part of the network. Network teams will have permission to enter the server, take ownership of the network connectivity and gain better end-to-end control. And the security team will support this move to ensure the integrity of the network edge. You can expect the virtualization and server teams to resist this change in practice.

Network Limitations

Sending packets & frames into the physical network for forwarding is current best practice. Anyone who works on a Data Centre network can easily identify the serious problems & limitations of current technology.

Spanning Tree remains a risky technology, subject to unexpected bridge loops that can cause the loss of entire Data Centres.

Traffic isolation and multi-tenant security is possible only in limited ways. Software virtualisation in core switches can create a few isolated “device instances”, but for a network with hundreds of tenants there are no answers. MPLS is expensive to implement in hardware and complex to administer. MPLS remains a niche technology for certain markets, like Service Providers with extensive human infrastructure resources, but seems to have little relevance & very low adoption in the data centre.

IP Routing protocols are eventually consistent and indeterminate. They require enormous investments in engineering resources to accurately predict the behaviour of protocols like OSPF & BGP. Control and configuration of these protocols is often limited to physical control of [network paths]3.

Most network engineers don’t regard these networking technologies as poor technology or even failures, but I certainly do. Network pathing is an unreliable process whose designs make basic assumptions of eventual consistency, automatic discovery, no end-to-end validation, limited loop prevention and the ability to calculate just a single best path through a given network. In a data centre network, none of those assumptions are useful or relevant. The data centre is a tightly bound problem space where all conditions are tightly controlled and restrictions are possible within the walls of the physical location. Compare this with the design assumptions of a Wide Area Network, where there are few options for control in a highly distributed network.

The Data Centre network is, by its very nature, a very different problem space compared to a WAN or Campus network.

Network Agent with Tunnels is an Overlay

The Network Agent in the hypervisor is now able to act as a full network device, but it remains connected to a physical network that is change resistant and, through the use of distributed networking protocols, a single shared failure domain. Instead, we can connect the Network Agents with tunnels using protocols like VXLAN, NVGRE or NVO3.

These LAN tunnel protocols are specifically designed to work well in a data centre network, unlike the IPinIP or GRE protocols. For example, VXLAN derives the outer UDP source port from a hash of the inner frame, giving the outer header enough entropy to effectively load balance over an LACP bundle between two switches. Other protocols have their respective features.
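A hedged sketch of how that entropy works: RFC 7348 recommends deriving the outer UDP source port from a hash of the inner frame’s headers, so that underlay switches hashing only on the outer 5-tuple still spread distinct inner flows across LACP/ECMP members. The hash function and field choices below are illustrative, not a real VTEP implementation:

```python
import zlib

def vxlan_source_port(inner_src_mac, inner_dst_mac, inner_ethertype):
    # Hash the inner frame's headers and fold the result into the
    # dynamic/private port range 49152-65535 (16384 ports).
    h = zlib.crc32(inner_src_mac + inner_dst_mac + inner_ethertype)
    return 49152 + (h % 16384)

# Two different inner flows between the same pair of VTEPs will typically
# get different outer source ports, so a LAG/ECMP hash on the outer
# headers separates them even though the outer IPs are identical.
p1 = vxlan_source_port(b"\x00\x00\x00\x00\x00\x01",
                       b"\x00\x00\x00\x00\x00\x02", b"\x08\x00")
p2 = vxlan_source_port(b"\x00\x00\x00\x00\x00\x03",
                       b"\x00\x00\x00\x00\x00\x04", b"\x08\x00")
print(p1, p2)
```

The port is deterministic per flow, which keeps packets of one flow on one link and avoids reordering.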

The starting vision of the overlay network looks like the following:

Connect vSwitches with Tunnel Protocols

For example, we could emulate a VLAN by forwarding traffic through tunnels associated with Virtual Machines (VMs), as in this diagram:

VLAN Emulation in an Overlay Network
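The emulation amounts to a membership table: VMs that share a virtual network ID (a VNI, in VXLAN terms) are joined by tunnels, and the agent forwards a frame into the tunnel whose remote end hosts the destination VM. All names and IDs in this sketch are invented for illustration:

```python
# Which VMs belong to which virtual network, and which hypervisor
# (tunnel endpoint) hosts each one. Invented example data.
vni_membership = {
    5001: {"vm-a": "hypervisor-1", "vm-b": "hypervisor-2", "vm-c": "hypervisor-3"},
}

def forward_in_virtual_l2(vni, dst_vm):
    members = vni_membership[vni]
    if dst_vm not in members:
        # Outside its virtual network a VM simply does not exist,
        # which is where the tenant isolation comes from.
        raise LookupError(f"{dst_vm} is not in VNI {vni}")
    # The "link" to the destination is the tunnel to its hypervisor.
    return f"tunnel-to-{members[dst_vm]}"

print(forward_in_virtual_l2(5001, "vm-b"))  # tunnel-to-hypervisor-2
```

Note that nothing in the physical network needs to know about VNI 5001; the table lives entirely in the agents.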

And the switching network path between two VMs looks something like this:

Switch Path Through Overlay Network

And a routing network path is equally simple: the Network Agent selects the tunnel that is the best path to the destination and forwards into it. Yes, the Network Agent is performing routing, just like a physical router.

Routing Path Through an Overlay Network
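That routing behaviour can be sketched as an ordinary longest-prefix match, except that the chosen next hop is a tunnel rather than a physical interface. The prefixes and tunnel names below are invented for illustration:

```python
import ipaddress

# Routing table in the Network Agent: prefix -> tunnel (invented data).
routes = {
    ipaddress.ip_network("10.1.0.0/16"): "tunnel-to-hypervisor-2",
    ipaddress.ip_network("10.1.5.0/24"): "tunnel-to-hypervisor-3",
    ipaddress.ip_network("0.0.0.0/0"):  "tunnel-to-gateway",
}

def route_lookup(dst_ip):
    dst = ipaddress.ip_address(dst_ip)
    # Longest-prefix match: the most specific covering route wins,
    # exactly as in any physical router.
    matches = [net for net in routes if dst in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]

print(route_lookup("10.1.5.9"))   # tunnel-to-hypervisor-3
print(route_lookup("10.1.9.9"))   # tunnel-to-hypervisor-2
print(route_lookup("192.0.2.1"))  # tunnel-to-gateway
```

The only difference from a physical router is the meaning of the next hop, which is why the agent can route without touching the underlay.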

Abstraction from the Physical Network Allows Change

The collection of tunnel circuits between the Network Agents is sometimes called a “Tunnel Fabric”. The tunnel protocols may, or may not, have an awareness of the physical network, depending on the progression of technology. At the time of writing, it’s not clear whether the Tunnel Fabric should be integrated with the physical network devices so that the tunnel protocols are aware of the status of the underlay network.

But the most interesting feature is that the Overlay Network we have built is fully abstracted from the physical network. That is, the network agents can modify the configuration of the tunnels without any impact on the physical network, and without any interaction from the physical network.

The Value of Software

One key factor is that the Network Agent is software with no dependencies on hardware. In physical network devices, the software in the control & management plane is often limited by the silicon. By limited, I mean that the device operating system is deployed to perhaps a million or so devices at most, and software features are determined by the silicon in the switch.

Note: This especially applies to hardware switching but some routers are more flexible because their architectures rely on software for many features. For example, a Cisco Nexus 5500 is rigidly limited by its hardware design because the silicon was never designed to support routing, only switching.

By comparison, a Cisco ASR1000 has a completely different silicon architecture that is designed to be somewhat flexible. This isn’t a criticism. For these devices to perform packet processing at tens of gigabits per second with multiple packet streams, HQoS and so on requires custom silicon.

A Network Agent does not need to handle that volume because the processing is distributed across every server. Instead of one pair of HA devices, we can scale horizontally, since each additional server adds more forwarding capacity to the network.

Using software on the x86 platform is significantly more flexible and reliable by comparison. The x86 architecture is well understood, and programming languages like C or Java have excellent tool chains, unit testing and large pools of programmer expertise.

Performance Causes Problems

The metaphor I often use is that today’s physical network devices are like Formula 1 racing cars: vehicles that go fast & furious but require an expensive, specialist team of resources to keep them running and to recover from the repeated crashes.

Putting networking into a hypervisor on x86 is the equivalent of using a family sedan for transport: cheap, simple, easy to service and available everywhere. With enough family sedans you can get a lot more done than with an F1 car, and at an acceptable price. Very, very few people actually need a car that can perform above the speed limit. Let’s face it, networking vendors like to sell “F1 performance” with reassuringly expensive pricing, but the vast majority of customers only need family sedans.

I say this because many people believe that network devices must be custom silicon hardware. I think that this is no longer true. Intel has demonstrated that current generations of x86 hardware & software are capable of delivering forwarding performance of at least 20 Gbps, and the next generation will deliver more than 40 Gbps – effectively line rate performance for a server at load. You do not need a hardware network device for everything (only some things).

Supporting Safe and Rapid Change

The final point about software-based network devices is the speed of configuration change. The use of standard & common software on standard & common hardware creates the opportunity for rapid development of new features. Provided that each network agent runs as a standalone element, it’s possible to rapidly change the software. Today, by contrast, each network device is part of a single coherent system by virtue of shared routing and switching protocols.

Autonomous protocols also create shared failure domains where a single cause can have system-wide effects.

Consider that the OSPF routing protocol is a single failure domain, since every IP router shares the same routing state. An OSPF failure can (and does) cause a system-wide outage.

When Network Agents are combined with Controller Based Networking, we have a significant change in the underlying nature of networks, and the use of Controllers would appear to be the key to the success of overlay networking. But perhaps more on this in later posts.


  1. Doesn’t matter which hypervisors, VMware, KVM, Xen, …… whatever.  ↩
  2. Some examples of overload failures of switches occur when the TCAM or BCAM runs out of space, or CPU/Memory is exhausted or the internal bus cannot handle traffic patterns.  ↩
  3. Mandating that devices are connected according to strictly defined plan is a workaround not a solution. Deviation from plan is likely to result in network failure or sub-optimal outcomes.  ↩

Other Posts in A Series On The Same Topic

  1. Blessay: Overlay Networking, BFD And Integration with Physical Network (25th April 2014)
  2. ◎ Blessay: Overlay Networking Simplicity is Abstraction, Coupling and Integration (10th December 2013)
  3. Integrating Overlay Networking and the Physical Network (21st June 2013)
  4. ◎ Introduction to How Overlay Networking and Tunnel Fabrics Work (10th June 2013)
  5. ◎ Overlay Networking is More and Better while Ditching the Toxic Sludge. (7th June 2013)
  • http://www.cplane.net/ Harry Quackenboss


    Nicely done.


  • Kenn

    Great post, but what does this mean: “Consider that OSPF routing protocol is a single failure domain since every IP router.”?

    • http://etherealmind.com Greg Ferro

      What it is says. One mistake configuring OSPF and you can take down an entire data centre. Feature or bug ?

  • http://blog.packetqueue.net Teren (@SomeClown)

    Greg–I like where you’re going with this, but have a couple of questions:

    (1) Won’t you still have a n(n – 1) / 2 problem for the tunnels? And if so, how do you deal with this complexity?
    (2) Won’t the tunnel overlay mechanism (controller, whatever) be a single failure domain as well? At least assuming that you limit scale problems by not keeping full mesh (point #1) tunnels and allowing dynamic state changes.
(3) How much traffic in this case is still in an isolated failure domain since, ultimately, all useful datacenter traffic will eventually leave the datacenter. In other words, you still rely on all of the old-school problems you describe above to get traffic in and out to end users. The datacenter may not fail now, but you’ve not solved any of the reliability issues “per se” from the customer-facing point of view… at least if, in your example above, OSPF craps the bed.

    I look forward to your comments and additional pieces in this series. Great writing as always, and thanks for the thoughts!


    • http://etherealmind.com Greg Ferro

      (1) Won’t you still have a n(n – 1) / 2 problem for the tunnels? And if so, how do you deal with this complexity?

      First. The distributed edge doesn’t need to hold state for every end point like a broadcast system such as Ethernet. Plus memory is cheap in an Intel server.

Second: Complexity is handled by the controller, and it does it well. Today’s network complexity is often caused by self-configuring, self-discovering protocols that have zero awareness of the end point. Finally, the problem is bound by the walls of the data centre (today). The problem space is entirely different for other networking problems.

      (2) Won’t the tunnel overlay mechanism (controller, whatever) be a single failure domain as well? At least assuming that you limit scale problems by not keeping full mesh (point #1) tunnels and allowing dynamic state changes.

Not really. There are active/standby or active/active controller solutions available. VMware Nicira scales to 5 in a cluster. The Overlay Controller doesn’t have to update a lot of information like a physical switch because it knows all of the endpoints in its control. It’s up to you to select a design method that controls for such variables, similar to what you do today for the Underlay Network.

      (3) How much traffic in this case is still in an isolated failure domain since, ultimately, all useful datacenter traffic will eventually leave the datacenter.

None. I don’t really understand your question, because the failure domain is quite limited. The failure of a server agent impacts only one server. A controller failure can be handled with normal methods of A/S or A/A etc. If the physical network fails then the existing protocols like STP/OSPF/TRILL will re-converge eventually, but that is unlikely to impact the tunnel state.

      Hope that helps.

      • http://blog.packetqueue.net Teren (@SomeClown)

        Thanks for the replies, Greg!

        On point #3 above, I was thinking more that the datacenter serves content to someone outside of the datacenter, ultimately. So we have all of this redundancy (in a good way) going on in the data center, but ultimately you still end up “out there” in the wild-west that is the Internet at large, and to get from the datacenter to there you still traverse single-point failure domains (OSPF, BGP) whatever that are built in the old way.

        I guess my point is that for all the fail-over in the datacenter, if the IP link upstream craps the bed, I still can’t post that funny cat picture to the server for my mom to see.

Now, the caveat here I should point out is that I work in the enterprise–specifically a manufacturing environment with campuses all over the world, but no true “datacenters” like Amazon, Facebook, Google, whomever. My problems are definitely different. So I study a lot of this by stalking smart folks out and about. :)


        • http://etherealmind.com Greg Ferro

          There are answers for this. For example, Nuage Networks and Contrail are using MPLSoGRE to traverse or integrate to the existing MPLS WAN network. Or you can simply bridge out from the overlay to a physical edge.

          In the Enterprise, Overlays will be a driver for orchestration & automation so as to avoid the change risk of STP, OSPF & BGP. Although MLAG manages this risk it doesn’t remove it.

          Soon after the security people will start to understand the level of separation that can be achieved and new security architectures will change how Data Centres are consumed.