In discussions with a stealthy networking startup today, we talked about how their overlay network technology for an SDN-enabled WAN is able to detect network blackouts and brownouts in the physical network. Their answer was to run Bidirectional Forwarding Detection (BFD) in the overlay tunnels. At each end of the overlay, the BFD agent (in both hardware and software) monitors the state of the underlying network through hello messages and is able to determine not only up/down status but also latency, jitter and loss by monitoring timestamps and counters.
You might want to reference Introduction to How Overlay Networking and Tunnel Fabrics Work, Overlay Networking is More and Better while Ditching the Toxic Sludge and even So You Don’t Like Overlay Networking before you dive in here. These are part of a six-post series on Overlay Networking.
There are two basic strategies to handle overlay/underlay service detection: out-of-band or in-band signalling.
Out of band Signalling
Out-of-band signalling implies that the external network is somehow aware of the tunnel state and is able to support some sort of guarantee about the overlay itself. A network controller that owns every part of the physical network will have the ability to program the end-to-end path. Of course, the controller will need to tightly manage all elements of the end-to-end path and have complex interactions to maintain controller integrity. Alternately, the controller could reduce its complexity by pushing the device configuration into the network device (this seems to be the purpose of promise theory as promoted by Cisco and the purpose of OpFlex).
Secondly, network devices must be able to monitor each flow, detect a performance issue and signal the controller.
The negative impact of the promise theory approach is that the device itself is more complex to manage. Managing a large volume of flows will require custom silicon and extensive new software to signal the out-of-band stats back to the controller.
This would seem more complex, less reliable and potentially more expensive overall. The proof is yet to be seen. We have been using Out of Band signalling in networking through QoS tagging for a long time but with mixed results.
In Band Signalling
This method uses the tunnel endpoints to send messages through the tunnel and look for service information. Sending a regular hello and detecting loss would signal a blackout. Put a timestamp onto the hello and match that on receipt and it’s possible for the endpoint to measure jitter and delay.
For most networks, this is the practical case since an external network is typically uncontrollable.
When I say “most” I’m pointing to the WAN, where the nature of the underlying physical network is completely unpredictable and unreliable. The use of MPLS to provision oversubscribed circuits on shared services means that most provider guarantees are observed for some portion of the time, say 95%, 97% or to a maximum of 99%. There are very few times when paying for 1:1 bandwidth allocation is possible or practical.
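To put those percentages in perspective, here is a small worked calculation of how long a circuit could sit outside its guarantee. The 30-day month and the function name are my own assumptions for illustration; real provider contracts define their measurement periods differently.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day measurement period


def hours_outside_guarantee(observed_pct, period_hours=HOURS_PER_MONTH):
    """Hours in the period during which the guarantee is not observed."""
    return period_hours * (100 - observed_pct) / 100


for pct in (95, 97, 99):
    hours = hours_outside_guarantee(pct)
    print(f"{pct}% observed -> {hours:.1f} hours per month without the guarantee")
```

Under these assumptions, a 95% guarantee leaves 36 hours a month, 97% leaves 21.6 hours and 99% still leaves 7.2 hours in which the overlay has to look after itself.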
The use of BFD is a particularly exciting approach since it already exists in many network processors and Ethernet PHY chips and could be readily implemented in software. I understand that code is already widely available. Avoiding new protocols will reduce customer resistance.
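The hello-and-timestamp idea in this section can be sketched in a few lines of Python. This is not the BFD packet format and not how the startup does it; the `Hello` record, the `tunnel_stats` function and the `detect_mult` parameter are hypothetical names, and I am assuming both endpoints have synchronised clocks so that a one-way delay is meaningful.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Hello:
    seq: int                      # sequence number stamped by the sender
    sent_at: float                # sender timestamp, in seconds
    received_at: Optional[float]  # receiver timestamp, None if never seen


def tunnel_stats(hellos, detect_mult=3):
    """Summarise one tunnel from a window of hello probes (oldest first)."""
    delays = [h.received_at - h.sent_at
              for h in hellos if h.received_at is not None]
    lost = len(hellos) - len(delays)
    # Jitter here is the mean absolute change between consecutive delays.
    if len(delays) > 1:
        jitter = mean(abs(b - a) for a, b in zip(delays, delays[1:]))
    else:
        jitter = 0.0
    # Blackout: the most recent `detect_mult` hellos all went missing.
    tail = hellos[-detect_mult:]
    blackout = (len(tail) == detect_mult
                and all(h.received_at is None for h in tail))
    return {
        "loss_pct": 100.0 * lost / len(hellos) if hellos else 0.0,
        "delay_ms": 1000.0 * mean(delays) if delays else None,
        "jitter_ms": 1000.0 * jitter,
        "blackout": blackout,
    }
```

A brownout then shows up as rising `loss_pct`, `delay_ms` or `jitter_ms` while `blackout` is still false, which is exactly the case that a simple up/down check misses.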
The Value Of Detection
On the assumption that detection shows that a specific tunnel is failing, what can be done to rectify the problem?
In a physical network today, you can only mark the frames/packets using Ethernet CoS or IP DSCP and hope that every device in the network will honour the implicit contract to prioritise according to some sort of rule base. If you are lucky, you will be able to configure every device and if you are very lucky each hardware interface will respond to the QoS configuration the way that you expect.
In a small physical network of 50 to 100 devices this is possible if somewhat impractical due to the hardware variation. QOS handling is dependent on the Ethernet PHY cha. But if you add 500 or 1000 virtual devices then this be
Alternately, you can mark packets/frames with MPLS tags and use the “circuit emulation” capability of MPLS tunnels to segregate different traffic flows into MPLS paths that have different physical properties.
The answer is to divert flows into another overlay tunnel. This works for both data centres and WANs where there are, or should be, many paths between two end points. A WAN design today does not use both WAN circuits because existing routing protocols can only select one best path from the possible paths. A network controller and flow-based network devices can choose paths for flows at a granular level.
The level of granularity will be determined by the hardware or software in the edge device (whether virtual or physical).
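As a rough sketch of what “divert flows into another overlay tunnel” could look like in a controller or edge device, the Python below picks the healthiest tunnel from per-tunnel measurements. The function name, the dictionary layout and the thresholds (1% loss, 150 ms delay) are all assumptions of mine for illustration, not anything a vendor has described.

```python
def pick_tunnel(tunnels, max_loss_pct=1.0, max_delay_ms=150.0):
    """Return the name of the best tunnel, or None if every tunnel is down.

    `tunnels` maps a tunnel name to a dict with the keys
    "blackout" (bool), "loss_pct" (float) and "delay_ms" (float).
    """
    # A blacked-out tunnel is never a candidate.
    alive = {name: s for name, s in tunnels.items() if not s["blackout"]}
    if not alive:
        return None
    # Prefer tunnels inside both thresholds (no brownout).
    healthy = {name: s for name, s in alive.items()
               if s["loss_pct"] <= max_loss_pct
               and s["delay_ms"] <= max_delay_ms}
    # If every live tunnel is browning out, take the least bad one.
    candidates = healthy or alive
    return min(candidates,
               key=lambda n: (candidates[n]["loss_pct"],
                              candidates[n]["delay_ms"]))


tunnels = {
    "mpls":     {"blackout": False, "loss_pct": 4.0,   "delay_ms": 40.0},
    "internet": {"blackout": False, "loss_pct": 0.2,   "delay_ms": 70.0},
    "lte":      {"blackout": True,  "loss_pct": 100.0, "delay_ms": 0.0},
}
print(pick_tunnel(tunnels))
```

Here the MPLS circuit is in brownout (4% loss), so flows would move to the Internet tunnel even though its delay is higher. A finer-grained device could run the same choice per flow or per application class rather than per site.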
Overlay to Underlay Visibility
How does the overlay know what paths the underlay took? One possibility is to listen to the routing protocol in the underlay, get the OSPF state database or the BGP announcements, and calculate the network layout. Then map the overlay tunnels by matching the IP addresses of the end points against the routing table data.
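The matching step can be sketched as a longest-prefix lookup of a tunnel endpoint against prefixes learned from the underlay. The prefixes, next-hop labels and function name below are made up for illustration, and a real implementation would have to build this table from the OSPF database or BGP feed first.

```python
import ipaddress


def underlay_route(endpoint, routes):
    """Longest-prefix match of a tunnel endpoint against underlay routes.

    `routes` maps a prefix string to a next-hop label.
    Returns (prefix, next_hop), or None if nothing matches.
    """
    addr = ipaddress.ip_address(endpoint)
    matches = []
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network, next_hop))
    if not matches:
        return None
    # The most specific (longest) prefix wins, as in a routing table.
    network, next_hop = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), next_hop


# Hypothetical routes learned from the underlay.
routes = {
    "0.0.0.0/0": "isp-a",
    "10.0.0.0/8": "mpls-pe1",
    "10.20.0.0/16": "mpls-pe2",
}
print(underlay_route("10.20.5.1", routes))
```

This only tells you which underlay route each tunnel endpoint resolves to; working out the full hop-by-hop path would need the topology calculation on top.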
I’ve been waiting to hear how new companies would deliver the SDN WAN. Both in-band and out-of-band signalling strategies can, of course, be made to work but whether one is better than the other is not known. In the following diagram, I show that a network flow moves through many hops but it is the end-to-end connection state that has the feedback loop between endpoints. Feedback loops are critical to the design of all complex systems and the only feedback loop in network flows is performed at the TCP layer.
The addition of SDN controllers creates a more comprehensive set of feedback capabilities, far more than has previously been possible given the distributed nature of networking. The use of BFD will provide an in-band feedback loop:
The use of in-band control methods is how I understand the value of the Cisco ACI strategy. Each device in the network has some function that is able to monitor the flow performance and report status back to the controller. This function is delivered by custom silicon inserted into the forwarding path in each and every network device. This creates a significant dependency on hardware and software in each and every network device and seems to contradict the fundamental design of networking (complex at the edge, simple at the core).
The EtherealMind View
Here are some thoughts about the technology and approaches to the problem:
- The IP protocol is designed to be complex at the edge and simple in the middle. Adding “intelligence in the network” provides, at best, a short-term gain and has always proved to be wasted over the last 20 years. The best case study here is the Internet, dumb at the core and smart at the edge.
- The TCP protocol uses in-band signalling to notify congestion and loss. The TCP algorithm is the heart of the Internet scalability (perhaps making BGP the brain) and SDN doesn’t change this, nor can it. It is my understanding that in-band signalling scales better than out-of-band at a fundamental level.
- Simpler network devices will lower cost and complexity. I can see that “fat and feature rich” network devices (promoted by many network vendors) will be a good fit for many customers but I remain .
- In the event of path loss or degradation, there is a clear line of action for fault resolution from the controller. I prefer this directness or imperative style of operation.
I’m genuinely excited to hear about the SDN WAN over the next 3 months.
As I said earlier, this post is one in a series of posts on Overlay / Underlay networking. Click here to read the others and feel free to engage in the comments.
Other Posts in This Series
- Blessay: Overlay Networking, BFD And Integration with Physical Network (25th April 2014)
- Blessay: Overlay Networking Simplicity is Abstraction, Coupling and Integration (10th December 2013)
- Integrating Overlay Networking and the Physical Network (21st June 2013)
- Introduction to How Overlay Networking and Tunnel Fabrics Work (10th June 2013)
- Overlay Networking is More and Better while Ditching the Toxic Sludge. (7th June 2013)