Following my article on Loop-Free Alternate Routes, Michael McNamara made a good point about some of the issues detecting failure in a Metro Ethernet network. This seems to be a commonly misunderstood problem.
Network recovery depends on the time it takes to perform three tasks consecutively:
- How long it takes to detect the failure
- How long it takes to determine the next best path
- How long it takes to update the forwarding tables
The latter two tasks depend on the routing protocol and hardware, but the biggest obstacle to network recovery is usually the time taken to actually detect the failure, whether it is due to a link or device fault.
Tuning the routing protocol adjacency timers or employing Bidirectional Forwarding Detection (BFD) may help to detect a failure more quickly, but detection still relies on the loss of several consecutive keepalives, and while those are timing out the network convergence process is stalled. Care also needs to be taken when optimising timers: a momentary connectivity issue should not trigger symptoms that are worse than the underlying problem.
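As a sketch of what BFD tuning might look like on a Cisco IOS router (interface names and timer values here are illustrative, not recommendations for your network), BFD is enabled per interface and then registered with the routing protocol:

```
interface GigabitEthernet0/1
 ! Send BFD control packets every 100 ms; declare the neighbour
 ! down after three consecutive packets are missed (roughly 300 ms)
 bfd interval 100 min_rx 100 multiplier 3
!
router ospf 1
 ! Register OSPF neighbours as BFD clients so the adjacency is
 ! torn down as soon as BFD declares the session failed
 bfd all-interfaces
```

Even at these settings, detection still waits for three missed packets, so a window of a few hundred milliseconds remains before convergence can begin.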
As an example of this, a few years back I came across an issue with an SDH circuit that was reportedly bouncing for a minute every few days. The circuit was investigated and probed at length, but no fault could be found. Obtaining copies of the router configs revealed that the carrier delay timer at one end had been reduced to zero, in the mistaken belief that this would improve convergence times. However, every time an event occurred on the SDH fibre ring it briefly dropped the circuit carrier. This triggered a network-wide re-convergence on the LAN that took nearly a minute to complete, even though SDH automatically repaired the ring within a few milliseconds and should have been transparent to the LAN. The carrier delay was simply changed back to default and no further problems were reported!
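For reference, the offending configuration would have been something along these lines (interface name made up for illustration; the exact default carrier delay varies by platform, but is commonly a couple of seconds):

```
interface POS0/0
 ! Report any carrier loss to the routing process immediately --
 ! this defeats the sub-50 ms protection switching of the SDH ring
 carrier-delay msec 0
```

Restoring the default is simply a matter of removing the command with `no carrier-delay`, which lets brief carrier transitions be absorbed before the routing protocol reacts.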
Let’s just remind ourselves what happens when two routers have a routing adjacency via a switch, as depicted below. Let’s assume there’s a cable break between Router A and Switch B.
Router A detects the failure as soon as its link goes down, so it immediately takes down the routing adjacency with Router C. However, Router C isn’t aware of the failure because its interface stays up, so it relies on the routing protocol to detect the failure and take the adjacency down. This could take tens of seconds for a typical routing protocol without tuning.
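To put a number on "without tuning": OSPF on a broadcast network defaults to a 10-second hello and 40-second dead interval. A hedged example of tightening those timers in Cisco IOS (values illustrative; both ends of the adjacency must agree or the adjacency will not form):

```
interface GigabitEthernet0/1
 ! Send hellos every second and declare the neighbour dead after
 ! three seconds, instead of the 10 s / 40 s defaults
 ip ospf hello-interval 1
 ip ospf dead-interval 3
```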
However, the next diagram shows the switch removed from the design and the router interfaces connected directly with a point-to-point link. In this case, a link or interface failure is immediately detected by both routers. For this reason it is good practice to design a network with point-to-point fibre links between network devices.
The problem with Metro Ethernet is that it behaves very much like the first example, with a switch between the two routers. Even with a point-to-point Ethernet Private Line (EPL) circuit, which you'd expect to replicate the second example, a link failure at one end probably won't cause the link at the other end to go down unless the service supports Link Loss Forwarding (tip: always ask your service provider whether they support this feature!).
Therefore, as Michael points out, with Metro Ethernet it's not possible to rely on Layer 1/2 fault detection. In this instance BFD or tuned routing protocol timers may be the only options. Object tracking, such as Cisco IOS IP SLA, may also be suitable in some scenarios.
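A rough sketch of the object-tracking approach on Cisco IOS (addresses, probe IDs and the tracked route are invented for illustration, and the syntax varies slightly between IOS versions): an IP SLA probe pings the far-end router across the Metro Ethernet service, and a static route is withdrawn when the probe fails.

```
ip sla 10
 ! Ping the far-end router across the Metro Ethernet circuit
 icmp-echo 192.0.2.1
 frequency 5
ip sla schedule 10 life forever start-time now
!
! Track the reachability result of the probe
track 10 ip sla 10 reachability
!
! Withdraw this static route when the tracked probe fails
ip route 10.0.0.0 255.255.255.0 192.0.2.1 track 10
```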
There is a nifty little Cisco feature I came across a while back called Link-State Tracking, which broadly replicates the Link Loss Forwarding feature I mentioned earlier. The switch employing the feature is configured with a group of one or more upstream ports and one or more associated downstream ports. If the links on all of the upstream ports fail due to an upstream fault, the switch propagates the link state to all of the associated downstream ports by taking those links down.
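On the Catalyst switches that support it, the configuration is along these lines (port assignments are illustrative):

```
! Define and enable link-state group 1
link state track 1
!
interface GigabitEthernet0/1
 ! Uplink towards the rest of the network
 link state group 1 upstream
!
interface GigabitEthernet0/2
 ! Port whose link state mirrors the upstream ports: taken down
 ! when all upstream ports in the group have failed
 link state group 1 downstream
```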
Although this feature seems to be aimed at letting servers fail over to an alternative NIC if there's an upstream switch failure, I can think of other uses too. For instance, I had a customer configure it to propagate the failure of a piece of dark fibre to a service provider router interface, meaning they didn't have to run a routing protocol directly with the service provider. Let me know if you've found any other uses for this feature!