A lot of people regard Cisco’s VLAN Trunking Protocol (VTP) as nothing but trouble. Frankly, it’s hard to find many people who will implement it on their network, and most people have war stories about full site outages caused by VTP and switch installs.
I find this baffling. It’s a great technology that dramatically reduces configuration time and errors and improves troubleshooting, features that we should all embrace and use wherever we can. In this post, I want to suggest a different design method for effectively using VTP in your network.
Note: I assume that you understand, more or less, what VTP does and how it works.
Fate Sharing is an engineering term that describes the failure of interconnected systems. That is, a series of connected elements all depend on each other to operate correctly, so the failure of any single element causes the entire system to fail. For example, a car has four tyres. The failure of just one tyre (a flat tyre, or a blowout) causes the entire car to “share the fate” of the tyre and enter failure mode.
Failure Domains are similar to, but different from, fate sharing. The domain is a bounded system beyond which the failure has no impact on services. Thus when a car tyre fails on the freeway, not only does the car fail, but the failure domain may include other vehicles on the highway that might crash into the poorly controlled car with one flat tyre. The highway isn’t going to “Fate Share” the car failure, but other cars might. In this example, the failure domain is somewhat dynamic: it depends on the state of the car at the time of failure (on the highway or parked in the driveway), so the failure domain can vary. Fate sharing usually does not.
In networking, the most common failure domain is that encompassed by a spanning tree of switched infrastructure. The failure of the spanning tree protocol causes the switching network to fail, but doesn’t directly affect servers or other systems. A domain is sometimes referred to as a “bounded system” for the technically minded.
VLAN Trunking Protocol (VTP) is a Layer 2 messaging protocol that manages the addition, deletion, and renaming of VLANs on a network-wide basis. You can create a VLAN on one switch and the protocol will signal to all other switches to also create that VLAN. In most circumstances, this will also result in trunk ports beginning to forward traffic for those VLANs.
This is a great feature. I can configure the VLAN on one switch and it replicates across my entire network. I can delete the VLAN on a switch and it will remove the VLAN across my entire network.
The conscientious Network Designer will realise that you have not only gained new features, but have also created a single failure domain. With a single misconfiguration, you can remove a VLAN from every switch. If that VLAN is critical, then the entire network shares the fate of the lost VLAN.
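The shared fate described here can be sketched in a few lines of Python. This is a toy model only, not how VTP is implemented: the `Switch` class, the `propagate` function, and the domain name `SITE` are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """A toy switch holding a VLAN database for one VTP domain (hypothetical)."""
    name: str
    domain: str
    vlans: set = field(default_factory=set)


def propagate(switches, domain, vlans):
    """Copy one VLAN database to every switch in the given VTP domain."""
    for sw in switches:
        if sw.domain == domain:
            sw.vlans = set(vlans)


# Hypothetical network: three switches in a single domain called "SITE".
network = [Switch(f"sw{i}", "SITE", {10, 20, 30}) for i in range(1, 4)]

# Deleting VLAN 20 on one switch is advertised to the whole domain.
propagate(network, "SITE", {10, 30})

# Every switch in the domain has now lost VLAN 20.
affected = [sw.name for sw in network if 20 not in sw.vlans]
print(affected)
```

One change, made in one place, reaches every member of the domain. That is the feature and the risk at the same time.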
There are several safety mechanisms built in to VTP to help address the challenge of a large failure domain. The first is the server/client model, which restricts the number of configuration points in the network. This acts as a form of change control, tightening effective network control and reducing risk.
The second is the VTP password, which provides further control over intentional deployment. That is, you must use a password to add a new switch to the failure domain, and that acts as a risk management feature.
Neither of these mechanisms addresses the problem of the failure domain. All switches that belong to a single failure domain are susceptible to misconfiguration of VTP, whether by user or by systemic failure.
The only technique that can reduce the failure domain is deliberate design thinking. Let’s look at that.
Making VTP Awesome
Single VTP domain = Single Failure Domain
The most common deployment model that people undertake is to make a single VTP domain and add all switches with the same configuration. It’s pretty typical and, in my opinion, generally a bad thing. It is common because it is:
- easy to configure – same configuration on every switch
- easy to understand – all VLANs everywhere
- so obvious you don’t really think about it.
So your failure domain looks something like that, and all elements will share fate if the configuration data is corrupted or misconfigured.
The problem, as many of us know, is that a single mistake can remove a VLAN from every switch in the network. Or, in the classic worst case, someone connects a switch whose VLAN database has a higher configuration revision number, and it wipes the VLANs from every switch in the network.
Now, is this problem caused by the protocol? Or is it a problem caused by the network engineer?
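The worst case turns on VTP’s configuration revision number: a switch accepts an advertised VLAN database when it carries a higher revision than its own. Here is a minimal sketch of that rule, assuming the new switch already matches the domain name and password. The `Switch` class, `receive`, and `connect` are hypothetical names, and the model only shows the new switch overwriting the others, not the reverse.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """A toy switch with a VLAN database and a configuration revision."""
    name: str
    revision: int
    vlans: set = field(default_factory=set)

    def receive(self, revision, vlans):
        """Accept an advertisement only if its revision is higher than ours."""
        if revision > self.revision:
            self.revision = revision
            self.vlans = set(vlans)
            return True
        return False


def connect(new_switch, domain):
    """Advertise the new switch's database; return names of overwritten switches."""
    return [sw.name for sw in domain
            if sw.receive(new_switch.revision, new_switch.vlans)]


# Hypothetical production domain at revision 12 with three working VLANs.
production = [Switch(f"sw{i}", 12, {10, 20, 30}) for i in range(1, 4)]

# A switch from the lab: only the default VLAN, but a much higher revision.
lab_switch = Switch("lab", 47, {1})

overwritten = connect(lab_switch, production)
print(overwritten, [sorted(sw.vlans) for sw in production])
```

In this model the production VLANs are gone from every switch the moment the lab switch is connected, which is why the revision on a switch should be checked before it joins a domain.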
I would say that the network engineer/designer has built a single Failure Domain where every element in the domain shares the same fate when a failure occurs.
Cracking the Failure Domain
A more careful design might consider the following:
- VTP is less error prone for day-to-day VLAN configuration than manually configuring tens or hundreds of switches.
- VTP is reliable, except for configuration mistakes or poor engineering control (the configuration revision problem).
- VTP lowers costs by reducing configuration time.
- Creating a single failure domain is still a bad idea.
But we can mitigate the risk of a single failure domain by making many smaller ones.
- We could make several VTP domains for functional areas in the network.
- A row of racks in the data centre.
- All switches in a given building or wing have their own VTP domain.
- All core switches to be in a separate VTP domain.
So now our failure domain is many smaller domains (for VTP anyway) and would look something like this:
In the event of a mistake in one of the domains, the impact would be limited to a smaller area of the network.
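To make the difference concrete, here is a rough sketch that counts how many switches share fate with a misconfigured one under each design. The inventories, the domain names, and the `blast_radius` helper are all invented for illustration.

```python
def blast_radius(inventory, faulty_switch):
    """Return the switches that share a VTP domain with the faulty switch."""
    domain = inventory[faulty_switch]
    return sorted(name for name, d in inventory.items() if d == domain)


# Hypothetical inventory: switch name -> VTP domain name.
# Design 1: every switch in one site-wide domain.
single = {f"sw{i}": "SITE" for i in range(1, 13)}

# Design 2: the same twelve switches split by functional area.
segmented = {}
segmented.update({f"sw{i}": "DC-ROW-A" for i in range(1, 5)})
segmented.update({f"sw{i}": "WEST-WING" for i in range(5, 9)})
segmented.update({f"sw{i}": "CORE" for i in range(9, 13)})

# The same mistake on sw5 reaches a very different number of switches.
print(len(blast_radius(single, "sw5")))
print(blast_radius(segmented, "sw5"))
```

The mistake is no less likely in the second design, but in this made-up inventory it stops at four switches instead of twelve.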
Better Fate Sharing Outcomes
I would enhance this design a bit further. I probably would not have a VTP domain on the core switches at all. In this way, a configuration error on the Core Network does not extend Fate Sharing to all other VTP domains. That is, a VLAN misconfiguration on one core switch is not likely to impact other core switches and thus cause a failure in the data centre and building wings. Likewise, a misconfiguration in the West Wing of your building doesn’t take down your entire site.
The EtherealMind View
Careful designs consider Failure Domains and Fate Sharing. You can use the same concepts for many other network technologies such as spanning tree (PVST) or L3 routing protocols (BGP backbones/OSPF edges). I believe that planning for failure is about mitigating the impact of the failure in addition to preventing the failure. By considering the consequences of a failure, it may be possible to balance benefit and risk and use that fancy technology to produce a tangible business benefit.
And for VTP? It’s not VTP that is the problem; it’s how you implement it that matters.
VTP doesn’t break networks, engineers do.