Fate Sharing, Failure Domains and Why VTP Is Awesome

A lot of people regard Cisco’s Virtual Trunking Protocol(VTP) as nothing but trouble. Frankly it’s hard to find many people who will implement it on their network and most people have war stories about full site outages caused by VTP and switch installs.

I find this baffling. It’s a great technology that dramatically reduces configuration time and configuration errors, and improves troubleshooting: features that we should all embrace and use wherever we can. In this post, I want to suggest a different design method for effectively using VTP in your network.

Note: I assume that you understand, more or less, what VTP does and how it works.

Fate Sharing

Fate Sharing is an engineering term that describes the failure of interconnected systems: a series of connected elements in which each depends on all the others to operate correctly, so that the failure of any single element causes the entire system to fail. For example, a car has four tyres. The failure of just one tyre (a flat tyre, or a blowout) causes the entire car to “share the fate” of the tyre and enter failure mode.

Failure Domains

Failure Domains are similar to, but different from, fate sharing. The domain is a bounded system beyond which the failure has no impact on services. Thus when a car tyre fails on the freeway, not only does the car fail, but the failure domain may include other vehicles on the highway that might crash into the poorly controlled car with one flat tyre. The highway isn’t going to “Fate Share” the car failure but other cars might. In this example, the failure domain is somewhat dynamic: it depends on the state of the car at the time of failure (on the highway or parked in the driveway). Fate sharing, by comparison, usually does not vary.

In networking, the most common failure domain is that encompassed by a spanning tree of switched infrastructure. The failure of the spanning tree protocol causes the switching network to fail, but doesn’t directly affect servers or other systems. A domain is sometimes referred to as a “bounded system” for the technically minded.

VTP

VLAN Trunking Protocol (VTP) is a Layer 2 messaging protocol that manages the addition, deletion, and renaming of VLANs on a network-wide basis. You can create a VLAN on one switch and the protocol will signal to all other switches to also create that VLAN. In most circumstances, this will also result in trunk ports beginning to forward traffic for those VLANs.

This is a great feature. I can configure the VLAN on one switch and it replicates across my entire network. I can delete the VLAN on a switch and it will remove the VLAN across my entire network.

The conscientious Network Designer will realise that you have not only new features, but you have created a single failure domain. With a single misconfiguration, you can remove a VLAN from every switch. If that VLAN is critical – then the entire network shares the fate of the lost VLAN.
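
To make the shared fate concrete, here is a minimal sketch in Python. It is a toy model and not how VTP is implemented: the class name `VtpDomain`, its methods and the switch names are all invented for illustration, and it assumes every member of the domain simply mirrors each change.

```python
class VtpDomain:
    """Toy model of one VTP domain: every member mirrors the same VLAN set."""

    def __init__(self, switch_names):
        self.revision = 0  # stands in for the configuration revision number
        self.vlans = {name: set() for name in switch_names}

    def _apply(self, change):
        # Each change bumps the revision and is replayed on every member switch.
        self.revision += 1
        for vlan_set in self.vlans.values():
            change(vlan_set)

    def create(self, vlan_id):
        self._apply(lambda vlan_set: vlan_set.add(vlan_id))

    def delete(self, vlan_id):
        self._apply(lambda vlan_set: vlan_set.discard(vlan_id))


# Hypothetical three-switch domain.
domain = VtpDomain(["core1", "access1", "access2"])
domain.create(10)
domain.create(20)
domain.delete(10)  # one command, and VLAN 10 is gone from every switch
print(domain.revision, domain.vlans)
```

The convenience and the risk are the same line of code: the loop in `_apply` is what saves the typing, and it is also what makes one mistake everyone’s problem.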

There are several safety mechanisms built in to VTP to help address the challenge of a large failure domain. The first is the server/client model, which restricts the number of configuration points in the network. This acts as a form of configuration control, tightening effective network control and reducing risk.

The second is the use of a VTP password to provide further control for intentional deployment. That is, you must use a password to add a new switch to the failure domain, and that acts as a risk management feature.

Neither of these mechanisms addresses the problem of the failure domain. All switches that belong to a single failure domain are susceptible to misconfiguration of VTP, whether by user error or by systemic failure.

The only technique that can reduce the failure domain is deliberate design thinking. Let’s look at that.

Making VTP Awesome

Single VTP domain = Single Failure Domain

The most common deployment model that people undertake is to make a single VTP domain and add all switches with the same configuration. It’s pretty typical and, in my opinion, generally a bad thing.

Features

  • easy to configure – same configuration on every switch
  • easy to understand – all VLANs everywhere
  • so obvious you don’t really think about it.

So your failure domain looks something like this, and all elements will share fate if the configuration data is corrupted or misconfigured.

[Figure: VTP failure domain 1]

The problem, as many of us know, is that a single mistake can remove a VLAN from every switch in the network. The classic worst case is when someone connects a switch whose VLAN database has a higher configuration revision number: it can wipe all the VLANs from every switch in the network.
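
The “higher number wins” behaviour can be sketched in a few lines of Python. This is a simplified model under stated assumptions: it ignores server/client mode, passwords and the actual message formats, and the `Switch` class, the `connect` helper and all the names and numbers below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    name: str
    domain: str
    revision: int = 0  # configuration revision number of the VLAN database
    vlans: dict = field(default_factory=dict)  # VLAN id -> VLAN name

    def receive(self, adv_domain, adv_revision, adv_vlans):
        """Adopt an advertised database only if the domain matches and the revision is higher."""
        if adv_domain != self.domain or adv_revision <= self.revision:
            return False
        self.revision = adv_revision
        self.vlans = dict(adv_vlans)  # the local database is replaced, not merged
        return True


def connect(new_switch, network):
    """Advertise new_switch's database to the network; return the names it overwrote."""
    return [
        s.name
        for s in network
        if s.receive(new_switch.domain, new_switch.revision, new_switch.vlans)
    ]


production = [
    Switch("core1", "CAMPUS", 12, {10: "users", 20: "voice"}),
    Switch("access1", "CAMPUS", 12, {10: "users", 20: "voice"}),
    Switch("dc1", "DATACENTRE", 7, {30: "servers"}),
]
# A lab switch that has seen many changes and still carries the production domain name.
lab = Switch("lab-spare", "CAMPUS", 40, {1: "default"})

overwritten = connect(lab, production)
print(overwritten)
```

Note which switch survives in this model: `dc1` keeps its VLANs only because its domain name is different, which is the design lever the rest of this post relies on.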

Question

Now, is this problem caused by the protocol? Or is it a problem caused by the network engineer?

I would say that the network engineer/designer has built a single Failure Domain where every element in the domain shares the same fate when a failure occurs.

Cracking the Failure Domain

A more careful design might consider the following:

  • VTP is less error prone for day-to-day VLAN configuration than manually configuring tens or hundreds of switches.
  • VTP is reliable, except for configuration mistakes or poor engineering control (the configuration revision problem).
  • VTP lowers costs by reducing configuration time.
  • Creating a single failure domain is still a bad idea.

But we can mitigate the risk of single failure domain by making many smaller ones.

  • We could make several VTP domains for functional areas in the network.
  • A row of racks in the data centre.
  • All switches in a given building or wing have their own VTP domain.
  • All core switches to be in a separate VTP domain.

So now our failure domain is many smaller domains (for VTP anyway) and would look something like this:

[Figure: VTP failure domain 2]

In the event of a mistake in one of the domains, the impact would be limited to a smaller area of the network.
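
A rough way to compare the two designs is to count how many switches share each domain. The sketch below is only an illustration: the function names, switch names and domain names are invented, and it assumes a VTP mistake reaches every switch in the domain where it happens and no switch outside it.

```python
from collections import Counter


def blast_radius(domain_of):
    """Map each VTP domain name to the number of switches that share its fate."""
    return Counter(domain_of.values())


def worst_case_fraction(domain_of):
    """Fraction of all switches hit by a mistake in the largest domain."""
    counts = blast_radius(domain_of)
    return max(counts.values()) / len(domain_of)


# Hypothetical network of eight switches, first as one big domain...
single = {f"sw{i}": "CAMPUS" for i in range(8)}

# ...then split by functional area.
split = {
    "sw0": "DC-ROW1", "sw1": "DC-ROW1",
    "sw2": "DC-ROW2", "sw3": "DC-ROW2",
    "sw4": "WEST-WING", "sw5": "WEST-WING",
    "sw6": "EAST-WING", "sw7": "EAST-WING",
}

print(worst_case_fraction(single))  # every switch is exposed
print(worst_case_fraction(split))   # only the largest domain is exposed
```

The numbers are trivial, but running this sort of count against a real inventory is a quick check on whether one “functional area” has quietly grown to cover most of the network.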

Better Fate Sharing Outcomes

I would enhance this design a bit further. I probably would not have a VTP domain on the core switches at all. In this way, a configuration error in VTP on the Core Network does not extend Fate Sharing to all other VTP domains. That is, a VLAN misconfiguration on one core switch is not likely to impact other core switches and thus cause a failure in the data centre and building wings. Likewise, a misconfiguration in the West Wing of your building doesn’t take down your entire site.
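
As a sketch of that reasoning, the Python below asks which switches would adopt a bad VLAN database depending on where it originates. It is a toy model: the `affected_by` function, the inventory and the mode labels are assumptions for illustration, and it treats a switch outside any VTP domain (transparent mode here) as affecting only itself.

```python
def affected_by(origin, switches):
    """Return the names of switches that would adopt a bad VLAN database from `origin`.

    `switches` maps a switch name to a (domain, mode) pair. In this model a
    transparent switch neither pushes its VLANs to others nor adopts theirs.
    """
    domain, mode = switches[origin]
    if mode == "transparent":
        return {origin}
    return {
        name
        for name, (d, m) in switches.items()
        if d == domain and m != "transparent"
    }


# Hypothetical site: the core sits outside VTP, the wings and data centre have their own domains.
site = {
    "core1": (None, "transparent"),
    "core2": (None, "transparent"),
    "west1": ("WEST-WING", "server"),
    "west2": ("WEST-WING", "client"),
    "dc1": ("DC-ROW1", "server"),
    "dc2": ("DC-ROW1", "client"),
}

print(affected_by("core1", site))  # the mistake stays on the one core switch
print(affected_by("west1", site))  # the mistake stays in the West Wing
```

In this model the core still has to be configured by hand, which is the trade: a little more work on a handful of switches in exchange for a core that no VTP mistake elsewhere can touch.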

The EtherealMind View

Careful designs consider Failure Domains and Fate Sharing. You can use the same concepts for many other network technologies such as spanning tree (PVST) or L3 routing protocols (BGP backbones/OSPF edges). I believe that planning for failure is about mitigating the impact of the failure in addition to preventing the failure. By considering the consequences of a failure, it may be possible to balance benefit and risk and use that fancy technology to produce a tangible business benefit.

And for VTP? It’s not VTP that is the problem, it’s how you implement it that matters.

VTP doesn’t break networks, engineers do.

About Greg Ferro

Greg Ferro is a Network Engineer/Architect, mostly focussed on Data Centre, Security Infrastructure, and recently Virtualization. He has over 20 years in IT, working as a freelance consultant for a wide range of employers including Finance, Service Providers and Online Companies. He is CCIE#6920 and has a few ideas about the world, but not enough to really count.

He is a host on the Packet Pushers Podcast, blogger at EtherealMind.com and on Twitter @etherealmind and Google Plus

  • http://showbrain.blogspot.com Ben Story

    Amen! I have often wondered why VTP is so villainized by the network engineer community. I have had my network setup with and without VTP. Prior to forklifting my old Nortel network in favor of Cisco we had no VTP and most problems that we had with MACs were finding out that trunks at various points didn’t have the right VLANs allowed.

  • http://inetpro.org Brandon Bennett

    This does not addres the fundamental issues with VTP.

    Fact: VTP is the the same on the wire if it’s configured as server or client. This leads many new technicians into a false sense of security.

    Fact: VTP was designed for a time when we had VLANs that were departmental based with the old 80/20 rule. 80 of the traffic stays on the vlan. 20 percent routes. Routing used to be expensive and slow. Modern networks should NOT be designed this way. Although you solve this by putting a handful of switches into a domain a GOOD network architect should see the silly point of running VTP between two or three switches and just manually configure the vlans.

    How often does one change vlan numbers to warrant creating a failure domain of two or three switches instead of just creating a VLAN failure domain of _ONE_ switch. Modern networks shouldn’t require the use of VTP and shrinking it down to one or two switches now make VTP nearly worthless. Not to mention now you are spending more time architecting a VTP failure domains. Waste of time.

    Finally you make the point, “VTP doesn’t break networks, engineers do.”. You are correct there sir. Most networks are comprised of more than one engineer. A lot of them are lower end engineers who run day to day operations like add or removing switches, but aren’t good enough to “run with the scissors” just yet. Your network is only as good as your weakest engineer. It’s better to not let the engineer a chance to break your network (or parts of your network)

    Do the right thing ‘vtp mode transparent’ or better (when supported) ‘vtp mode off’.

    • Eliot

      If you only have a few switches I can understand not using VTP, however, this article is relating to a network with a lot of switches where separate VTP domains would help. I think you’re making assumptions the article is not so your points are a little off base. A datacenter could have 30 top of rack switches in one row with more than 20+ rows in the entire center. Given this context, I can see where separating VTP domains would help if the engineers are intent on using it.

      • http://etherealmind.com Greg Ferro

        You make a good point. This VTP domain method has more impact when you have a thousand switches on a university campus. Or in very large data centers. Some people may not be able to see the wood for the trees in smaller sites where the problems are different.

      • http://inetpro.org Brandon Bennett

        Yes, but i require immediate route changes from BGP.

        I don’t require VLANs changes to be propigates right away. I usually over build a DC with more VLANs that I require. The last datacenter i did that in I built the VLANs 5 years ago and maybe a handful of VLANs have been added since then. This can be scripted and controlled by engineers in windows without running VTP with about the same effort and fit better into proper ITIL-like practices IMHO.

        For a campus, any VLAN that spans 20 switches (or more than 4 really) you have failed at your design. There are some exceptions like guest or wifi access, but I actually mitigate these with controller based APs and VRF-lite (or even full MPLS) to provide further isolation.

        Turn off VTP and spend your engineering time perfecting VLAN numbering and STP.

    • http://etherealmind.com Greg Ferro

      My point is that careful and thoughtful design can solve the problems of a system. For example, BGP is a crap protocol and requires extensive design and planning to keep it working properly. VTP is no different, and rejecting its features out of hand is not necessarily the best idea.

      I guess I both agree and disagree with your point. But I maintain that all protocols can be harnessed to get good outcomes.

      • http://inetpro.org Brandon Bennett

        Yes but BGPs potential issues can be mitigated by policy.

        VTP has no such policy and you are recommending doing smaller VTP domains. So you are increasing the work load, the design time, to “shrink” failure domains as the old means of mitigating the risks of VTP. Turning it off completely mitigates this risks without that much more trouble.

        The way that people justify VTP makes me wonder if people are turn up and down VLANs on a daily or even hourly basis. You make it sound like it’s so hard to properly keep track of VLANs, the devices they are on.

        Spend your time properly designing your network and vlans and VTP is not necessary and you have zero risk.

        Now if there was a way to implement more policy based VLAN propagation protocol I may conceder it, but it still seem pretty silly for something that happens maybe a max of 10 times a year. (probably more like 4 times a year).

  • http://billyc5022.blogspot.com Bill Carter

    I agree! VTP is useful. Engineers are the problem, not the protocol.

  • Merrill Hammond

    One thing that wasn’t mentioned at all is VTPv3. Cisco hasn’t publicized it much, but it’s available on pretty much everything running 12.2.53 or later (If I remember right, 12.2.55 for sure) Version 3 does a much better job of protecting the database from overwrites and versioning issues.

    • http://etherealmind.com Greg Ferro

      Although VTPv3 is better, it’s still vulnerable to misconfiguration. I suggest that design techniques like this help to solve that problem.

  • Will

    I guess I still don’t see the value in breaking up your VTP domains. If you are maintaining separate domains and grouping VLANs together in those domains, what is the point? The “VALUE” that VTP offers is easing administration of creating large amount of VLANs on a large amount of switches. If you follow this approach you are watering down the “VALUE” of VTP.

    VTP Transparent is the only way. As someone stated earlier there are too many situations that can go badly with a switch in the wrong person’s hands.

    • http://etherealmind.com Greg Ferro

      I’ve experienced too many misconfiguration errors when using VTP transparent. If you think that dynamic routing protocols are good, then dynamic VLAN tools are also good. You should consider mitigating the risk, not denying the tools.

      • http://inetpro.org Brandon Bennett

        Dyanmic routing protocols and dynamic VLAN _PROVISIONING_ tools are completely different beasts. I don’t get that comparison at all.

        There is no need for convergence, TE, or other automatic behaviors when creating and removing VLANs.

  • Mike B

    @Brandon – Every single response you made contains typographical errors. My point is not personal, rather technical. You expect me to believe that it’s “safer” to consistently name and number vlans across the enterprise one at a time or via a homebrew script than use the tool that was designed to do it? We have over 400 switches in our enterprise and the vlan information is the same on all. I can however, tell you for a fact what technician wrote the device banners by the typos, the extra delimiters, the flawed spacing etc.

    Automation is the key consistency.

    You mention – “So you are increasing the work load, the design time, to “shrink” failure domains as the old means of mitigating the risks of VTP. Turning it off completely mitigates this risks without that much more trouble.
    The way that people justify VTP makes me wonder if people are turn up and down VLANs on a daily or even hourly basis. You make it sound like it’s so hard to properly keep track of VLANs, the devices they are on.”

    You think tracking what vlans are on what devices x 400 is less work than using VTP? We ALL know that even the best kept documentation is in various stages of historical archiving.

    Our keys to success-
    1. Slow down
    2. Config review
    3. VTP trans then VTP client as the last step before deploying anything
    4. Appropriate show commands to verify the box is doing what you think its going to.

    Not saying I have never made a config mistake but there are times when you have to pay extra close attention and when connecting gear for the first time is a good place to start.
    This isn’t your house and this ain’t a 4 port Linksys. Treat it as such and laziness will have serious ramifications.

    Mike
    P.S. Don’t even get me started on using VSS to eliminate STP.

    • http://inetpro.org Brandon Bennett

      “Every single response you made contains typographical errors.”

      I blame Portal 2. I was up last tonight too late playing it.

      “We have over 400 switches in our enterprise and the vlan information is the same on all”

      Ouch dude! MST I assume? You can’t run that many STP on most Cisco switches, so VSTP is out of the question.

      “Automation is the key consistency.”
      I am not against automation. Actually just the opposite. I am quite pro automation for the reasons you spelled out above, but I like to control exactly how that automation is done. I can script RANCID to push out VLANs, NTP devices, change hostnames, etc. All are controlled on the devices I want and I know what I am going to get. VTP is more automagical vs automation.

      Because I am horrible at spelling/typing/etc I am a huge fan of automation, just not VTP

  • sh0x

    “Planning for failure is about mitigating the impact of the failure, in addition to preventing the failure.”

    Nicely said, Greg!

  • PG

    A number of new switch models don’t even support VTP. So Cisco have recognised that VTP is a bad thing in some situations. I can see its usefulness in a DC or campus with a lot of switches though.

  • http://Www.HP.com/go/networking Christopher Young

    Hey Gergen,

    Great post. Personally, I have a huge distrust of VTP based on having a higher rev database wipe out my production environment in a prior life. Automagic scares me, automation is my friend.

    I would rather use tools to do a scripted manual creation of the VLANs than use VTP or GVRP.

    Any comments for us standards-based protocol guys on the pros and cons of GVRP vs VTP? I personally see value in the idea, but experience has taught me not to place this in the hands of lesser mortals.

    To quote Uncle Ben (Spider-Man, not the rice guy): “with great power comes great responsibility”
    And frankly, I still see too many networks where the root bridge hasn’t even been defined.

    @mike b – curious to hear your thoughts on why using VSS ( or IRF from HP or virtual chassis from juniper, etc…) to eliminate spanning tree is a bad thing? What is the objection here? without debating the individual merits of the implementations, is there something in particular you don’t like?

  • Marc Abel

    Greg,

    In your first line:

    A lot of people regard Cisco’s Virtual Trunking Protocol(VTP) as nothing but trouble.

    Shouldn’t that be “VLAN Trunking Protocol” and not “Virtual Trunking Protocol”?

    Thank you for all your great content.
    -Marc

  • Bal

    good article. excellent comments too.
    I like VTP, and enablement of VTP pruning. It keeps the vlan configuration very clean and consistent.
    That said, i have been bitten by VTP and would rather live with the extra config work than have that happen to me again. Other than for the smallest (downtime is not much of an issue) networks. I would turn VTP off.

  • Muhammad Akl

    BTW, I totally agree with Brandon. When it comes to VTP in a campus, or any place that has a lot of switches, you will notice that dealing with VTP problems makes your life painful. You will waste a lot of time solving a simple issue that in normal situations would not even take minutes (e.g. corrupted vlan.dat files).

    That’s why using some manual scripts is better than suffering from VTP stupid problems.

  • kAos

    And what is the impact of a mistake?
    Adding/changing/removing a misconfigured (mistyped) VLAN to devices/trunks on a VTP transparent domain: outage to a not-yet existing or already unused services = inconvenience.
    Adding a misconfigured switch to a VTP active domain: outage to a large number of live customers = discussion with your CTO/COO.
    Looks like a no brainer to me.

  • http://www.amateurgeek.co.uk/ Murali Suriar

    The problem is almost always the engineer; protocols only do what we tell them to do.

    However, while I agree with the above point, I disagree with almost everything else. In this case, the fault of the engineer is choosing a topology which requires VTP in the first place.

    Firstly: big layer 2 domains? Really? Are people still designing applications and systems which rely on having layer 2 adjacencies with other nodes? If so, why aren’t the network engineers of the world rising up as one and saying “It’s the year 2011! Why are we still screwing with this stuff?”

    Secondly, assuming you’re in a legacy environment where large layer 2 domains are required: as mentioned by Brandon, VTP is a *provisioning* protocol. Whereas routing protocols need to react to network events (links/devices failing, loss of reachability to remote networks, etc), what failure conditions require a change in the VLAN information distributed to a given node? The only use case I can think of, perhaps, is a quarantine VLAN for endpoints which have failed dot1x posture verification, and I’d argue that such a VLAN should be configured on every switch by default anyway.

    Given that VLAN information only changes when a subnet is added/removed from the network, or when an access port is configured, what does VTP give you? At this point, everyone should be provisioning ports, subnets and devices automatically. If you have a configuration generation/management system already, then surely getting it to add a couple of lines of trunking/VLAN configuration as well can’t be that much more effort?

    I agree that VTP provides benefits for the lazy administrator, however I’d argue that the risks it exposes you to aren’t worth it, given that the problems it solves can also be solved by configuration management, which you should be doing anyway.

    ‘vtp mode transparent’ for the win.

    • http://www.insearchoftech.com Matthew Norwood

      “Firstly: big layer 2 domains? Really? Are people still designing applications and systems which rely on having layer 2 adjacencies with other nodes? If so, why aren’t the network engineers of the world rising up as one and saying “It’s the year 2011! Why are we still screwing with this stuff?””

      I agree with you! Unfortunately, the trend these days is to have a gigantic layer 2 network once again. Possibly even shared between multiple sites.

    • ijdod

      Unfortunately, the bane of network design is the fact that we can typically make ‘anything’ work on the same equipment. So, after hearing out our complaints, management will give in to the demands of redundant systems using inappropriate fail-over models. After all, if we can ‘just’ span a L2 vlan across the country, why would they need to buy that very expensive L3 fail-over license for that hot database they insist on using…

  • Mike B

    @ christopher y. It’s because “most” people implement things because of something they don’t understand. You’d be amazed at the number of “senior” level certified folks that don’t understand how spanning tree works and what it’s used for. A fair number of folks I see these days sadly are all part of the take it out of the box, add a vlan and call it done crowd.

    For our enterprise, VSS doesn’t add anything other than another point of failure and complexity. We are a government shop responsible for fire and multi jurisdictional 911 dispatch and our enterprise has had over six 9s of uptime for 4 out of the last 5 years. The one miss was when a construction crew triggered the FM-200 system and the datacenter went dark.

    Mike

  • Ben C

    While VTP sounds good in theory, a secure network will likely have unused VLANs pruned from its trunks and will have unused access ports shut down and placed in an unused VLAN. Hence, when it comes to provisioning a new VLAN, there are other tasks and considerations beyond the creation of the VLAN itself. Whether this whole process is done by a script (ideally) or manually, the benefits of using VTP are significantly diluted when these other tasks have to be considered and implemented.

    Also, in the world of virtualisation how difficult is it to define clear cut VTP boundaries? Thinking especially when it comes to long range VMotion.

    Cheers.

  • Brett Mason

    Typically if an organization’s technical network department is against utilising VTP, I see this as not an issue or problem with VTP as a technology, but rather that the organisation has some other issues or concerns such as, restrictions on admin access to devices, configuration backup, configuration and device monitoring, and LAN design/topology.

    If these concerns can be mitigated then the benefits of VTP far outweigh the potential risks of using it, IMHO.

  • Richard P

    Being able to blame someone doesn’t help if a service is affected or there’s a major outage. IT managers just won’t accept the risk, especially if the argument for VTP is not that it’s any more powerful or featured, just that it’s ‘easier’.

  • http://libertysys.com.au/ Paul Gear

    Caveats:
    - I work with an all-HP network, so i have no experience with VTP.
    - I am no networking guru, i hold no certifications, i’m just a happy user.

    I am absolutely baffled at the possibility that a VLAN registration protocol could cause outages. I use GVRP extensively in my network and it has never caused any downtime. Rather than having to design my network around GVRP’s limitations, my network design is actually simplified by turning on GVRP, because i don’t have to care about where a VLAN physically exists; i can just put ports in that VLAN (on opposite sides of the campus, if need be), and GVRP will link them all up. (I work in a campus LAN environment where there are no WAN links, and link congestion is basically never an issue – if it was, i would be a lot more careful about where the VLANs live.)

    This discussion makes me wonder whether the problem is the VTP protocol itself. As far as i can tell (from reading doco and looking at packet dumps in wireshark), GVRP has no concept of a database or serial number, so no one switch can affect any other switch’s view of its own VLANs. As long as a switch has ports in a VLAN, it will send out advertisements for that VLAN, and as long as GVRP is enabled on the appropriate ports, it will receive advertisements for that VLAN. Everything is fully distributed, and the disabling of GVRP or an individual VLAN on one switch will have no effect on other switches, unless they were just carrying the VLAN and weren’t actually using it, in which case they don’t care anyway.

    Check out GVRP. At least on ProCurve gear, it is rock-solid and saves a lot of work.

  • Carlos

    Couple of notes I would like to make:
    -Don’t confuse the need of the function with the quality of the protocol.
    If many run from VTP, it is because the problems it MAY generate are of great magnitude.
    I’d call it a brittle solution.
    -VTP as it is today is a patch on what it was at inception, AFAIK.
    Client switches were not supposed to keep state…
    -There might be a better way to do vlan config flooding, may be the OSPF way of aging out LSAs is better than a triggered delete, which seems at the root of most of the problems.

    Engineer problem ?
    Sure. Until we pass the singularity, I expect all problems to be human errors !
