How TRILL (and SPB) can reduce STP risk and mitigate impact

I take the view that many people don’t appreciate Spanning Tree Protocol for its unique abilities. It’s certainly a protocol designed in a different time and for different reasons. Today, STP has scalability problems, and they are well explained in Ivan Pepelnjak’s “Transparent Bridging (aka L2 Switching) Scalability Issues”, published just this week.

There are very few mitigation techniques to solve the BUM (broadcast, unknown unicast and multicast) problem, and some of the current STP optimisations will break or fail unexpectedly in large East/West network designs. For example, the use of Port Fast [1] means that a traffic loop can form before the loop is detected via BPDUs and the ports are shut down. It’s not common, but it can happen, and in very fast networks it can get out of control and cause the usual STP meltdown. Thus, we still need to address the limits of spanning tree as a technology.

The TRILL Effect

While there are a number of advantages to TRILL, the short-term gain is to reduce the impact of the STP domains. If you can agree that a single, very large STP domain is a problem, then you should also agree that several smaller STP domains would be an improvement. Let’s assume that a typical network would look something like this. We have a core of six switches with a sample of six access layer switches. One pair of the core switches are the root switches, and the cabling looks approximately like this:

Trill improves stp 1

Let’s replace STP in the core with TRILL. Now that we have a loop-free core, we can choose to create a full mesh (or a partial mesh, according to your needs), so the cabling between switches can be shown like so:

Trill improves stp 2
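
To get a feel for what a full mesh costs in cabling, here is a minimal sketch. It assumes exactly one link between every pair of core switches, and the function name is purely illustrative.

```python
def full_mesh_links(switches: int) -> int:
    """Links needed when every switch has one direct link to every other switch."""
    return switches * (switches - 1) // 2


# The six-switch core from the example above.
core_links = full_mesh_links(6)
print(core_links)  # 15
```

The count grows quadratically with the number of core switches, which is one reason a partial mesh may suit a larger core better.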

Let’s add the Access Layer switches back to each of the TRILL core switches, and map the STP domains:

Trill improves stp 3

This stylised diagram shows how TRILL reduces the size of each STP domain. Remember that there is no routing in this discussion, only switching at Layer 2, so poor technologies like VMware vMotion and Microsoft’s NLB will still work even when the hosts are not connected to the same STP domain.
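
The shrinking of STP domains can be sketched as a small graph exercise. This is a simplified model, not a protocol implementation: it assumes TRILL switches terminate spanning tree, so an STP domain is whatever group of classic bridges stays connected once the TRILL switches are taken out of the graph. The switch names and the cabling (one uplink per access switch, access switches cross-linked in pairs) are assumptions for illustration and are not taken from the diagrams.

```python
from collections import defaultdict


def stp_domains(links, trill_switches):
    """Group classic bridges into STP domains.

    Simplified model: a domain is a connected component of the topology
    after every TRILL switch has been removed from it.
    """
    neighbours = defaultdict(set)
    bridges = set()
    for a, b in links:
        for node in (a, b):
            if node not in trill_switches:
                bridges.add(node)
        if a not in trill_switches and b not in trill_switches:
            neighbours[a].add(b)
            neighbours[b].add(a)

    domains, seen = [], set()
    for start in sorted(bridges):
        if start in seen:
            continue
        stack, domain = [start], set()
        while stack:  # depth-first walk over bridge-to-bridge links
            node = stack.pop()
            if node in domain:
                continue
            domain.add(node)
            stack.extend(neighbours[node] - domain)
        seen |= domain
        domains.append(domain)
    return domains


# Assumed topology: six core and six access switches (names are hypothetical).
core = [f"core{i}" for i in range(1, 7)]
access = [f"acc{i}" for i in range(1, 7)]
links = [(a, b) for i, a in enumerate(core) for b in core[i + 1:]]  # full-mesh core
links += list(zip(access, core))                                  # one uplink each
links += [("acc1", "acc2"), ("acc3", "acc4"), ("acc5", "acc6")]   # access pairs

all_stp = stp_domains(links, trill_switches=set())
with_trill = stp_domains(links, trill_switches=set(core))
print(len(all_stp), max(len(d) for d in all_stp))        # 1 12
print(len(with_trill), max(len(d) for d in with_trill))  # 3 2
```

Under these assumptions, one twelve-switch STP domain becomes three two-switch domains once the core runs TRILL, so a meltdown in one domain has far fewer switches to take down with it.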

Of course, there is one weakness here. Consider what happens if someone connects two access switches in two different STP domains. It’s my understanding that TRILL will still handle this loop by interoperating with STP, but I need to do some more research before I can be confident about that.

Trill improves stp 4
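
Purely at the topology level, the backdoor cable can be sketched as below. The function names and the starting domains are hypothetical, and the sketch says nothing about how TRILL or STP actually reacts on the wire, which is the open question here. It only shows that, on paper, such a link joins two small domains into one larger one.

```python
def domain_index(domains, switch):
    """Return the position of the STP domain that contains `switch`."""
    for index, members in enumerate(domains):
        if switch in members:
            return index
    raise KeyError(f"{switch} is not in any STP domain")


def add_access_link(domains, link):
    """Return the STP domains after cabling `link` between two access switches.

    Topology only: a link inside one domain changes nothing here, while a
    backdoor link between two domains merges them into one larger domain.
    """
    a, b = link
    ia, ib = domain_index(domains, a), domain_index(domains, b)
    if ia == ib:
        return [set(d) for d in domains]
    merged = domains[ia] | domains[ib]
    rest = [set(d) for i, d in enumerate(domains) if i not in (ia, ib)]
    return rest + [merged]


# Hypothetical starting point: three small domains behind a TRILL core.
domains = [{"acc1", "acc2"}, {"acc3", "acc4"}, {"acc5", "acc6"}]
after = add_access_link(domains, ("acc2", "acc3"))
print(sorted(len(d) for d in after))  # [2, 4]
```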

Two Layer Model

The three-layer switch model, using access / distribution / core layers, is well established. It was only necessary when silicon was slow and expensive. Today, two layers are more than enough, and simpler to troubleshoot and maintain, so don’t add more complexity; just keep it simple.

The EtherealMind View

It’s difficult to avoid STP meltdowns in certain scenarios, but with careful design and attention to detail on your STP enhancements you can build a very safe L2 network. You can also mitigate the impact of this risk by creating smaller STP domains, and using TRILL / SPB does exactly that.

Today, this design isn’t very practical because vendors want to charge a hefty premium for TRILL / SPB features. In my view, for almost all networks TRILL / SPB isn’t worth the price that vendors want to charge. Therefore I’d recommend waiting another year or two before committing. In the meantime, you can start discussing and planning the future of your Data Centre or Campus LAN and how you can take advantage of Equal Cost Multipathing in your network core with TRILL / SPB.

Disclosure

I have nothing to disclose in this article. My full disclosure statement is here


  1. Because Port Fast assumes that there are no BPDUs to be received on the interface, it will move to forwarding state immediately. If BPDUs are received, then the port will move to blocking state… usually, but not always, before a loop has paralysed the network.
  • ftallet

    A port configured for portfast will not necessarily revert to blocking when it receives a BPDU. It will just become a regular STP port, and *might* block based on the information it receives.
    The loop between sites that you have put in your diagram will be identified by ISIS hellos with TRILL. In this respect, TRILL is working as an overlay solution and could experience the temporary loop that you’ve described with portfast: when you add this backdoor connection, there is a loop until TRILL hellos have been exchanged. To work around this, TRILL can sense STP changes and revert to a blocking state while STP is recomputing… not very efficient in terms of convergence time and network impact (you can end up blocking for a long time, even if STP is reconverging for something that is not a backdoor connection).
    Both FabricPath and SPB will have the “L2MP core” behave like a bridge running STP. As a result, they’re not affected by the overlay effect I’m describing. It’s STP that will take care of blocking somewhere along the path you’ve added. You have practically merged the two STP domains into a single one.
    Regards,
    Francois

  • Will

    Thanks!!!!!!!!!!!!!!!!!  I’ve been waiting for someone to start blogging something on the deployment of TRILL or FP.  

    So is the plan to implement TRILL in phases (which I assume is how people implemented STP back in the 90s)?
    I’d think in this day and age it would be all or nothing. I see more complexity added above than just a ‘simple’ STP domain.

    Unless TRILL is some plug and play protocol that we’ll never touch like I assume people were brought to believe STP was back in the 90s…riiiiiight.

  • Alexander Papantonatos

    Smaller STP domains are indeed more manageable and easier to properly configure for stability than large STP domains. But AFAIK most “Ethernet fabrics” are designed with end-to-end implementations in mind. In this scenario the end systems (i.e. servers) are directly connected to the TRILL core, or in a 2-layer architecture the access switches are TRILL switches and are part of the “Ethernet fabric”. In these scenarios there is no STP whatsoever and any loops are handled by TRILL using the underlying routing protocol (ISIS, FSPF, etc.). That’s a big improvement over STP or even over smaller STP domains.

    The cost of TRILL hardware is indeed high. But you have to take into account that most TRILL-like implementations have much more overall capacity (backplane b/w, port density, port speeds, etc.) compared to non-TRILL-capable hardware. As an added bonus, because most TRILL hardware is based on new designs, it has new features like virtualization awareness that are not available on legacy platforms. To better understand the extra cost of TRILL, you should do an apples-to-apples comparison: compare the cost of a TRILL platform with a legacy platform of similar capabilities and characteristics, preferably from the same vendor.

    P.S. Love the blog. Keep up the good work.