TRILL: It’s Déjà Vu All Over Again

If you’re old enough to remember what the ZX-81 was all about, you’ll probably experience a weird sense of déjà vu when exposed to the beauties of TRILL. For those of you who have never been exposed to brouters, here’s a short summary:

In the early 1990s we started building large WAN networks, first with host-to-host links, then with WAN bridges and finally with routers. Not surprisingly, networks built with WAN bridges experienced catastrophic failures (extending a single broadcast domain over slower-speed links is never a good idea). Unfortunately, some networking engineers love to fail multiple times, so they’ve reinvented WAN bridging again and again (if you’re interested in VPLS woes, read the VPLS article I wrote for SearchTelecom).

Learning from the failures of WAN bridging in the early 1990s, network designers turned to routers. However, some companies trying to enter the game without the prerequisite engineering prowess tried to cut corners by introducing Layer-2 Protocol-Independent Routers, which looked and acted very much like what TRILL is trying to introduce: they used SPF algorithms to compute the shortest path to individual MAC addresses and used all paths in the network (not just the spanning tree) to forward the traffic. Alas, a bridge remains a bridge even when you call it a brouter or a switch, and I’ve seen several spectacular meltdowns of brouter-based networks.
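To make the SPF idea concrete, here is a minimal sketch of shortest-path forwarding toward MAC addresses. It is a generic Dijkstra computation, not the code of any brouter or of TRILL; the four-bridge topology, the link costs, the MAC-to-bridge mapping and the function names `spf` and `mac_table` are all assumptions made up for illustration.

```python
import heapq

def spf(graph, source):
    """Dijkstra SPF: return {bridge: (cost, first_hop)} as seen from source."""
    best = {source: (0, None)}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best[node][0]:
            continue  # stale queue entry, a cheaper path was already found
        for neighbor, link_cost in graph[node].items():
            new_cost = cost + link_cost
            # The first hop is the neighbor itself when leaving the source,
            # otherwise it is inherited from the node we came through.
            hop = neighbor if node == source else best[node][1]
            if neighbor not in best or new_cost < best[neighbor][0]:
                best[neighbor] = (new_cost, hop)
                heapq.heappush(queue, (new_cost, neighbor))
    return best

def mac_table(graph, source, mac_location):
    """Map each known MAC address to (cost, first_hop) toward its bridge."""
    routes = spf(graph, source)
    return {mac: routes[bridge]
            for mac, bridge in mac_location.items() if bridge in routes}

# Hypothetical topology: bridge -> {neighbor: link cost}. A spanning tree
# would block some of these links; SPF is free to use any of them.
graph = {
    "B1": {"B2": 1, "B3": 4},
    "B2": {"B1": 1, "B3": 1, "B4": 5},
    "B3": {"B1": 4, "B2": 1, "B4": 1},
    "B4": {"B2": 5, "B3": 1},
}

# Hypothetical attachment points (MAC addresses from the documentation range).
mac_location = {
    "00:00:5e:00:53:0a": "B4",
    "00:00:5e:00:53:0b": "B3",
}

table = mac_table(graph, "B1", mac_location)
for mac, (cost, first_hop) in sorted(table.items()):
    print(mac, "via", first_hop, "cost", cost)
```

Note what the sketch does not change: the table is still keyed by flat MAC addresses with no summarization, and every bridge still belongs to one broadcast domain. A better path computation does not turn a bridge into a router.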

The idea of SPF-based bridging got such a bad name that nobody even tried to resurrect it for over 15 years, but with fading memories and a (supposedly) completely different landscape, the same technology has made a Phoenix-like reappearance. Its designers added interesting bells and whistles (support for VLANs and a hierarchical bridging structure similar to 802.1ah), but it’s the same story: a bridge remains a bridge.

The proponents of TRILL are positioning it within the Data Center, and it’s probably a valuable addition to the Data Center designer toolbox, but I’m positive that once TRILL gets standardized and implemented, some vendors will go out and sell it as a plug-and-play low-cost replacement for routers … and generate a few more spectacular failures.

The sad part of the whole saga is that for almost 15 years we have had technologies that could solve the fundamental Data Center issue that requires large-scale bridging (live migration of virtual machines between physical servers): Cisco IOS has supported Local Area Mobility since (at least) IOS release 11.0, and we implemented a LAN with hundreds of hosts using Local Area Mobility in the mid-1990s. Properly designed, proven technologies combined with a few boring solutions introduced in recent years (for example, the inter-chassis link bonding available in Cisco’s Virtual Switching System) could solve most of the Data Center virtualization problems, but of course it’s more interesting to develop yet another complex technology.

Maybe we should finally grow up and stop playing MacGyver, trying to save the world with Rube Goldberg-like contraptions. Maybe we should admit every once in a while that we can’t work around every stupidity thrown at us, impose some structure and sound engineering practices on our networks, and tell the host/OS/application vendors how networking is done properly. Until then, people will gladly tell us that networking is not even close to a science.


  • http://asnumber.net/ Matthew Walster

    I’m getting the feeling that your fear of switching becoming increasingly prevalent again is leading down a strange path: Servers having a [ IPv4/32 | IPv6/128 ], and something like IS-IS/TRILL with unnumbered (or RFC1918/ZeroConf/v6LinkLocal) interfaces on the link layer. Then, your access switch would actually be an access-router.

    IMO, it’s quite unworkable in real life, but many people get tempted by such a situation.

    • http://blog.ioshints.info Ivan Pepelnjak

      I have no fear of switching (trust me). It has its merits (and drawbacks), it has its place in a well-designed network … but it’s neither Aspirin nor panacea.

      I would love to see switching go beyond spanning-tree limitations (and TRILL is one way to go), but I also know that the moment we break the spanning-tree limits, people will start to implement L2-only networks that will eventually fail because bridging simply can’t scale … and the bigger the network, the more spectacular the failure will be.

  • http://blog.INE.com Petr Lapukhov

    There haven’t been any really good ideas in networking for quite a while. I’m surprised Radia Perlman actually put that much effort into TRILL, as the technology does not seem exciting at all. There have been many other, more promising approaches to Ethernet scalability (Smartbridges, CMU-Ethernet, SEATTLE, to name a few). Both OTV (another technology I consider a failure) and TRILL look dull and inelegant. Instead of trying to reanimate switching, it makes more sense to try to improve routing, which has been proven to scale well, at least to some extent (e.g. see the research on compact routing and location/ID separation: not just LISP, but other proposals as well; LISP sucks too).

    • Dave smith

      Hi Petr, I’ve been discussing the feasibility of using LAM as a viable solution for IP mobility for years. The general feedback I hear from colleagues, however, is that they feel uncomfortable about the technology because there is very limited documentation and its level of scalability is unknown. Do you have any experience or lessons learned in using LAM to achieve IP mobility, and how scalable can it be in the enterprise?

      Thanks

      • http://blog.ioshints.info Ivan Pepelnjak

        The primary scalability concern is the number of static host routes you have to carry in your IGP. The more routes you have, the larger the topology database and routing tables will be.

        Another serious issue with LAM is its total lack of security: any IP host can claim any IP address (within the range permitted by an ACL, but still …). This might be good enough for an environment with a single trust zone (the inside part of the enterprise data center, for example), but definitely not for a multi-tenant public IaaS cloud offering.

        However, speaking of “limited documentation” and “unknown levels of scalability”: why is anyone willing to bet on TRILL, then? It’s an unproven technology that still works only in PPT.