Blessay: On Stackable / Fixed vs Chassis / Modular Ethernet Switches

The debate on stackable vs chassis-based switches has a long and proud pedigree. Ever since the first switches arrived on the market, established vendors have produced chassis-based switches for the higher end of the market, where performance and reliability matter. The cheap manufacturers and new entrants to the market have always produced stackable switches and, inevitably, claimed how fantastic they were compared to chassis-based switches.

What’s interesting is that those same companies (provided they stay in business for long enough) always make a chassis-based switch in the end. Case in point: HP ProCurve.

When customers were confronted with the question of which to choose, the cheap vendors smiled and pointed to the price tag. The chassis vendor would then attempt to educate the customer on the advanced technologies and enhanced features in their device. Mostly, the customer didn’t understand, worked out that stackable switches are less than half the price per Ethernet port, and raised a purchase order.

The next time around, the same customer always bought the chassis based switch. Once bitten, twice shy. Let me try to put some meat onto the discussion.

Well, at least I’m gonna try.

The Zen Master was meditating over his network, at oneness with the Flow. A student approached the master and asked, “May I not stack the Cisco C3750 switches to create more connections to the Flow?”

The Master looked carefully at the young apprentice and said “It is believed that the sum of the parts is greater than the whole and that combining many into one creates more Flow.” And the student nodded, because that was his thought.

The Master smiled and then said, “But that which is many always remains many, and is never truly one.”

And the student was enlightened.

http://etherealmind.com/zen-stackable-chassis-switches/

The manufacturing quality of fixed format switches is lower than that of chassis switches.

Chassis-based switches are designed and built for much higher MTBF and lower MTTR than a fixed-format switch, and this is reflected in better software defect ratios and lower hardware failure rates. Hardware performance is improved due to better airflow, better design, and more testing and quality assurance.

In particular, the software for chassis-based switches seems to have far fewer defects and bugs, and to need fewer fixes and patches. I currently believe this is due to the relative simplicity of building a single OS compared to a distributed operating system. It could also be that the additional cost of the device allows for the development of better software and better testing.

More elements means less overall reliability.

This is counterintuitive and most people don’t get this until it’s pointed out.

Having six power supplies instead of two means a greater chance of failure. Let’s assume that stackable and chassis power supplies are of the same quality (not true, but let’s assume) and have the same chance of failure. With six units that could fail at any time instead of two, a power supply failure somewhere in the stack is roughly three times as likely as in the chassis.
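
To put rough numbers on it (a back-of-the-envelope sketch, assuming independent failures and an invented 2% annual failure probability per supply – it’s the ratio that matters, not the figure):

```python
# P(at least one of n supplies fails) = 1 - P(none fail),
# assuming independent failures and an illustrative 2% annual
# failure probability per supply (an invented figure).
def p_any_failure(n_supplies, p=0.02):
    return 1 - (1 - p) ** n_supplies

chassis = p_any_failure(2)  # ~0.040
stack = p_any_failure(6)    # ~0.114
print(f"chassis (2 PSUs): {chassis:.3f}")
print(f"stack   (6 PSUs): {stack:.3f}")
print(f"ratio: {stack / chassis:.2f}x")  # ~2.9x - close to the 3x rule of thumb
```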

The impact gets worse when you consider the Time To Repair. In a stack you must replace the entire switch and possibly restore its configuration and firmware. On a chassis, the configuration is stored centrally and is not lost; a replacement line card (provided it’s the same model) returns to service almost immediately.

Bandwidth Limitations and Shared Bus Architectures

The Cisco StackWise product documentation indicates that the C3750 bus is 32 Gbps full duplex, or 16 Gbps in a single direction (although Cisco claims 32 Gbps since it is a counter-rotating ring, in a stunning piece of marketing maths). Therefore a stack of eight 3750 switches shares a total backplane capacity of 16 Gbps, which isn’t a lot. Calculating the bandwidth of a shared bus is notoriously difficult, but you might say it’s less than 2 Gbps per switch for a fully loaded stack.
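
The per-switch arithmetic is simple enough to sketch, taking the 16 Gbps one-way figure above as the usable ring capacity:

```python
# Illustrative per-switch share of the StackWise ring under full load,
# taking 16 Gbps as the usable one-way ring capacity.
RING_GBPS = 16
for members in (2, 4, 8):
    print(f"{members}-switch stack: ~{RING_GBPS / members:.0f} Gbps per switch")
```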

To efficiently load balance the traffic, packets are allocated between two logical counter-rotating paths. Each counter-rotating path supports 16 Gbps in both directions, yielding a traffic total of 32 Gbps bidirectionally. The egress queues calculate path usage to help ensure that the traffic load is equally partitioned.

Whenever a frame is ready for transmission onto the path, a calculation is made to see which path has the most available bandwidth. The entire frame is then copied onto this half of the path. Traffic is serviced depending upon its class of service (CoS) or differentiated services code point (DSCP) designation. Low-latency traffic is given priority.

When a break is detected in a cable, the traffic is immediately wrapped back across the single remaining 16-Gbps path to continue forwarding.
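
In pseudocode, the path selection described above amounts to something like the sketch below. The names and numbers are invented for illustration; the real logic runs in hardware on the stack ASICs.

```python
# Simplified sketch of StackWise ring path selection as described above.
from dataclasses import dataclass

@dataclass
class RingPath:
    name: str
    capacity_gbps: float
    load_gbps: float = 0.0

    @property
    def headroom(self) -> float:
        return self.capacity_gbps - self.load_gbps

def pick_path(paths, frame_load_gbps):
    """Copy the whole frame onto whichever ring has the most headroom."""
    best = max(paths, key=lambda p: p.headroom)
    best.load_gbps += frame_load_gbps
    return best

rings = [RingPath("clockwise", 16.0), RingPath("counter-clockwise", 16.0)]
for load in (4.0, 2.0, 6.0, 1.0):
    chosen = pick_path(rings, load)
    print(f"{load} Gbps burst -> {chosen.name} ring "
          f"(now {chosen.load_gbps} Gbps loaded)")
```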

Shared bus designs are also very limited in throughput. Vendors like to implement shared bus technology because it is cheap and easy to build and has a high clock speed and headline data rate. But the gap between the headline throughput and actual goodput can be large, and that loss of bandwidth means a slow network.

If you combine the risk of bus congestion at critical overload events with the impact of electronic failure (which typically takes down the entire bus), I don’t have a positive outlook on shared bus architectures.

Reliability and Availability are NOT the same thing

Conceptually, Cisco claims that the loss of a single switch in a stack will not cause failure of the entire stack, and that this therefore provides greater AVAILABILITY. Availability is not the same as reliability. Because failures happen more often, your network needs more high availability features to compensate, such as redundant uplinks, STP optimisations and tuning, and routing protocol optimisations such as ECMP and BFD.
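
The standard formula, availability = MTBF / (MTBF + MTTR), makes the distinction concrete: a device that fails more often can still post a respectable availability number if repairs are fast. A small sketch with illustrative (not vendor) figures:

```python
# Availability depends on both failure rate (MTBF) and repair time (MTTR).
# Hour figures below are illustrative only.
def availability(mtbf_hours, mttr_hours):
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Less reliable but quick to swap vs more reliable but slower to fix:
print(f"stack member: {availability(50_000, 4):.6f}")   # ~0.999920
print(f"chassis card: {availability(200_000, 1):.6f}")  # ~0.999995
```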

Software reliability (a failure in a stack often takes the entire stack down)

Taking this one step further, field experience shows that losing one switch in the stack often causes the entire stack to fail or perform badly until the faulty switch is powered down or removed. The level of software integration that occurs when the stack forms seems to be quite challenging. This is my actual experience over the last fifteen years, and no matter what the vendors tell you, that is the actual experience. Every vendor has told me that this isn’t possible and assured me that one unit can never bring the stack down, but that is exactly my experience. Nortel, 3Com, ProCurve, Cisco: whatever.

The Stack Connector

The Cisco cable that is used to connect the Cisco 3750 hasn’t worked very well in practice. The connector is very large and heavy and seems to cause physical connection problems. My experience suggests that you need to reseat the connector every six months or so to prevent mechanical failure of the stacking cables. This requires an outage to the stack (even though it shouldn’t).

The same applies to all other stack vendors. Those physical connectors need to handle a high speed electrical signal within very tight parameters and are easily affected by stray RF or physical degradation such as oxidation. I have always expected that chassis backplanes would be affected by the same problems, but that doesn’t seem to be the case. I’m guessing that the more controlled physical environment of a backplane is conducive to better connections.

Feature Poverty

There are many features available in the Catalyst 6500 that are not available in other switches. Because chassis switches have bigger processor engines, they are able to handle more features. This includes basic features such as a large number of VLANs (since there is enough CPU to handle STP and BPDU generation), FHRP protocols, and faster timers on routing protocols such as OSPF (a 250 millisecond hello timer for OSPF on the C6500), as well as more advanced and less common features such as MPLS. The experience is the same for, say, the Nortel ERS 8600, which has a superior feature set to their stackable range. (( I am not completely up to date with ProCurve and cannot comment on their equipment ))

Conceptually, a single software process that controls an entire SINGLE device is a better technical choice than attempting to form a coherent set of discrete elements into a single LOGICAL device, i.e. forming a stack of switches into one logical switch. This software complexity is what leads to feature poverty (and the higher failure rates discussed previously).

Etherealmind’s view

With all of the above, you might think that I don’t like stackable switches. Technically, you are right. A chassis-based switch is ALWAYS more reliable – physically, in software, and operationally – and has more capability, performance and features. As a designer, though, I need to be careful with the budget. So if the business criteria mean that money is a crucial factor, then stackable is a viable choice.

But, cheap is as cheap does. Don’t expect an outstanding experience with stackable switches, and be happy with what you get.

If you are happy with Stackables

If you are happy with your Stackables, then fine. But I would bet money (and I’m not a gambling man) that you aren’t doing anything challenging.

  • It all depends

    Stack vs chassis really depends on port density and cost. I have always been of the belief that more than four switches in a stack should be a chassis. If you need that many ports, deliver them in the most efficient manner, which is a chassis. I think people get lost in all the numbers that get thrown around and the hopes of something better.

    While six 3750s would list for about 110k, a 4510 with the same port density is about 95k and still has two more slots. I would take the 4500 any day (I am not the biggest fan of that product, as a 6500 isn’t that much more money).

    The bigger challenge is convincing the bean counters that the initial investment is the right one, and that even though you don’t need it today, the capacity is there for the future.

    • http://etherealmind.com Greg Ferro

      The point of stacking is that you grow into them. I have seen an entire datacentre built around an eight-deep stack of C3750s!! It just kind of grew that way, and when the problems started they couldn’t work out why.

      Upgraded to C6500, and problems went away. Broke the C3750 down into top-of-rack and they worked well too.

      Stacking, it’s not so great in Cisco land.

    • Sam Crooks

      Actually, by my calculations, when you consider the TCO of the capital purchase of a given number of stackable switches AND the annual maintenance cost of 8x5xNBD SmartNet, and compare this to the capital purchase and maintenance of a chassis switch with comparable line card density, the chassis is less expensive in densities above 96 ports.

      Most people don’t consider the increased maintenance cost over time on stackables vs a fixed chassis, nor the fact that the annual maintenance costs increase as more switches are added to the stack, while a chassis has a fixed maintenance cost no matter how many line cards are added.

      That said, sometimes lower upfront capital cost is more important to the business.
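
      (A minimal sketch of the shape of that comparison – the prices and support costs below are entirely invented, and the crossover point depends wholly on real pricing:)

      ```python
      # Invented figures to show the shape of the TCO argument: per-member
      # capex plus per-member annual support for a stack, vs a chassis
      # whose support cost is fixed regardless of how many cards it holds.
      def stack_tco(ports, years=3, ports_per_switch=48,
                    switch_cost=8_000, support_per_switch=900):
          switches = -(-ports // ports_per_switch)  # ceiling division
          return switches * (switch_cost + years * support_per_switch)

      def chassis_tco(ports, years=3, ports_per_card=48,
                      chassis_cost=25_000, card_cost=5_000, support=2_500):
          cards = -(-ports // ports_per_card)
          return chassis_cost + cards * card_cost + years * support

      for ports in (48, 96, 192, 384):
          print(f"{ports:>3} ports: stack {stack_tco(ports):>7,} "
                f"vs chassis {chassis_tco(ports):>7,}")
      ```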

      I have not found that chassis switches fail as much as you say, and have not had problems with stack connector failures requiring reseating, nor with bandwidth when appropriately used (the appropriate use of a 3750 switch stack being as an access layer for end users, IMHO).

      I have seen many instances of fools who built ‘core’ networks and collapsed ‘core’, ‘distribution’ and ‘access’ all onto a single stack of 3750s. I even saw critical data center Layer 2 interconnects built with 3750s. All were eventually replaced to resolve limitations of the 3750 (key limitations include the small TCAM and the maximum of 32 interfaces with HSRP on them).

      • disqususer2

        Sam,
        I would love to see the numbers you used to come up with this math.

  • Ramirezh

    What is your take (if any) on Juniper’s stacked switch implementation (Virtual Chassis)?

    • http://etherealmind.com Greg Ferro

      I have had a look at the Virtual Chassis Technology Best Practices document available on juniper.net. From this I would say that everything I said previously applies.

      However, I have never received any briefings, technical documents or a deep dive into the JunOS software, and have no view on how the software architecture looks. I know that JunOS is a distributed OS and may have features that inherently allow distributed processing (compared to Cisco IOS-SX software, which does not). With those caveats, I still feel that stackable technology is always weaker than chassis.

  • http://ccietrek.wordpress.com/ Jeff Rensink

    For the user closet, I think you’re right that chassis based switches are the way to go if you need larger port densities. In the Cisco world, if you’re doing 1Gb to the desktop, the 4500 or 6500 series becomes competitive after 3-4 blades.

    But for the datacenter, with the Nexus line, I think we’re going to see the virtual chassis becoming the norm. With the introduction of the Nexus 2K line, many of the negatives that came with the 3750 series stackables are eliminated. Port costs go way down, and availability should be much higher because they can be dual-managed.

    Once the new models come out later this year, I think the Nexus line and the virtual chassis model will start to usurp the 6500 as the dominant platform in the datacenter. That is assuming that the Nexus line lives up to its promises.

    • Rob

      Greg,

      Have you posted any thoughts or blessays on the Nexus line? Cisco is marketing them harder than I’ve ever seen their sales force work. I’ve heard little about them as of yet from the customer base.

      Rob

      • http://ccietrek.wordpress.com/ Jeff Rensink

        I think the reason you haven’t heard much about them from the customer base is because they aren’t that easy to get ahold of. There are definitely long lead times on this hardware.

        Look for a bunch of new hardware to come out over the next 2 quarters across each of the Nexus hardware platforms. I think the combo of what will be available by the end of the year will really make the Nexus line the platform of choice in the data center.

      • http://etherealmind.com Greg Ferro

        I have information on the Nexus 7000, however it is under NDA and I can’t write about it. I’m not exactly sure when I would be allowed. Any Cisco Marketing person who can get in contact and let me know what I can write? Contact or email myetherealmind — at __ gmail-com.

  • http://blog.michaelfmcnamara.com Michael McNamara

    We’ve actually moved toward stackables to help ease our cable management headaches. Originally we found that stackables were much more economical in terms of cost per port. However, chassis solutions have definitely come down in cost and are now very cost effective.

    Using Avaya/Nortel Ethernet Routing Switch 5520s, we deploy in stacks as large as 8 switches (384 ports). We place the stackable switches in between the 48 port patch panels and then utilize 1 foot patch cables between the panel and switch. This helps greatly in terms of cable management and cooling, and also greatly speeds break-fix replacements. It’s worked wonders for us… especially where you have multiple departments/people working in the closets and they aren’t all equally motivated to keep the cabling plant neat and orderly.

    In the old days you’d have 300+ lbs of cabling hanging in front of the chassis-based switch. It would take you 60 minutes just to label all the cables to replace a defective blade/module, never mind the logistical task of actually getting the blade/module out of the chassis with 300+ cables draped in front of it.

    Thanks for another great article Greg!

    Cheers!

    • http://etherealmind.com Greg Ferro

      Possibly that’s because of the pricing model on the ERS 8600. Nortel charged an excessive premium on their chassis and made the stackables look cheap. If you believed in the Nortel story, then you tended towards stackables.

      However, I found the Nortel stackables unreliable and generally poor feature-wise. I moved away from the entire company as a viable supplier at that point (combined with the poor tech support outside the USA), so I can’t really say much about them.

      The cabling problem can easily be solved on chassis by using deeper racks and putting the patch panels at the rear of the switch and running cables from front to back. Much better than stackables.

      • http://blog.michaelfmcnamara.com Michael McNamara

        The ERS 8600 was no more expensive than the Cisco 6500. At that time the great debate was Layer 2 vs Layer 3 in the closet. Nortel and Cisco both wanted everyone to deploy Layer 3 in the closet, however there just wasn’t any justification. In response to the Cisco 4500, Nortel developed the ERS 8300 to compete with Cisco at that price point. In my opinion that’s when chassis-based solutions really became cost effective and a viable option, depending of course on density.

        I’ve utilized Nortel stackable switches for the past 13 years, and while they certainly have their issues (what equipment doesn’t?), they are very reliable in my opinion and have served us well.

        If anything though, I can’t overstate the impact of having 1′ patch cables in the closet. We’ve had numerous cabling solutions from companies such as Panduit and Ortronics, and in the end we always came back to a closet three weeks later to find a gigantic mess. While that’s been our solution, it certainly won’t be everyone’s solution.

        Cheers!

    • http://www.packetconsulting.com Packet46

      Greg,

      Fair point, but what about using ‘telco’ style cabling with a chassis-based switch?

      Fix the cables into the RJ45 switchports and run the cables via the sub-floor void into an adjacent rack. These fixed cables are presented into the top half of the cable rack and cables from the client/server end are presented into the bottom half of the rack. That way all patch cabling is within one rack.

      Normally used in big data centres but can be efficiently scaled down to suit localised deployments.

      The cost of an extra rack can always be offset against a clean, tidy, dependable and traceable cabling infrastructure where effective troubleshooting can be achieved without risking unplanned outages to other services.

      Cheers
      Dave

  • http://www.standalone-sysadmin.com/blog Matt Simmons

    I would think the management overhead would be enough to make me want to move to chassis-based switches, if I needed that kind of density.

    • http://blog.michaelfmcnamara.com Michael McNamara

      The Nortel stacks are managed as a single device with a single IP address with the switches in the stack appearing as cards or modules.

      Cheers!

  • Pavel Bykov

    Here are my inline commentaries about this post:

    You wrote: The manufacturing quality of fixed format switches is lower than that of chassis switches.

    I say:
    There are many considerations here, and this ends up as speculation. Just consider the 2500 (or the 2960 from the Sasquatch family). Manufacturing quality is just that, and whether it is higher or lower depends on many factors. Take the Mars Rovers as an example, whose manufacturing quality was predicted to provide for 3 months. Sure, I understand what you are saying, but I just don’t see how it can be true. I’ve seen a 4500 start burning, a 6500 start shorting, and a 12000 that would not turn on. The described pattern of better manufacturing quality was not observed.

    You wrote: More elements means less overall reliability.
    I say:
    Your second claim directly contradicts the first one. A chassis intrinsically has more elements than a fixed-format switch, so in this case you are claiming that chassis are less reliable.
    And yes, theoretically I agree. But then again, this is too simplistic a statistic to be relied upon in practical environments. Or as Mark Twain put it: “There are three kinds of lies: lies, damned lies, and statistics.” Ironically, one of the most failure-prone devices that I have seen was an O/E transceiver, which was only converting an optical signal to an electrical one. You can’t get many fewer elements than that, and yet there it was: one failure a week. There are many more practical examples like that.

    You wrote: Bandwidth Limitations and Shared Bus Architectures
    I say:
    Here you just get it wrong, because you are only partially right. You are right about 16 Gbps, but you completely miss the fact that there are TWO counter-rotating rings. It’s a two-ring system, and each of those has 16 Gbps, therefore together they really do have 32 Gbps. Want proof? Just take apart the 3750 and look at the chips. You’ll find Maxim MAX3780 chips there, which provide a quad bidirectional interface of 2.5 Gbps per channel. That’s 8G of clean bandwidth per chip. There are four of these in total. And they are bidirectional… No matter which way you look at it, it’s not 16 Gbps.
    And besides, you are just being plain ignorant here. What about the 8:1 oversubscription of the 6148 cards? What about the 6 Gbps per slot of the 4500? What about the simplistic SERDES that don’t even pass as an excuse for any sort of intelligence?
    The truth is, every platform has its bottleneck. And stack bandwidth is not really a good example.

    You wrote: Reliability and Availability are NOT the same thing
    I say:
    Yes, but then again, conceptually there is no protection against the failure of one line card. If a line card fails, all connected ports will lose service. There are no tricks and no magic.

    You wrote: Software reliability (a failure in a stack often takes the entire stack down)
    I say:
    How many times has a dual-supervisor device failed in a way that took down the device as a whole, with no switchover taking place? Just as many times as stack failures?

    You wrote: The Stack Connector
    I say:
    That’s your experience. I found that if I screw the connector on very tightly, it works after 4 years just as well as the first day.

    You wrote: Feature Poverty
    I say:
    Comparing anything to the 6500 can be a sin. There are very few hardware platforms that can offer so many features. Why didn’t you compare the 3750 with the 4500? The 4500 offers worse features in some cases than the 3750. Two-level shapers (a shaper within a shaper) are sci-fi for the 4500/Sup V, and so are the queue lengths and drop thresholds. Sure, the 3750s are a bit on the heavy side control-wise, but the 4500/Sup V cannot even shape properly.
    And besides, the 6500 is a chassis. It’s empty. It only has a clock, MAC addresses, connectors, and the wiring. There are no features in a 6500 chassis. That’s what you should have written about – comparing the line cards and supervisors, where the switch intelligence actually is.

    As the final note, it’s just wrong to look at platforms this way. The zen master should have asked the student this question: What differentiates one switch from two?

    • Kiwi

      Re: Quality
      My experience has been that the Cat3750 is one of the least reliable products I have used from Cisco. We run ~2000 Cat3750s worldwide and their favourite tricks are dead flash or a dead stack port. I would guess at around a 5-10% failure rate versus 2-3% on the ~200 Cat4500s.

      Re: Reliability
      I have two wiring closets – one with a 4506 and one with Cat3750s. If a power supply fails in each closet, how many ports go down?

      Re: Shared bus
      If I lose a switch or a stack port in my Cat3750 stack, what backplane speed do I have now?

      Re: Feature poverty
      Netflow – although I admit Cisco can’t make up their mind whether the Cat4500s should have this or not.

  • darkfluid

    I admit, the 3750 doesn’t stack up to the 6500, but I always deploy them instead of 4500s. I have had no problems with the stack ports, it’s easier to distribute budget where growth is needed, and they provide better features than the 4500. And now that we are starting to deploy the 3750-X, the much needed features from the 3750-E that we could not afford (redundant power) are here.

  • Jim Harris

    I don’t think the 3750 is a very reliable product when used in a stack. I have had two stacks go south on me in the past 6 months. One had a corrupt config and the other had a bad stack port. In the one case the whole stack of 5 switches went down, and in the case of the failed stack port, 3 of the 6 switches failed until I could diagnose and remove the failed switch. Even then, on the other two switches that were part of the failure, I had to wipe out the configs and rejoin them to the stack before they were back online. I say go with a chassis if you have the choice. A 3750 is fine as a standalone access switch.

  • BudgetNotUnlimited

    My experience has been somewhat different from the OP.

    For a new product line, we deployed stacked 3750s as top-of-rack switches. The long stacking cables are perfect for doing an every-other-rack connection and then looping back on itself. This saves real estate and power vs a homerun/patch panel/chassis switch deployment. Now, you’re correct that the 32 Gbps limitation is a design consideration.

    3 years later and we have had no issues with either stacking or switch reliability. Is it possible? Sure. It’s also possible for the configuration on any flash to become corrupted. That’s life. Is the flash/NVM on the 6500 more reliable? I’ve seen no data on that, but I suspect there’s no difference. You’re correct that more parts = more failures. What you fail to mention is that, if the network is well designed, these failures will be limited to impacting the directly connected, single-homed devices (and I can tell you that replacing a 3750 with a spare is a lot easier than replacing many components on the 6500).

    The 6500 can go faster and support more granular levels of control. The questions that always need to be answered are: “How much control does this deployment need?” and “How much aggregated bandwidth capacity does this deployment need?” The cost differences over the device lifecycles are considerable. The 3750 has a limited lifetime warranty and no-cost code updates. I can buy multiple spare 3750s for the cost of a single year of maintenance on the 6500. Stackables can be grown into, while the chassis is mostly capitalized up front. Add in the cost of electrical power for the TCAM, and saying they are comparably priced is absurd. (Even more so when looking at the 3750-X series.)

    When given a mission critical component, I’d rather have my own spares *in hand* than 4 hour depot service.

    The vast majority of 6500s I’ve seen deployed have been performing almost no advanced features. They have been purchased by enterprises with a lot of cash who didn’t really question the recommendation. If you’re going to deploy fewer than 20 VLANs and your use of QoS is limited to voice, why buy a 6500 as anything other than a core?

    The article implies chassis switches are always the best answer unless you’re poor. That’s not even a good generalization. Use the right tool for the right job. It’s intellectually lazy and dishonest to your customers/employer to overengineer solutions just because it’s simpler.

    Use a hammer for nails and a screwdriver for screws. If people bought cars the way companies buy network gear, we’d all be driving Ferraris and Hummers.

    • http://etherealmind.com Greg Ferro

      It always surprises me when people tell me that they have no problems when stacking C3750s. I’ve had nothing but problems. I guess this means that the quality of manufacturing may not be consistent, and different people have different experiences of the product. Still, notwithstanding your experiences, we will have to see what the future holds. Ethan Banks also has some stories to tell about poor experiences with C3750 switches.

      I know the method of deployment that you use and it’s a good design. I’m a fan of it myself.

      All the best,

      Greg

  • Matt Hobbs

    “What’s interesting is that those same companies (provided they stay in business for long enough) always make a chassis based switch in the end. Case in point, HP ProCurve.”

    To be fair to HP, they have had chassis-based switches since at least 1996/97: AdvanceStack 2000, HP ProCurve 4000/8000M, 4100gl, 4200, 5300xl, 5400, 8212, etc.
    Sure, until recently most of their chassis designs only offered redundant PSUs and fans. But from a hardware reliability point of view, they very rarely failed.