Cheap Network Equipment Makes a Better Data Centre

TL;DR: A recent project bought a low-cost network for the data centre. It cost less than one-third of the market leader’s price & half the price of well-known merchant silicon vendors. As a result, it is planned to last for two, maybe three years before it is replaced. From this project I learned that “fast & cheap networking” could make a big impact on new data centre designs and business attitudes. Plus it was much more satisfying as a professional project. I’m now wondering: is networking too expensive?

The Details

During a recent consulting engagement I was asked to advise on a high performance 10GbE network. The target was more than 1000 ports of 10 Gigabit Ethernet at the lowest possible price while maintaining performance within a specific band. Criteria for bandwidth & latency were set, and contention was capped at a maximum of 2:1 because traffic volumes are high for the application.

This network is the foundation for a cloud computing application but is connected to the existing network at certain points. It supports a large, high performance computing platform including more than 5 Petabytes of high-speed IP Storage & more than five hundred x86 servers. The physical servers have dual 10GbE for performance (not redundancy) & each server will run a hypervisor with a high level of compute, memory & network utilisation.

The majority of network switches today are based on the Broadcom Trident II chipset, which has 48x10GbE & 4x40GbE interfaces. Using the 4x40GbE for ECMP connections would lead to a contention ratio of 480Gb to the servers & 160Gb up, or 3:1. This ratio might be suitable for many enterprises but is not suitable for cloud deployments where a physical server can host tens of virtual servers. Combined with IP Storage on this network, the utilisation is much higher than in some networks. As always, it depends™.
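
As a rough illustration of the contention arithmetic, the sketch below computes the ratio of server-facing bandwidth to uplink bandwidth for a leaf switch and checks it against the 2:1 limit set for this project. The port counts come from the text; the function and variable names are made up for the example.

```python
from fractions import Fraction

def contention_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """Return server-facing bandwidth divided by uplink bandwidth as an exact fraction."""
    downstream = down_ports * down_gbps
    upstream = up_ports * up_gbps
    return Fraction(downstream, upstream)

# Port layout quoted in the text: 48x10GbE to servers, 4x40GbE uplinks
trident2 = contention_ratio(48, 10, 4, 40)

# Project requirement from the text: contention of at most 2:1
MAX_ALLOWED = Fraction(2, 1)

print(f"contention {trident2.numerator}:{trident2.denominator}")
print("meets the 2:1 requirement:", trident2 <= MAX_ALLOWED)
```

The same function can be used to compare any candidate switch: more or faster uplinks lower the ratio, more server ports raise it.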


The Buying Process

In order to compare vendors equally, the bid process had the following structure:

  • vendor/reseller was to offer a solution that was fully supported & their own design
  • the bid must include all cables, SFPs, accessories and must be vendor certified (no grey or 3rd party)
  • the bid must include installation, testing & handover in the price
  • the bid must include three years of maintenance and will be purchased in a single order
  • the bidder was told that price was a high priority and should influence product selection
  • the bid described that very few features were required but named those that were mandatory

The Responses

I’ll highlight that most vendors/resellers were badly organised. The bid responses often did not address the requirements & many of the bids left out cables, maintenance & installation in spite of the bid clearly stating that all pricing must be included.

The cost to the customer in time wasted because of poor bid responses was substantial. Resellers & vendors wasted a lot of time asking questions that had little relevance to our requirements (which were clearly stated in the bid documents). We wasted many hours attempting to compare the offered solutions.

Some Observations

When buying a 10 Gigabit network, it became obvious that the largest cost is “vendor supported” cables & interface modules. The final cost of switch hardware was less than the total modules/cables cost for this project. This was unexpected.

Pricing on identical cables varied by as much as 10 times between vendors. It was a critical requirement that each vendor approve the design & bill of materials in order to ensure a fully supported network. This means that all cables were “genuine” & vendor certified. This variation in pricing was unexpected.

The casual observation is that cables appear to act as a “volume licensing” program for some vendors, which possibly takes advantage of the processes in some companies, but the disparity was shocking. [1]

Vendors & resellers do not know or understand how much power their devices use or how to assign a dollar value to running these devices on a yearly basis. Available documentation on device power consumption is awful. More needs to be done.
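
For what it’s worth, putting a yearly dollar value on a device’s power draw is simple arithmetic once the draw is known. The sketch below shows one way to do it. Every input here (250 W per switch, $0.12 per kWh, a PUE of 1.5, a fleet of 25 switches) is a hypothetical placeholder, not a figure from this project.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

def annual_power_cost(watts, price_per_kwh, pue=1.0):
    """Yearly electricity cost in dollars for a device drawing `watts` continuously.

    `pue` (power usage effectiveness) scales the draw to include cooling and
    facility overhead; 1.0 means no overhead.
    """
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * pue * price_per_kwh

# Hypothetical inputs, not from the project: 250 W switch, $0.12/kWh, PUE 1.5
per_switch = annual_power_cost(250, 0.12, pue=1.5)

print(f"per switch: ${per_switch:,.2f} per year")
print(f"fleet of 25: ${25 * per_switch:,.2f} per year")
```

The hard part in practice is the first argument: a datasheet maximum overstates typical draw, so a measured figure under realistic load is the better input.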

The Results

The final comparison of the bids showed a wide range of pricing. I’m not permitted to quote exact pricing so I’ll use numbers that show the relative scale.

Market Leading Vendor = $2.8MM
Alternate Vendors (multiple) with merchant silicon switches = $1.6MM
Lesser Known Vendor “B” = $800K

We evaluated the technology of the Alternate Vendors closely & then carefully analysed what functions would be lost with Vendor “B”. There are a number of tradeoffs here but a saving of $800K (or $2MM against the market leader) is a significant motivator to look at the solution carefully.

Instead of buying from a branded vendor, we bought the majority of the network from a little-known vendor. We also bought some equipment (about 10%) from a second vendor, for two reasons.

  1. A dual vendor purchasing policy helps to ensure competitive pricing. Previous experience with a single vendor has led to poor pricing & support. Quote: “When a vendor has to be competitive, they stay competitive.”
  2. There are certain features on the alternate platform that we needed. So we bought the second vendor products for those features.

The Outcome

The network is now nearing production readiness & moving through the final stages of acceptance testing. There are a number of small problems with implementation due to the usual poor planning by vendor & reseller but nothing that wasn’t expected by a suitably cynical & experienced purchaser.

We have already evaluated the next generation of networking hardware from mid-market network vendors. Because the network cost was so low, the ROI period is less than 2 years. As a result the current planning is to replace the network in 2 years. In practice, it will probably creep to three years but that timeframe is still half the previous ROI.
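
To make the replacement-cycle argument concrete, here is a back-of-the-envelope sketch of capital cost per year of service. It uses the illustrative relative-scale prices from the bid comparison, and it assumes (my assumption, for comparison only) that the market leader’s network would have been kept for six years like the previous network. It is straight-line arithmetic that ignores power, staff time and the cost of money.

```python
def annual_capital_cost(purchase_price, years_in_service):
    """Straight-line capital cost per year of service."""
    return purchase_price / years_in_service

# Illustrative relative-scale prices from the bid comparison
MARKET_LEADER = 2_800_000
VENDOR_B = 800_000

scenarios = {
    "market leader, 6-year cycle (assumed)": annual_capital_cost(MARKET_LEADER, 6),
    "vendor B, 2-year cycle": annual_capital_cost(VENDOR_B, 2),
    "vendor B, 3-year cycle": annual_capital_cost(VENDOR_B, 3),
}

for name, cost in scenarios.items():
    print(f"{name}: ${cost:,.0f} per year")
```

Under these assumptions, even the two-year cycle costs less per year than holding the expensive network for six, and the customer gets new hardware three times in that period.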

The Unseen Impact of Expensive Networking

The previous network ROI of six years was extremely damaging to the data centre. Simply, the network equipment was so “old” that it didn’t support simple but necessary features such as RSTP & PIM. The software on the current devices was not particularly reliable & had a number of quirks that made it hard to use.

Another interesting point was that the cost of networking was so high that they couldn’t afford a network person. The network was installed & effectively unmaintained until a problem emerged. This led to poor quality outcomes for server & storage & also a number of security issues.

Designing For Change

The final choice was an ECMP single-stage Clos network architecture, which makes it easier to replace the network devices. The ECMP network design offers growth & ease of operation. The current backbone has six spine switches with 32x40GbE interfaces connecting to 19 leaf switches. The next phase of growth will increase the spine to a larger size.
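
For readers new to ECMP, the general idea is that a switch hashes the fields that identify a flow and uses the result to pick one of several equal-cost next hops. The sketch below illustrates that idea only. Real switches do this in hardware with their own hash functions, so the use of SHA-256, the function name and the addresses are all assumptions made for the example; only the count of six spines comes from the text.

```python
import hashlib

SPINES = [f"spine{i}" for i in range(1, 7)]  # six spine switches, as in the text

def pick_next_hop(src_ip, dst_ip, protocol, src_port, dst_port, next_hops=SPINES):
    """Choose one equal-cost next hop by hashing the flow's 5-tuple.

    Every packet of a flow hashes to the same value, so a flow stays on one
    path while different flows spread across all paths.
    """
    key = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

# Hypothetical flows: same endpoints, different source ports
for sport in (49152, 49153, 49154):
    print(sport, pick_next_hop("10.0.1.10", "10.0.2.20", "tcp", sport, 3260))
```

Because the choice depends only on the packet headers, no per-flow state or controller is needed, which is a large part of why this kind of network needs so little day-to-day operation.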

We know that the current products lack many features that other products have but these gaps were easily solved using other technologies that cost little or were free, or by changing operational practice. For example, because the switches were so cheap, we bought separate devices to be the exit point from the underlay network & overlay network because it was operationally prudent to have the VTEP functions in a different box. The box was so cheap that it was worth the investment.

But we also know that the next generation of network switches based on Broadcom merchant silicon will offer new features & functions that are interesting for the use case. In particular, the 100GbE interfaces will provide extra bandwidth in the spine, which is expected to be important as the IP Storage traffic volume increases.

The EtherealMind View

When I first looked at the requirements for this project, it seemed obvious that a mid-tier networking vendor with merchant silicon switches would fit best. The final solution was surprising.

The most interesting lessons I learned could be summarised as follows:

  • ECMP network designs are awesome. They scale, they are easy to operate & look easy to upgrade (I say, look, because I haven’t done it yet & maybe it is harder than it looks).
  • Cheap hardware changes the way you build networks. Instead of spending hundreds of hours researching & justifying a single expensive purchase, the project was able to make a rapid decision & move into implementation. It was refreshing to move through the decision process quickly.
  • Cheap also means replaceable. Having a network at a reasonable cost means that the investment cycle can be radically changed. A faster investment cycle means regular upgrades.
  • Low cost got business attention. Saving a couple of million dollars really focussed the business on tradeoffs. We were able to get business acceptance on many new ideas simply because the dollars made it practical & sensible.
  • We bought some spare equipment because it was cheap. Instead of counting every tiny element & having endless discussions on components in the bill of materials, we simply bought extras. Instead of hours wasted on my consulting time, we bought a small amount of extra kit & focussed on the installation. Overall cost, including resource time, was reduced.
  • Features in the network are “missing” but they can be worked around with the right team & DevOps thinking. Integration between the storage, VMware, Server & networking solved these problems.
  • Some features will come later. We are working with the vendors to get early versions of code to access certain features. This network design is slightly ahead of the marketplace & the business accepts that some tradeoffs are necessary. We have designed some aspects to allow minimal interruption when the features arrive.
  • Go multi-vendor. The multi-vendor strategy worked better than I thought. I haven’t really experienced a multi-vendor network since the late 1990s and was somewhat wary of the idea. Instead of wasting time as I expected, both vendors got something & were willing to work harder because the next big purchase is just a couple of years away. I didn’t feel discarded like a dirty rag as has happened in other projects.

I would caution that this approach is not suitable for every company. But I think it shows that it is possible to drastically reduce the capital cost of building a data centre network. There are a number of tradeoffs in the implementation but, let’s face it, saving a million bucks can make those tradeoffs acceptable to the business. At a personal level, this was one of the most satisfying projects I’ve assisted on in a long time because the managers and engineers came out happy. That doesn’t happen very often.

This whole exercise seems to highlight an issue that I’ve been wrestling with. Is networking too expensive? This project suggests that it is.

  1. Many projects buy the least amount of hardware needed for the project. Buying a 10GbE switch & cables for a few ports meets the immediate requirements. The actual overall cost of the switch is massively inflated by costly “vendor certified” cables but it’s not visible to ITIL-compliant processes because ITIL does not consider overall costs, only incremental costs.
  • stu

    Great post Greg. I’m not surprised on the cabling and modules. I’ve heard it estimated that “Market Leading Vendor” has a $2-3B optics business. Existing installed cabling is often a huge inhibitor to most enterprises looking at upgrades. Very interesting to see what a different mindset from cloud/service providers can accomplish. I’m curious if there was consideration given for the cost of training and operating the environment which is likely different gear than the networking team is used to.

    • Etherealmind

      The organisation doesn’t have a dedicated networking team and, once I finish, there will be no dedicated resource to operate the network. The design has made several choices that reduce, remove and prevent any requirement for network operations. Which isn’t difficult if you can take a holistic view of the system.

      It took about two days to explain the operational requirements to the existing server ops and they will run with it. An ECMP/Clos network requires very little operation.

  • NotAllWisdomIsConventional

    Yes! Curious if you also ran the numbers with quality generic cables and optics. For our new building in 2010 the 1Gig optics alone were $2.3k vs. $45k. Plenty of spares; none needed yet. Can swap with OEM SFP if support asks.

    Was the maintenance contract driving the vendor-certified requirement?

    • Etherealmind

      It was important to the customer to have a vendor designed and endorsed solution in order to be assured of that solution. Some vendors have previously wiggled out of commitments and this was to be prevented.

      Therefore, only vendor supplied optics and coaxial cables were considered.

      The maintenance was simple: next business day hardware replacement, software maintenance and help desk. So yes, the vendor endorsement was important to ensure that support for the offered solution was directly the vendor’s responsibility.

      • NotAllWisdomIsConventional

        Any hints on which little-known vendor? Penguin, Quanta, Accton? Mellanox? Extreme? Or would Arista’s recent Trident II kit meet your price points?

        • Etherealmind

          The newer equipment from Cisco (Nexus 9000) and Arista was not offered & therefore not considered. The purchasing process is prescribed for this client.

  • PatG

    You really need to grammar check before posting

  • SilentLennie

    Overlays, routing and cheap switching. That is exactly what’s on my mind too.

    “look easy to upgrade”

    It is only simple if you don’t wait until the last moment to upgrade (you need to have some spare capacity while you are doing maintenance).

    • Etherealmind

      I’m being careful not to over-generalise here but simple has its own problems, as do complex or active networks. The decision is about what works for the current project.

      One thing to note is that I’m not building one network in the data centre, this network is just one of six or seven networks that form the ‘data centre network’ whole. The other networks have different requirements where there are active networks with L2 everywhere.

      • NotAllWisdomIsConventional

        Love to learn about what six or seven DC networks do, and how/why they’re partitioned. Those “active networks with L2 everywhere” especially intrigue me. I’m digging into options for a second small DC/SAN site with (probably two-way) VM failover–server guys want to boot an image with same IP at the second site when first goes down.

  • J Max

    One thing I learned from past experience and from this article is to have technology work for you rather than give you heartburn. What do I mean by that? Keep things simple and use technology to your and the business’s benefit, not as its ankle bracelet.

  • Louis-Philippe Theriault

    Using cheap networking boxes, I’m curious as to which L2 ECMP tech you used?

    I’m not seeing hints of SDN/Controller based stuff, and SPB/TRILL aren’t ready for prime time as far as I know…

    • Etherealmind

      This is one of the trade offs that was made. We use an L3 ECMP design only. I might write up more about those trade offs since a lot of people are interested in them

  • returnofthemus

    An interesting and informative read, but it appears to me that the objectives were quite clear from the start, hence ‘price’ would ultimately be the deciding factor, not all that uncommon given the current economic environment, especially with today’s plethora of options (much like shopping for car insurance).

    However, what was a rather good piece got somewhat marred by that last paragraph. Obviously it’s unclear what best practice was applied to this exercise; hopefully when implemented and operational the organisation will realise the intended benefit perceived.

    Appreciating that following an ITIL-compliant process in this instance may well have elongated the process (quantitative and qualitative measures), I’m not sure how you’ve come to your conclusion. Here is how it could have been applied to your scenario (though given the objective, I doubt any change in the outcome):

    1. Supplier – (ITIL Service Design) (ITIL Service Strategy): A third party responsible for supplying goods or services that are required to deliver IT services. Examples of suppliers include commodity hardware and software vendors, network and telecom providers, and outsourcing organizations. See also supply chain; underpinning contract.

    2. Supplier Management – (ITIL Service Design): The process responsible for obtaining value for money from suppliers, ensuring that all contracts and agreements with suppliers support the needs of the business, and that all suppliers meet their contractual commitments. See also supplier and contract management information system.

    3. Value on Investment (VOI) – (ITIL Continual Service Improvement): A measurement of the expected benefit of an investment. Value on investment considers both financial and intangible benefits. See also return on investment.

    ITIL is a best practice IT Service Management framework, not a religion!

    • Etherealmind

      I would be pleased to write about those topics if you would like to pay me to do so. Please don’t hesitate to get in contact.

      And ITIL is not best practice, it’s a bunch of ideas/methodologies created two decades ago to address the problems of that era. Time to move on to new ideas and enhanced methods that address current technology. ITIL/ITSM has no relevance to today’s technology and is obsolete.

      • returnofthemus

        LOL, alternatively you could pay me and I’ll walk you through the ISO/IEC 20000 certification process, also available for audits :-)

        First and foremost technology automates business processes and drives business innovation, the purpose of ITSM is to ensure the adequate delivery of IT services to the business, not vice versa.

        Yep, two decades on, four iterations and more relevant today than it has ever been due to the increasing adoption of cloud and cloud-based services.

        Who knows, maybe in its next iteration it will place special emphasis on ‘vendor-certified’ cables and optics, though I wouldn’t hold my breath.

  • Ethan Banks

    Great, thought-provoking experience you’ve had. One thought is that some networks can do with basic functionality and do just fine. Not every network device needs to have every knob, button and lever included in the package. A certain core set of features might be plenty. If that’s the case (and you can articulate it clearly during the discovery process), then a wide number of vendors become potential suitors. I inherited some devices on my current network that are incredibly powerful, capable routers. Will I use half of their features? No. Probably not 90%. And yet, these same devices are license-crippled in the one area where I really do need these routers to shine – bandwidth. They are capped until I pay more money to the vendor. Sad.

  • Etherealmind

    TL;DR: More bandwidth means no QoS. QoS is expensive to design and deploy. A cheap network can be replaced with higher bandwidth in a 2 year cycle, therefore QoS is not an issue.

    Remember that QoS technology does not scale to large networks and should not be attempted in a network of this size.

    The actual network in this project has 48x10GbE southbound and 12x40GbE northbound. With a contention ratio of 1:1 there was little urgency around QoS. In fact, the cost of implementing QoS would have been a significant percentage of the purchase price (around $100K for engineering time in design & deployment).

    The current generation of products based on the Broadcom Trident2 is not really what we need for a high density, high performance network because the 3:1 contention ratio (48x10GbE down, 4x40GbE up) in an L3 ECMP design is suitable only for low server density. In high density virtualization designs, the oversubscription is a major concern.

    Again, purchasing cheap equipment that cost a fraction of the ‘traditional’ approach means that the customer will look to upgrade in 2 years, when 100GbE can replace the 40GbE and there will be no requirement for QoS.