Tech Notes: Juniper QFabric – A Perspective on Scaling Up

Juniper QFabric is a new approach to Ethernet Switch Fabrics. When it was announced last year, it was clear that the underlying physical design is a completely different approach to building switch fabrics. Here I'm taking a research-based look at how Juniper QFabric differs from all other approaches to the problem, and at some of the challenges ahead.

If you aren't familiar with Ethernet Fabrics, then my earlier post, What is the Definition of a Switch Fabric?, introduces the concept of a crossbar switch fabric. You may also want to understand the concept of lossless forwarding through a silicon fabric, which I've written about here: Switch Fabrics: Input and Output Queues and Buffers for a Switch Fabric, and also Switch Fabrics: Fabric Arbitration and Buffers.

Standard Chassis Layout

A typical chassis has a physical layout that looks something like this:

[Figure: a typical chassis physical layout]

Each line card connects to a central backplane. The backplane consists of individual channels to each blade, running down to the crossbar switch fabric on the supervisor in the chassis. All of the line cards then connect to a single input and output of the fabric via a connection on the backplane.

The net effect is something like this:

[Figure: line cards connecting to a single crossbar switch fabric]

This means that the size of a single lossless fabric is determined by the size of the chassis and by the size of a single silicon chip. Crossbar fabrics are complicated, hot and expensive, which limits the maximum number of I/O ports on a single chip. For more I/O, the cost of the switching chip increases exponentially as the silicon die grows in size.

The fundamental computing solution to this problem is to use a multistage switching architecture that wires the outputs of one switch chip to the inputs of another, which allows a relatively small silicon chip to be scaled up into a much larger fabric.

[Figure: a multistage switch fabric built from multiple crossbar chips]
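To get a feel for why the multistage approach scales, here is a small Python sketch using my own illustrative numbers (generic Clos arithmetic, not Juniper specifications) to compare the port count of a single k-port crossbar chip with a three-stage non-blocking Clos fabric built from the same chip.

    # Illustrative only: port scaling of a 3-stage non-blocking Clos fabric
    # built from identical k-port crossbar chips, compared with one chip alone.
    # These are generic Clos arithmetic figures, not Juniper specifications.

    def clos_external_ports(k: int) -> int:
        """External ports of a non-blocking 3-stage Clos made from k-port chips.

        Each edge chip uses half its ports for external connections and half
        as uplinks into the middle stage; a middle-stage chip with k ports can
        connect to at most k edge chips.
        """
        external_per_edge = k // 2
        max_edge_chips = k
        return max_edge_chips * external_per_edge   # k * k/2 external ports

    for k in (16, 32, 64):
        print(f"{k}-port chip: 1 chip = {k} ports, 3-stage Clos = {clos_external_ports(k)} ports")
    # 16-port chip: 1 chip = 16 ports, 3-stage Clos = 128 ports
    # 32-port chip: 1 chip = 32 ports, 3-stage Clos = 512 ports
    # 64-port chip: 1 chip = 64 ports, 3-stage Clos = 2048 ports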

For those people who are native Cisco-speakers, this is exactly how the Nexus 7000 is architected. The middle stage is the Fabric Modules (often known as FABs), and each line card has its own fabric. Here is a slide from a Cisco presentation deck showing the Fab-1 connections to Nexus line cards.

[Figure: Cisco slide showing Fab-1 connections to Nexus 7000 line cards]

All of this happens inside a single switch. This has been the limit of scaling switch technology, and even this is a recent innovation. The connection from the "Line Card" to the silicon is done using a high-speed backplane, but scaling beyond a single switch is done using Ethernet interfaces. Hence:

QFabric Scales Further

In my view, Juniper QFabric uses what I call the "Exploded Chassis". It takes the concept of "Line Cards" from a chassis and places those functions into a rackable switch; these are termed QF/Nodes [1].

[Figure: the "Exploded Chassis", with QF/Nodes connecting to the QF/Interconnect]

The QF/Interconnect is the silicon fabric for the "chassis". There are two broad types of line card:

  1. Interface cards that have 10G and 40G Ethernet ports on board and connect to the QF/Nodes.
  2. Silicon switch cards that carry the fabric chip and high-speed internal connections to other fabric chips so as to form a multistage Clos fabric.

The "Fabric Interface cards" [2] have the necessary Ethernet chipsets to connect to the QF/Nodes. Because the 40G and 100G Ethernet standards are still progressing, it's good value to have these as replaceable assets as the next generation of SERDES units moves to 25Gb/s lanes.

In the same way, upgradeable fabric cards allow new chips to be added as silicon gets smaller and faster and as development and testing is completed. Cisco has updated its Fabric Modules from 96Gbps to 550Gbps per slot, and it seems reasonable that Juniper will do the same.
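As a back-of-the-envelope illustration of the SERDES point, here is a trivial Python sketch using generic industry lane rates (my assumptions, not vendor data) showing how the same number of lanes delivers more bandwidth as the per-lane rate rises.

    # Illustrative only: Ethernet port bandwidth = lanes x per-lane SERDES rate.
    # Lane rates are generic industry figures, not Juniper specifications.

    def port_bandwidth_gbps(lanes: int, lane_rate_gbps: int) -> int:
        return lanes * lane_rate_gbps

    print(port_bandwidth_gbps(4, 10))   # 40  -> today's 40G (4 x 10G lanes)
    print(port_bandwidth_gbps(10, 10))  # 100 -> first-generation 100G (10 x 10G lanes)
    print(port_bandwidth_gbps(4, 25))   # 100 -> 100G with next-generation 25G lanes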

Exploded Chassis Design

Instead of having all the necessary elements in a single chassis, Juniper have created a "chassis" that spans the entire switched network. Importantly, the backplane can scale out much wider than the single pair of chassis in traditional designs that use MLAG or Borg-type approaches. The width depends on the number of uplinks from the QF/Nodes to the QF/Interconnects and on the capacity of the multistage Clos fabric architecture (there are practical limits). Currently it's 4 x 40G per QFX3500, but in future versions I'd expect up to 16 x 40G to be possible. Of course, the use of 100G allows for a different dynamic that uses fewer uplinks.
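To put some rough numbers on that, here is a short Python sketch of the edge oversubscription ratio. The port counts are my own assumptions for illustration (a 48-port 10GbE node, roughly QFX3500-like), not a Juniper datasheet.

    # Back-of-the-envelope: edge oversubscription = access bandwidth / uplink bandwidth.
    # Port counts and speeds are assumptions for illustration, not vendor specifications.

    def oversubscription(access_ports: int, access_gbps: int,
                         uplinks: int, uplink_gbps: int) -> float:
        return (access_ports * access_gbps) / (uplinks * uplink_gbps)

    print(oversubscription(48, 10, 4, 40))    # 3.0  -> 3:1 with today's 4 x 40G uplinks
    print(oversubscription(48, 10, 16, 40))   # 0.75 -> non-blocking with 16 x 40G uplinks
    print(oversubscription(48, 10, 4, 100))   # 1.2  -> fewer, fatter 100G uplinks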

It should look something like this.

[Figure: Juniper QFabric - Scaling Out Backbone]

The Control Plane

In order to scale out the forwarding plane into multiple units, the control and management plane is held in separate elements. It seems impractical to have the routing and switching protocol software running on the QF/Interconnects, whose function is to perform high-speed, low-latency forwarding.

And that's the purpose of the QF/Director. The solution requires two, which operate as Active/Standby engines. All BGP/OSPF/IS-IS/STP/TRILL etc. is handled by these components. In a real sense, this is a controller-based network, because the QF/Director updates the forwarding tables on the QF/Interconnects and QF/Nodes based on the protocols deployed, not unlike a switch supervisor or an SDN controller.

[Figure: Adding the QF/Director]
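The internal protocol between the Director and the other elements isn't public, but conceptually it behaves like a controller that computes forwarding state centrally and pushes it to every element. Here is a minimal Python sketch of that idea; all names and structures are hypothetical and are not Juniper's actual software or API.

    # Conceptual sketch of a controller-style control plane: a central director
    # runs the protocols, computes forwarding entries and programs every element.
    # Everything here is hypothetical; it is not Juniper's implementation.

    from dataclasses import dataclass, field

    @dataclass
    class FibEntry:
        destination_mac: str
        egress_node: str      # which QF/Node owns the exit port
        egress_port: int

    @dataclass
    class FabricElement:
        name: str
        fib: dict = field(default_factory=dict)

        def install(self, entry: FibEntry) -> None:
            self.fib[entry.destination_mac] = entry

    class Director:
        """Central control plane: learns state, programs every fabric element."""

        def __init__(self, elements):
            self.elements = elements

        def advertise(self, entry: FibEntry) -> None:
            # In reality this would be driven by BGP/OSPF/STP etc. running on
            # the Director; here we simply push the computed result everywhere.
            for element in self.elements:
                element.install(entry)

    nodes = [FabricElement(f"qf-node-{i}") for i in range(3)]
    director = Director(nodes)
    director.advertise(FibEntry("00:11:22:33:44:55", egress_node="qf-node-2", egress_port=17))
    print(nodes[0].fib["00:11:22:33:44:55"].egress_node)  # qf-node-2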

The QF/Director uses an out-of-band network to build reliable connections to all elements in the "virtual chassis". This network is built from Juniper EX-Series switches, with some specific design requirements that you need to fulfil.

Now we can see that the entire QFabric infrastructure looks like a traditional chassis, except that it is made up of many individual elements.

[Figure: the complete QFabric infrastructure as a distributed chassis]

Proprietary Backbone

The backplane between the QF/Node and QF/Interconnect is proprietary. Some have attempted to paint this as a negative feature. However, within any switch chassis, all of the connectivity, connectors and protocols are proprietary. For QFabric to function, the internal Ethernet connections must be proprietary in order to carry signalling data from the QF/Node to the QF/Interconnect. That is, the forwarding decision is performed at the edge of the network in the QF/Node, and the destination is tagged onto the Ethernet frame (along with much other data, I'm sure) so that the QF/Interconnect can switch the frame to the output port at high speed. So, sure, it's a proprietary backbone – it has to be.
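To illustrate the idea, the sketch below shows an ingress node prepending a small "fabric tag" that carries the pre-computed forwarding result, so the interconnect switches on the tag rather than doing its own lookup. The real QFabric header format is proprietary and unpublished, so every field here is invented purely for the example.

    # Illustration only: a made-up fabric tag prepended by the ingress QF/Node.
    # The genuine QFabric encapsulation is proprietary; these fields are invented.

    import struct

    FABRIC_TAG = struct.Struct("!HHB")  # dest node id, dest port, class of service

    def encapsulate(frame: bytes, dest_node: int, dest_port: int, cos: int) -> bytes:
        """Ingress node: make the forwarding decision and tag the frame with it."""
        return FABRIC_TAG.pack(dest_node, dest_port, cos) + frame

    def interconnect_switch(tagged_frame: bytes) -> int:
        """Interconnect: read only the tag, never the inner Ethernet frame."""
        dest_node, dest_port, cos = FABRIC_TAG.unpack_from(tagged_frame)
        return dest_node

    frame = bytes(64)  # placeholder Ethernet frame
    tagged = encapsulate(frame, dest_node=7, dest_port=42, cos=3)
    print(interconnect_switch(tagged))  # 7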

Cisco, Brocade, Dell and the rest do the same thing internally in their own switches. It's just not quite so obvious that the inside of a switch is a closed network :).

Hard, Really Hard

I've had discussions with vendors who point out that making this technology work is an achievement, and that it's sheer genius that Juniper got it working at all. It seems some companies would never have attempted such a product. Rumours are rife that the first internal chipset for QFabric didn't work out and that the current product is using a merchant silicon chip from Broadcom.

I'm not much bothered by this. It's the architectural principle that matters: the "Exploded Chassis" allows a scale-out Ethernet network of literally thousands of ports in a single distributed switch, which offers a lot of advantages, operationally and architecturally. Of course, if you can only think of Ethernet switching in terms of "two core switches, a rooted spanning tree, and two more layers with a sesame seed bun", then you may not perceive the value in this design.

Big

You've probably worked out that this is a big solution. I'm told that the practical starting point, for the price, is about 500 10GbE ports in a QFabric design. That's a lot of 10GbE in today's terms, where most data centres need maybe forty or so ports to support a handful of blade servers.

But for Internet Exchanges, ISPs, Service Providers and Cloud Hosting companies, there is a lot of value in this product. These are customers where big is better, and where easy configuration and management is also a requirement.

The EtherealMind View

The first thing I've noticed is that Cisco perceives this product as a threat. It's possible that once a customer buys into QFabric, Cisco is locked out of the account, because it's a platform, not a device. I'm hearing feedback that Cisco has prepared, and is delivering, an extensive "attack the weaknesses" campaign via their competitive marketing practice. That is unusual, as they will usually "play to their strengths", which signals that Cisco has a weakness here. In my view, Cisco has few strengths against QFabric for those customers where QFabric fits. For most Enterprise and Corporate customers, Cisco is a smaller solution and can be purchased in ITIL-sized chunks that suit project management processes just fine. Cisco doesn't have much to worry about in the short and medium term.

I am concerned about product reliability and technical excellence. There are a lot of moving parts in this system, and it will require a top-notch development process, disciplined testing and quality assurance to bring it to market. Juniper has a good pedigree, but the proof is in the execution. It's still early days.

I also find myself returning to the question of Juniper's commitment to the Enterprise. Somehow, I can't shake the feeling that Juniper really only wants to work with Service Providers, Carriers and the like, and that they continue to focus product and marketing in areas that aren't relevant to corporate network engineers. The sense that QFabric is only for Service Provider markets is something I cannot shake.

On the bright side, Juniper has delivered a true innovation here with a whole new approach to building Ethernet Fabrics. Other companies, such as Gnodal, are tackling the "Exploded Chassis" concept, and I suspect there will be more.

Finally, the centralised control plane is well placed for a secure virtualised hosting platform. Because the network control is centralised, it doesn't need MPLS, QinQ, or any other technical hack, overlay, tunnel or tag technology to provide secure separation. It also works well for a Software Defined Networking overlay, whether via OpenFlow, NETCONF or SLAX.

Let's see more innovation in networking like QFabric. We need it.

Disclosure

I have visited Juniper as part of Tech Field Day, which is a sponsored event. My accommodation and some entertainment were paid for as part of the event. I have also hosted the OpenFlow Symposium, where Juniper was a sponsor. There is no commitment to write about or discuss any topics as part of these events.

The opinions expressed in this article are my own. I made them up based on the information available – I hope they are correct. If you have comments, please leave them below and I'll do my best to respond when I can.


  1. In fact, these are QFX3500 Ethernet switches that can be used as traditional Ethernet switches with STP, QoS, routing, etc. When connected to a QFabric, they support the pseudo-backplane functions like a chassis blade.
  2. I don’t know what their real names are.

Other Posts in A Series On The Same Topic

  1. ◎ What's Happening Inside an Ethernet Switch ? ( Or Network Switches for Virtualization People ) (11th January 2013)
  2. Tech Notes: Juniper QFabric - A Perspective on Scaling Up (14th February 2012)
  3. Switch Fabrics: Input and Output Queues and Buffers for a Switch Fabric (6th September 2011)
  4. Switch Fabrics: Fabric Arbitration and Buffers (22nd August 2011)
  5. What is an Ethernet Fabric ? (21st July 2011)
  6. What is the Definition of a Switch Fabric ? (30th June 2011)
  7. Juniper QFabric - My Speculations (1st June 2011)
About Greg Ferro

Greg Ferro is a Network Engineer/Architect, mostly focussed on Data Centre, Security Infrastructure, and recently Virtualization. He has over 20 years in IT with a wide range of employers, including Finance, Service Providers and Online Companies, working as a freelance consultant. He is CCIE #6920 and has a few ideas about the world, but not enough to really count.

He is a host on the Packet Pushers Podcast, blogs at EtherealMind.com, and is on Twitter as @etherealmind and on Google Plus.

You can contact Greg via the site contact page.

  • http://twitter.com/tatersolid Ryan Malayter

    Now that I’ve thought about it more, QFabric is impressive from a “we actually built this monstrous thing” perspective. But is there really anything *revolutionary* there? It is, in the end, just a really big chassis switch – an evolution of the norm.

    I’m actually quite interested to see how far down-market they can take QFabric. Few enterprises can handle the level of investment required of the current offering, or even contemplate needing 500x10G fabric in one CapEx cycle. A 1-2U fixed-configuration version of the QF-interconnect might make things more palatable to the smaller data-centers found in most enterprises.

    Elimination of the OOB network is also something that I suspect would be desired for smaller deployments (why is that needed when you have so many other redundant links between QF-nodes and QF-interconnects?)

    • http://twitter.com/brandonrbennett Brandon Bennett

      Juniper mantra: separation of control plane and forwarding plane as much as possible. Although it would be nice if the management network could be built into the QF/Interconnect, or better yet the QF/Director, for smaller-scale networks.

  • http://twitter.com/cloudtoad Derick Winkworth

    You are mistaken, good sir, about the need for actual virtualization on this platform… it would be nice if it supported the concept of virtual switches, each "independent" of the others on the same chassis, each with its own VLAN ID space and spanning-tree process. It's just a switch. It's SSI-style virtualization… how does that eliminate the need for overlays?

  • http://twitter.com/brandonrbennett Brandon Bennett

    Just like how each linecard in a 6500 actually runs its own scaled-down version of IOS and they communicate over an internal switch to program things like FIBs?

    Is a chassis switch or router not a single control plane?  How about FEX, StackWise+, Virtual Chassis?  Are these a single control plane?

    A distributed control plane would be more appropriate, and that's how just about any technology achieves scale.  The QF/Directors are still in charge.

  • http://twitter.com/brandonrbennett Brandon Bennett

    For the "Service Provider only" argument: if you look at past Juniper product launches, such as the MX and the SRX, the largest box in the line was launched first and then scaled down.  For the SRX this went from a 120Gbps firewall down to a 700Mbit firewall the size of half a sheet of paper.

    QFabric is still new and it will be scaled down for more appropriate deployments.  Also, although QFabric is often represented as the entire data center and you will get the most benefit from a wide deployment of QFabric, there is NO reason why you have to rip and replace.  You can start by buying the QFX3500s (QF/Nodes as standalone switches) and hooking them up to your existing core.  At some point it may make sense to buy a couple of Interconnects and a couple of Directors.  Maybe later you replace your DC core with QFabric.

    For those who think that QFabric still doesn't fit well or isn't "ITILable" enough, the EX line with VC is still a perfectly valid and great data center design that will look more like existing deployments.

     

    • Mjkantowski

      Yes, you are right on.

  • Lukas Krattiger

    Looking at QFabric, I'm seeing a lot of components involved in building the whole fabric. Comparing this to a classic chassis approach falls a little short, as within QFabric some components are themselves chassis-based systems.
    Even if the N5k/N2k approach is based on a tagging protocol (VNTAG), it would be more comparable, and not only from a port-count perspective.
    Looking at both approaches, two questions come to mind:
    1. Do I really want such a big failure domain?
    2. How many points of management do I have?

    • Mjkantowski

      1. I don't think the failure domain is that big.  You can lose a whole interconnect and still survive with 1/2 the bandwidth capacity (increased oversubscription, in reality).  This is similar to losing one side of your VPC.  And if you scale your QFabric out with four interconnect chassis, then you are only looking at a 1/4 capacity loss in a failure.  Looking at more failures, such as a top-of-rack QFX3500, obviously that behaves the same way in almost any setup.  Then there are things like the directors that can fail, but those are at least clustered into an Active/Active pair.  Fabric/rear card failures in the interconnect chassis are similar to failures on other vendors' gear.  It's a decrease in fabric capacity between the interconnect chassis front line cards.  Front line card failures take out a portion of your capacity for any QFX3500s that are attached to the failed line card.  Then there are the 2x EX4200 virtual chassis that are set up for the MGMT plane and control plane, respectively.  I guess losing an EX4200 that was handling the control plane work for a set of QFX3500s is one of the worst failures you can have.  The MGMT plane connections would still be up (they are wired to a completely separate EX4200 VC… though these VCs are connected to each other, which brings up the next question).  I wonder if those surviving MGMT links to the QFX3500s can be used for control plane too in such a failure?

      2. What do you mean by point of management?  Every node has a MGMT interface plugged into the EX4200 virtual chassis.  As far as points of MGMT go, once you have it set up, it really seems like the only thing you need to deal with is the director.  Of course, there are things like the virtual chassis for MGMT and control plane that I don't think fall under the director.  So I guess you have those to deal with, separately.  But even that looks like just 2 switches since it's VC.

      Overall, I think QFabric is highly redundant and beautifully thought out and executed.  I haven’t had the chance to operate a deployment, so I’m speaking only about what I’ve read and talked to people about.  I would love to get one :)

  • Mjkantowski

    Don’t forget to watch the Juniper free web based training on installing and configuring QFabric. I couldn’t figure out how to give a good link, but your path is this:

     Learning Portal Home > Training Courses > QFabric Switch Installation and Initial Configuration-WBT

    So click “Education” off the main homepage, then “Courses”, then use the pulldowns to select product category “Switching” and product family “QFabric”.


  • Cford

    Hi Greg,

    I am a little late to the show here, but I am boning up on my competitive analysis and ran across your blog post.  There are many past and ongoing debates about this kind of architecture.  To be sure, this is not a new approach for networking… just a new approach for Ethernet networking.

    Both InfiniBand and Fibre Channel have long supported both centralized and distributed management models along with fabric topologies.  In reality, the choice is not really between a crossbar switch and multi-hop ASICs… it is always a combination of both.  How do you think packets are switched internally on the ASIC itself?  There are only a few ways to do packet switching efficiently… one is crossbar, another is shared memory.  The ASIC will typically use an internal crossbar, especially with Ethernet where there is a requirement for lots of buffering on the ingress or egress port.

    Then the network question is: what is the building block size and scaling model?  Port-to-port forwarding is much more easily managed inside a chassis, which typically has a local control plane.  This means that most switches have either been a "fat tree" or "Clos" fabric of discrete ASICs inside the box… or a distributed switch model with separate "fabric" modules and line cards.  Again, the fabric modules can be either dedicated crossbar switch fabrics with line cards carrying the edge buffering… or the line cards and fabric cards are just different configurations of the same ASICs.

    Brocade's first "director" class switch, the SilkWorm 12000, was a "fat tree" architecture where each line card carried a combination of external ports and fabric ports.  This was not a fully non-blocking architecture, but it didn't really matter for FC applications.

    InfiniBand has long taken this approach with a centralized subnet manager.  The nice part about the discrete ASIC approach is that it allows for much more flexibility in what you call a switch.  For example, there are already 10k-node IB fabrics.  These fabrics can be built with hundreds of small switches or tens of big switches… and they all work the same.  Sun actually built a 3000-port IB switch for a few large HPC customers.  In fact, each "line card" had 24 IB switch ASICs on it… and each "fabric card" had something like 32 IB switch ASICs on it.

    One downside of cascading ASICs is that you require many more "hops", which adds latency to the datapath.  For example, inside the Sun 3000-port switch, a single port-to-port path could be as many as 7 hops internally.  IB handles this well as the per-ASIC switch latency is extremely low.  Configurations such as QFabric also handle this by limiting the total number of hops supported… which ultimately limits the size of the fabric.

    Large switches with fabric boards made up of separate crossbar switches are nice because they can give you a very large port count and still maintain relatively low latency from port to port, as there is really only one hop port to port across the crossbar.

    QFabric has also taken the lead from InfiniBand, as well as the OpenFlow movement, in moving the path management outside of the box.  This enables a fully distributed switch architecture under a single control plane and allows for easy scaling… although the cable management is a real nightmare.  In this kind of fully interconnected mesh topology, each ToR switch also needs to connect to every other ToR switch… so you get a real rat's nest of cables across your racks… so cable simplification is not one of its benefits.

