◎ What’s Happening Inside an Ethernet Switch ? ( Or Network Switches for Virtualization People )

I was going to call this article “Ethernet Switches for Virtualization Engineers” but, really, everyone should have some understanding of the internals of an Ethernet switch. In particular, I want to focus on how multicast and broadcast frames are handled in a high-speed, low-latency environment like a Data Centre Network.

It’s vital to understand that latency is critical to your application performance. It is common for a single transaction to take hundreds of round trips, so a small increase in latency on each round trip has a large impact on the perceived performance. The client will send a chunk of data and wait for acknowledgement. Even setting up the TCP connection takes a few round trips: remember that TCP sessions are set up with a handshake, and each data transfer is acknowledged.
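To make the round-trip arithmetic concrete, here is a minimal sketch. The figures (300 round trips, 100 and 150 microseconds per round trip) and the function name are assumptions chosen purely for illustration, not measurements from any real application.

```python
def transaction_time_us(round_trips: int, rtt_us: float) -> float:
    """Total time one transaction spends waiting on the network, in microseconds."""
    return round_trips * rtt_us


# Hypothetical figures: 300 round trips per transaction.
ROUND_TRIPS = 300
baseline_us = transaction_time_us(ROUND_TRIPS, rtt_us=100.0)
slower_us = transaction_time_us(ROUND_TRIPS, rtt_us=150.0)

# The per-round-trip increase is multiplied by the number of round trips.
extra_ms = (slower_us - baseline_us) / 1000.0
print(f"baseline {baseline_us / 1000:.1f} ms, slower {slower_us / 1000:.1f} ms, extra {extra_ms:.1f} ms")
```

An extra 50 microseconds per round trip looks negligible on its own, but across a few hundred round trips it becomes a delay measured in milliseconds per transaction.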

Figure: TCP Three Way Handshake

A modern network chassis switch (at time of writing) will have a latency of around 10 microseconds measured port to port. For example, a Cisco Nexus 7000 is about 8 microseconds and the Brocade VDX 8770 claims less than 5 microseconds. There are many reasons why a switch can be faster or slower, depending on silicon, backplane and architecture, but let’s consider just one.

Remember, the latency interval is the time taken to receive a packet, decode the address, look up the forwarding table, switch the packet (and copy it if needed) and transmit it out of an Ethernet interface. That’s really fast processing. How does an Ethernet switch do this ?
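One part of that budget is simply clocking the bits of a frame on or off the wire. The sketch below works out that serialization time for a 10 gigabit port. It is a rough calculation that ignores the preamble and inter-frame gap, and the function name and frame sizes are illustrative assumptions.

```python
def serialization_delay_us(frame_bytes: int, link_bps: float) -> float:
    """Time to clock a frame's bits onto (or off) the wire, in microseconds."""
    return frame_bytes * 8 / link_bps * 1e6


TEN_GIG_BPS = 10e9  # 10 gigabit Ethernet

# Minimum-size and full-size Ethernet frames.
for size in (64, 1518):
    delay = serialization_delay_us(size, TEN_GIG_BPS)
    print(f"{size} byte frame: {delay:.4f} microseconds")
```

A full-size frame takes a little over a microsecond just to serialize at 10G, which shows how little time is left for the lookup and switching steps inside a budget of 5 to 10 microseconds.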

Switch Architecture

Let’s consider a line card from a Nexus 7000 switch. The following diagram, taken from a Cisco Live 2012 presentation on the module architecture, approximates the silicon pathways inside a single M1 series N7K-M108X2-12L line card:

Figure: module architecture of the N7K-M108X2-12L line card

What They Do ?

What does each of those blocks, each one a silicon chip on the board, do ?

 

Switch Element: Description
Replication Engine: Frames that must be sent to multiple ports are duplicated and dispatched from this chip as needed (more below).
Forwarding Engine: This is the chip with the TCAM lookup tables; it makes the routing and/or switching decisions. In other words, it holds a table of addresses and output ports, e.g. an Ethernet frame with a destination MAC address 000c:1234:4567 is dispatched to Port 2.
VOQs: Virtual Output Queues. These are very high speed memory modules that perform frame queueing in silicon. Queueing is needed to ensure that the fabric is not overrun in the outbound direction. Also, packets arriving from the fabric must not overrun the MAC interfaces.
Fabric: Interface chip to the switch fabric. For the Nexus 7000, this is a five interface connection to the fabric modules in a Clos switch design.
10G MAC: Media Access Control for a 10 gigabit Ethernet port. Think of it as the signal encoder for the SFP interface.
Linksec: Encryption processor for line rate cryptography if you are using Linksec.
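To illustrate how the forwarding engine and the VOQs fit together, here is a toy software model. It is only a sketch: the real chips do this in TCAM and dedicated memory, and the class names, method names and queue depth below are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Optional


class ForwardingEngine:
    """Toy stand-in for the hardware lookup table: destination MAC -> output port."""

    def __init__(self) -> None:
        self.table: Dict[str, int] = {}

    def learn(self, mac: str, port: int) -> None:
        self.table[mac.lower()] = port

    def lookup(self, dst_mac: str) -> Optional[int]:
        # None means the address is unknown to this table.
        return self.table.get(dst_mac.lower())


class VirtualOutputQueues:
    """One bounded queue per output port, so a busy port does not block the others."""

    def __init__(self, ports: Iterable[int], depth: int) -> None:
        self.depth = depth
        self.queues = {port: deque() for port in ports}

    def enqueue(self, out_port: int, frame: bytes) -> bool:
        queue = self.queues[out_port]
        if len(queue) >= self.depth:
            return False  # queue full: the frame is not accepted
        queue.append(frame)
        return True

    def dequeue(self, out_port: int) -> Optional[bytes]:
        queue = self.queues[out_port]
        return queue.popleft() if queue else None


# Usage: the example address from the table above maps to Port 2.
fe = ForwardingEngine()
fe.learn("000c:1234:4567", 2)
voq = VirtualOutputQueues(ports=[1, 2, 3], depth=2)

out_port = fe.lookup("000C:1234:4567")
if out_port is not None:
    voq.enqueue(out_port, b"frame-a")
print(out_port, len(voq.queues[2]))
```

Keeping a separate queue per output port means that when Port 2 is full, frames headed for Port 3 can still be queued and sent.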

Most of these functions should be obvious, but virtualization people considering VXLAN in a Multicast environment should have some awareness of the replication engine. 

Replication Engine

Most likely you have not heard much about replication engines in your switches. But since VXLAN has arrived we are seeing a lot more demand for Multicast in network designs. In simple terms, Multicast is a method for a server to transmit a single packet and for the network to duplicate it to as many clients as needed.

Think about that. Your network switch is duplicating Ethernet frames at wire speed, with a latency of around 5-10 microseconds. It can do this for hundreds of Multicast receivers in the network without you knowing (or caring) how it is done.

The replication engine also handles Broadcast and Unknown Unicast frames, so that ARP frames are handled efficiently and MAC flooding during address discovery doesn’t slow down the switch in other areas.
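The decision the replication engine acts on can be sketched in a few lines. This is an illustrative model only, not how the silicon is implemented: the function names, the port numbers and the `multicast_groups` mapping of group address to subscribed ports are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def egress_ports(
    dst_mac: str,
    ingress_port: int,
    mac_table: Dict[str, int],
    vlan_ports: Set[int],
    multicast_groups: Dict[str, Set[int]],
) -> List[int]:
    """Return the ports that need a copy of the frame (never the ingress port)."""
    dst = dst_mac.lower()
    if dst in multicast_groups:
        targets = multicast_groups[dst]   # known multicast group: subscribers only
    elif dst == BROADCAST_MAC or dst not in mac_table:
        targets = vlan_ports              # broadcast or unknown address: flood
    else:
        targets = {mac_table[dst]}        # known unicast: a single port
    return sorted(port for port in targets if port != ingress_port)


def replicate(frame: bytes, ports: List[int]) -> List[Tuple[int, bytes]]:
    """Produce one (port, frame) copy per egress port."""
    return [(port, frame) for port in ports]


# Hypothetical switch state for the example.
mac_table = {"00:0c:12:34:45:67": 2}
vlan_ports = {1, 2, 3, 4}
multicast_groups = {"01:00:5e:01:01:01": {2, 4}}

arp_copies = replicate(
    b"arp-request",
    egress_ports(BROADCAST_MAC, 1, mac_table, vlan_ports, multicast_groups),
)
print([port for port, _ in arp_copies])
```

A known unicast frame needs one copy, while a broadcast ARP request arriving on port 1 needs a copy on every other port. Doing that duplication in dedicated silicon is what keeps flooding from affecting the rest of the switch.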

It’s worth noting that cheaper or older switches used to perform the replication functions using a general purpose compute engine, which added a lot of latency. It took a long time to transmit the frame to the CPU, then the processing in the network OS took tens of microseconds. I’ve seen these networks melt down under specific circumstances.

Different Approaches

A word of warning. Don’t get too attached to the details of the technology as described above. There are a number of different approaches. For example, the following image shows the engine architecture for one of the Cisco Nexus F2 switching modules:

Figure: engine architecture of a Cisco Nexus F2 switching module

 

As you can see this architecture is significantly different from the previous module but most of the functions are still in place. The majority of the difference is that the F2 module doesn’t perform routing, only switching. As a network architect, you should understand your switch architecture so that you know where the performance problems might be.

Frame Walk / Packet Flow

Many people are not aware of the complexity of an Ethernet switch. Meeting the performance and latency targets requires a lot of specific features. Here are the steps through the module (and I don’t even mention the fabric switching).

Typical Frame Flow on a Cisco Nexus 7000 M1 Module


The EtherealMind View

An Ethernet Switch is complex and too many people think you just plug it in and it works. Because networking people are so clever we can do this. And because server guys are so dumb we don’t want to challenge you too much. :0

Be nice to your network, it’s working hard for you even if you don’t appreciate it.

Other Posts in A Series On The Same Topic

  1. ◎ What's Happening Inside an Ethernet Switch ? ( Or Network Switches for Virtualization People ) (11th January 2013)
  2. Tech Notes: Juniper QFabric - A Perspective on Scaling Up (14th February 2012)
  3. Switch Fabrics: Input and Output Queues and Buffers for a Switch Fabric (6th September 2011)
  4. Switch Fabrics: Fabric Arbitration and Buffers (22nd August 2011)
  5. What is an Ethernet Fabric ? (21st July 2011)
  6. What is the Definition of a Switch Fabric ? (30th June 2011)
  7. Juniper QFabric - My Speculations (1st June 2011)
  • Wes Felter

    You mean microseconds?

    • http://etherealmind.com Greg Ferro

      Yep. Thanks. Fixed.

    • http://twitter.com/filanthropic Ahmad Raza Khan

      Yes, it should be microseconds. The latest Nexus 3000 from Cisco has a latency of less than 250 nanoseconds.

  • John W

    Do you mean microsecond rather than millisecond?

    • http://etherealmind.com Greg Ferro

      Yep. Fixed.

  • http://umairhoodbhoy.net/ Umair Hoodbhoy

    I love the smugness!

    • http://etherealmind.com Greg Ferro

      Who ? Me ? :)

  • Michael Gonnason

    HAhaha “Be nice to your networks”

  • Infinite Monkey

    “The majority of the difference is that the F2 module doesn’t perform routing, only switching.” I thought the F2 modules added L3 support – just with some exceptions (if you need L3 performed on card, it must be in separate VDC – otherwise it becomes L2 only and must proxy L3 through M1 XL similar to F1 series).

  • BJ moore

    F2 has L3, F1 does not

  • Vikas Deolaliker

    Complexity? Did you mean commodity?

    What do you think of Intel’s seacliff fulcrum based switch chips? You could build a 10G ToR with SLB builtin with HW acceleration