Erratum: On Cisco IOS and QNX

In last week’s Packet Pushers episode, Show 177 – Current Practices, I speculated about the relationship between Cisco and QNX. It turns out that most of what I discussed was wide of the mark, and some of it was simply incorrect.

I received an anonymous email explaining that while Cisco is a licensee of QNX, its implementation uses very little of the QNX software. A company can choose to license the micro-kernel plus a wide range of other elements such as the network stack, GUI, CLI and I/O modules for storage. Instead, Cisco uses just the kernel in IOS-XR and develops most of the software itself.

This is in contrast to almost all other router vendors, where routes, ARP entries etc. end up being programmed into kernel space, using kernel data structures. That approach gets a basic box up much more quickly if you’re a startup, but much as I love Linux, these data structures are a long way off state of the art for scale and features. Plus, most code in kernel space can’t be restarted or upgraded without killing the OS instance.

So what is the value of a micro-kernel? I don’t have a good understanding of this topic, and since most vendors, including Cisco, are using Linux, does it matter? We know that Broadcom, Marvell and Intel produce the Linux software drivers that some vendors use in their implementations. Other vendors develop their own drivers which, as I understand it, improves quality or performance.

Using Linux as a microkernel provides just enough operating system to run the software drivers and applications that vendors add.

So QNX is used in IOS-XR releases, which run on the CRS and just a few other routers. While QNX may be a leaky boat attached to a sinking ship, the impact on Cisco is probably limited overall.

I’m hoping to organise a podcast in the near future. If you are someone who knows something about the software architecture, then get in contact and let’s see if we can share the knowledge with the community. People should understand more about software architectures on their network devices.

  • Alex Davydov

    Hello Greg,

    You can’t “use Linux as a microkernel”. A microkernel system (like QNX) has a really tiny core in kernel space, with everything else (drivers, network stack etc. included) living in user space. Linux is a monolithic kernel (architecture-wise). It requires a lot of stuff to be in kernel space, which means that stuff can’t be patched/restarted/… on the fly and you have to do a full reboot. OK for a server, but may not be good enough for a core router. Yes, Linux is modular, but a lot of things can’t be inserted as modules.

    Hope it makes sense.

  • James

    IOS-XR is QNX; IOS-XE and NX-OS are Linux; IOS is probably BSD based (not sure on this one). IOS-XR runs the ASR-9000 and CRS platforms.

    The difference between a monolithic kernel and a microkernel is that the microkernel’s only job is to facilitate message passing between processes and handle some very basic hardware I/O tasks, like handing off interrupts to processes. A couple of examples:

    To read/write to a file, the end-user process wouldn’t make an fopen() call; it’d actually communicate with the file-system process. This file-system process would then communicate with the block-device process for SCSI device 7, which would then talk to the kernel to do the appropriate low-level I/O.

    To communicate over TCP, the end-user process would communicate with the TCPd process (which would maintain all the TCP state necessary), which would communicate with the network stack (routing state), which would talk to the NIC process, which would then do the low-level I/O with the kernel.

    The advantage to this method is that it’s really easy to do component testing. Just take the TCP process, unhook all the cables, throw it up on the bench and run it through some simulations. It also makes the whole system rather robust. Memory corruption (via buggy code) can only affect that one service. How the process manager reacts to that crashed process can be a different story (oh no, the config manager process went away. I know! I’ll fix this by reloading!)

    The downside is that context switching is expensive. Instead of 1 context switch into the magical land of omniscience, you have 5 or so context switches between these isolated processes. For an OS whose job is to program hardware and then sit back, not such a big deal. For an OS whose job is data-pumping (webserver), that latency/overhead is critical. Some of that latency can be mitigated with multi-core boxes, but it’s still a concern.

    However, that microkernel architecture is part of why I can do a supervisor fail-over on an ASR-9k and my BGP neighbors are none the wiser apart from *maybe* a few TCP retransmissions and routing protocol lag. Seriously… Not even a soft restart. Each supervisor has a TCP process and a BGP process that synchronize between the two supervisors. When fail-over occurs, those processes are ready to go.

    It also aids in a distributed system like a router chassis. The supervisors each run an instance of the kernel, as do the forwarding cards. Each card then has its own set of processes. For example, each forwarding card has a netflow process, and only if it’s configured. I’m not sure on XR if the supervisors even have a netflow process on them.

    You’re somewhat right in that Cisco uses Linux as a microkernel. “Boy, you have a nice process scheduler and filesystem there. Network is rubbish, we have hardware to do most of that anyway, we’ll just throw away the rest of it…”

    That said, the Linux kernel still does a lot more than facilitate message passing. It is, by design, a monolithic kernel. You can rip bits of responsibility away from it, but it’s still a monolithic kernel for *everything* else.

    In the end, I think the software development practices really matter more than the micro/monolithic kernel. In XR, there are quite a few processes I can kill and let restart (BGP for example). There are also some that I can’t kill (killing the ipv4 RIB process will cause a sup fail-over, it’s just too complicated to recover from that). NX-OS, on the other hand, I haven’t seen recover from a single process crash, ever. Even in IOS-XE, if something goes wrong, you can likely look at the process list just to see the “iosd” process chasing its tail. XE has some components broken out into smaller chunks, but it still has this massive blob of a process that *is* most of IOS.

    This is my biggest point of contention with network vendors claiming “it’s Linux, therefore it’s better!” It’s only better if you built it to be better. It’s generally a good move to use another OS for the core bits, but your job isn’t over just because you did that.

    [end rant]
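
For readers who want to see the idea from the comments in code, here is a rough sketch of microkernel-style message passing and per-service restart. It is a toy model only: the `Kernel` class, the service names and the message tuples are all made up for illustration, and everything runs inside a single Python process, whereas real QNX or IOS-XR services are separate OS processes with their own address spaces. The `hops` counter is just a stand-in for the extra context switches James mentions.

```python
# Toy model of microkernel-style message passing. All names here are
# hypothetical; nothing below reflects actual QNX or IOS-XR interfaces.

class Kernel:
    """Does only two things: deliver messages and rebuild failed services."""

    def __init__(self):
        self.factories = {}  # service name -> callable that builds a fresh handler
        self.services = {}   # service name -> currently running handler
        self.restarts = {}   # service name -> number of restarts so far
        self.hops = 0        # messages delivered (stand-in for context switches)

    def register(self, name, factory):
        self.factories[name] = factory
        self.services[name] = factory(self)
        self.restarts[name] = 0

    def send(self, name, message):
        self.hops += 1
        try:
            return self.services[name](message)
        except Exception:
            # Only the failing service is rebuilt; the others keep their state.
            self.services[name] = self.factories[name](self)
            self.restarts[name] += 1
            return None


def make_block_device(kernel):
    blocks = {}  # state private to this service

    def handle(message):
        op, key, data = message
        if op == "write":
            blocks[key] = data
            return True
        return blocks.get(key)

    return handle


def make_file_system(kernel):
    def handle(message):
        op, path, data = message
        if path is None:
            raise ValueError("bad request")  # simulated bug in this service
        # The file system never touches storage itself; it asks another service.
        return kernel.send("block_device", (op, path, data))

    return handle


kernel = Kernel()
kernel.register("block_device", make_block_device)
kernel.register("file_system", make_file_system)

kernel.send("file_system", ("write", "/cfg", b"hostname r1"))
kernel.send("file_system", ("read", None, None))  # crashes file_system only
result = kernel.send("file_system", ("read", "/cfg", None))
```

In this sketch the bad request takes down only the file-system handler, which the kernel rebuilds, while the block-device handler keeps the data it was holding. Whether a real system recovers that gracefully depends, as James says, on how the software was built rather than on the kernel type alone.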