My Way of Selecting a Cisco IOS Release with a Bug Scrub

Cisco has a business strategy of shipping products early, both software and hardware, to deliver new features quickly. But this leads to a reputation for buggy code, which means customers reporting bugs (and Cisco fixing them) becomes part of the testing process. Depending on your point of view, this means that you should never buy a newly released Cisco product unless you are willing to take this risk, or you jump for joy at the idea of the new features and just go ahead anyway.

This post looks at my process for analysing this risk and then selecting an IOS version by performing a bug scrub on IOS software. In this case, I’ve been asked whether the Cisco C3750-X switches are ready for live deployment and that’s what I’ll look at.

Start

My typical starting point is to understand all of the available options for a Cisco IOS version. At this time, my favourite place is the Cisco Feature Navigator (which doesn’t need a Cisco Login).

I assume that you understand Cisco IOS Licensing and Software Packaging in 15.0 and that you have a good comprehension of the Cisco IOS Software Release Model.

The first step is to work out what the mainline software releases are for the IOS on that platform. For low-end Cisco switches it’s always been IOS 12.2SE in the past but today the code trains are MUCH more complicated.

So, now I know that the mainline code trains are 12.2SE and 15.0SE.

So let’s select the most recent 15.0SE code train.

Rhetorical Question: Why does Cisco insist on those brackets in the version numbering ? Why not simply use a “.” instead? Is there a reason for obfuscating the syntax like this ? Seems silly to me.
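If you script anything against version strings, the bracketed form is easy enough to split apart. Here is a minimal sketch in Python, assuming the common `major.minor(maintenance)train[rebuild]` layout such as `15.0(1)SE2`. The names `IosVersion`, `parse_ios_version` and `dotted` are my own for illustration, and the pattern is deliberately simplified: it will not handle every suffix Cisco has ever used.

```python
import re
from typing import NamedTuple, Optional


class IosVersion(NamedTuple):
    major: int
    minor: int
    maintenance: int        # the number inside the brackets
    train: str              # e.g. "SE"; empty string if no train letters
    rebuild: Optional[int]  # e.g. 2 in 15.0(1)SE2; None if absent


# Assumed layout: major.minor(maintenance)train[rebuild]
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)\)([A-Z]*)(\d+)?$")


def parse_ios_version(text: str) -> IosVersion:
    """Split a bracketed IOS version string into its parts."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised IOS version string: {text!r}")
    major, minor, maintenance, train, rebuild = match.groups()
    return IosVersion(
        int(major),
        int(minor),
        int(maintenance),
        train,
        int(rebuild) if rebuild is not None else None,
    )


def dotted(version: IosVersion) -> str:
    """Render the version in the dotted style, e.g. 15.0.1SE2."""
    suffix = version.train
    if version.rebuild is not None:
        suffix += str(version.rebuild)
    return f"{version.major}.{version.minor}.{version.maintenance}{suffix}"
```

With this, `dotted(parse_ios_version("15.0(1)SE2"))` gives the dotted form I use later in this post.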

If you understand the Cisco IOS release patterns, you’ll know that after the initial release Cisco will receive “feedback” from customers in the form of TAC Cases describing bugs. Typically, every four to six weeks a new point release will ship that rolls up the fixes that have been committed to the code tree. Theoretically, the latest version will have fewer bugs, except where new bugs have been introduced (sorry, a bit cynical about that).

Of course, customers will continue to discover bugs and report them to the TAC (because Cisco’s testing of software doesn’t seem to be reliable, repeatable or comprehensive or …. something) so this will continue for the life of IOS.

So I’ve selected a feature set/license, and I’ll click on the “Release Notes” and …

Gritting teeth. Cisco IOS Licensing inspires nothing but fear and loathing. It’s an inchoate mess of “feature marketing” designed to grub a better profit margin for Cisco. A program which isn’t working because customers think they have been overcharged when they bought the hardware.

Damn. Clicking Release Notes has taken me to the Product Page.

Ok, maybe the website is actually working.

Oh. NO. IT. ISN’T.

Finding the Release Notes

Rethink the plan. Go to Support, Switches, Select 3750-X

Clicking, clicking, always the clicking.

Now we can click on the release notes.

The Bug Scrub

So, now I need to spend a day or two reading the Open AND Closed Caveats in the Release Notes. The Open Caveats will tell you what bugs are still known to be in the current version and haven’t been fixed. I’ll have to make a judgement as to whether they will affect my deployment.

The Closed Caveats will help me to gauge how faulty or badly coded this IOS release is. It’s my opinion that each IOS version is developed by a single team of developers, and some teams are good at testing and quality control, but many IOS trains seem to ship with a lot of easily detected problems.

I’ve often wondered whether Cisco’s policy of “ship early” means that programmers just ship the bugs and figure that customers will find them anyway. No need for them to work too hard at it. It’s a form of moral hazard that bothers me.

Reading the Caveats and researching the known bugs will inform me whether this IOS train is a train wreck. It’s my understanding that these are only the public bugs that customers have reported and don’t include the internal bugs found by TAC or SEs (these are not usually made public). Basically, I’m looking to build a level of trust in the technology on the limited information available.

Obviously, the longer a code train has been out there, the more information you will have. Don’t let the number of bugs distract you; keep focussed on the big mistakes (called “show stoppers” or Cat 1 Bugs). Also, check the date that the release notes were last updated (in this case, 23 Jan 2012, which is about six weeks ago, so there could be a lot more new bugs on a new product that has just started shipping).

If I need more detail about a Bug, then I’ll go to the Cisco Bug Toolkit and look up the complete details of that bug. Sometimes this entails reading bug reports for several hours or even days. No, it’s not very exciting.

Reading the Caveats

So the Open Caveats shows some of what I call “forgot to test it before we shipped” Cisco behaviours.

  1. CSCtl60151
    • The switch might occasionally reload after experiencing a CPU overload, regardless of what process is overloading the CPU.
    • There is no workaround.
  2. CSCtn11683 (Catalyst 3560-X and 3750-X)
    • A Catalyst 3560-X or 3750-X switch port might stop forwarding traffic. The packet counters increment for sent packets, but not for received packets.
    • The workaround, to bring up the port, is to save the configuration and to restart the switch.
  3. CSCtn46265 (Catalyst 3560-X and 3750-X)
    • When you enter the copy running-config startup-config privileged EXEC command on the switch, the running configuration is not always saved to the startup configuration on the first attempt.
    • There is no workaround. If you wait for a few minutes, the configuration is saved when the switch attempts it again.

In the Resolved Caveats

  1. CSCtq01926
    • When you configure a port to be in a dynamic VLAN by entering the switchport access vlan dynamic interface configuration command on it, the switch might reload when it processes ARP requests on the port.
    • The workaround is to configure static VLANs for these ports.

Some of these bugs look reasonably serious and suggest that either the testing coverage was not very comprehensive or the code has a significant number of flaws inserted by developers. There is a strong risk element to be seen here but overall, not too bad.
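The way I rank these caveats in my head can be written down roughly as follows. This is only a sketch in Python: the keyword list and the ordering rule (open before resolved, then show stoppers, then bugs with no workaround) are my own assumptions, not anything Cisco publishes, and the descriptions are shortened from the release notes quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Caveat:
    bug_id: str
    status: str        # "open" or "resolved"
    description: str
    workaround: str


# Hypothetical keyword list; tune it to your own idea of a show stopper.
SHOW_STOPPER_WORDS = ("reload", "crash", "stop forwarding")


def is_show_stopper(caveat: Caveat) -> bool:
    text = caveat.description.lower()
    return any(word in text for word in SHOW_STOPPER_WORDS)


def has_no_workaround(caveat: Caveat) -> bool:
    return "no workaround" in caveat.workaround.lower()


def triage(caveats: List[Caveat]) -> List[Caveat]:
    """Return the caveats with the riskiest first."""
    def risk(c: Caveat):
        # Tuples of booleans compare element by element, True > False.
        return (c.status == "open", is_show_stopper(c), has_no_workaround(c))
    return sorted(caveats, key=risk, reverse=True)


CAVEATS_15_0SE = [
    Caveat("CSCtq01926", "resolved",
           "The switch might reload when it processes ARP requests on a "
           "dynamic VLAN port.",
           "Configure static VLANs for these ports."),
    Caveat("CSCtn46265", "open",
           "The running configuration is not always saved to the startup "
           "configuration on the first attempt.",
           "There is no workaround."),
    Caveat("CSCtn11683", "open",
           "A switch port might stop forwarding traffic.",
           "Save the configuration and restart the switch."),
    Caveat("CSCtl60151", "open",
           "The switch might occasionally reload after a CPU overload.",
           "There is no workaround."),
]

if __name__ == "__main__":
    for caveat in triage(CAVEATS_15_0SE):
        print(caveat.bug_id, caveat.status, is_show_stopper(caveat))
```

Keyword matching like this is crude and will miss badly worded bugs, so treat it as a way to order your reading list, not as a replacement for reading the caveats.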

What about the 12.2SE software train?

All right, so let’s have a look at the 12.2SE code train. The 15.0SE release doesn’t look too bad, but it’s only been out for a couple of months, and there isn’t a lot of product out there.

Again, what I’m looking for is failures or bugs against basic features that are indicators for a bad overall code version. I’m not much interested in multicast bugs or weird bugs that only one person in the world would have.

  1. CSCth62705
    • If you configure an EtherChannel and add new domain members, CPU usage in the switch is unusually high. High CPU usage is seen also when you configure an EtherChannel and add EnergyWise-capable endpoints with a different EnergyWise domain to an existing domain.
    • The workaround is to disable the port channel where the high CPU usage is seen.
  2. CSCtg71149
    • When ports in an EtherChannel are linking up, the message EC-5-CANNOT_BUNDLE2 might appear. This condition is often self-correcting, indicated by the appearance of EC-5-COMPATIBLE message following the first message. On occasion, the issue does not self-correct, and the ports may remain unbundled.
    • The workaround is to reload the switch or to restore the EtherChannel bundle by shutting down and then enabling the member ports ……..
  3. CSCto14414 (Catalyst 3560-E, and 3750-E switches)
    • When you enable IP Address Resolution Protocol (IP ARP) inspection for selective Q-in-Q, IP ARP inspection drops all double-tagged packets even if you have enabled it on a C-VLAN or S-VLAN.
    • The workaround is to disable IP ARP inspection.
  4. CSCts34688
    • The switch crashes due to the “HACL Acl Manager” memory fragmentation when a large access control list (ACL) is modified.
    • The workaround is to add or remove ACE entries in sequential order when the ACL is modified.
  5. CSCth87458
    • A memory leak occurs in the SSH process, and user authentication is required.
    • The workaround is to allow SSH connections only from trusted hosts.
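
The filter I described before this list (ignore bugs against features or platforms I don’t run) can be sketched in Python. The feature tags are my own reading of each caveat, not Cisco’s classification, and the deployed feature set in the usage note is hypothetical. The platform restriction on CSCto14414 comes from the release notes quoted above.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class TaggedCaveat(NamedTuple):
    bug_id: str
    features: FrozenSet[str]             # my own tags, not Cisco's
    platforms: Optional[FrozenSet[str]]  # None: no platform named in the notes


CAVEATS_12_2SE = [
    TaggedCaveat("CSCth62705", frozenset({"etherchannel", "energywise"}), None),
    TaggedCaveat("CSCtg71149", frozenset({"etherchannel"}), None),
    TaggedCaveat("CSCto14414", frozenset({"arp inspection", "q-in-q"}),
                 frozenset({"3560-E", "3750-E"})),
    TaggedCaveat("CSCts34688", frozenset({"acl"}), None),
    TaggedCaveat("CSCth87458", frozenset({"ssh"}), None),
]


def relevant(caveats: Iterable[TaggedCaveat],
             platform: str,
             deployed_features: Iterable[str]) -> List[str]:
    """Return bug IDs that touch both my platform and a feature I use."""
    deployed = {feature.lower() for feature in deployed_features}
    hits = []
    for caveat in caveats:
        platform_ok = caveat.platforms is None or platform in caveat.platforms
        if platform_ok and caveat.features & deployed:
            hits.append(caveat.bug_id)
    return hits
```

For example, `relevant(CAVEATS_12_2SE, "3750-X", {"EtherChannel", "SSH"})` keeps the two EtherChannel bugs and the SSH leak, and drops the Q-in-Q caveat because the release notes list it only against the 3560-E and 3750-E.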

Conclusion

In this case, it seems clear to me that the C3750-X should use the 15.0SE release train. My logic here is that the product is new, and the 12.2SE train is not likely to get a lot of support for the long-term future of the product. It’s really there for people who don’t want to upgrade for fear of 15.0 code. There are many companies that would regard IOS 15.0 as a major change in standards and will not permit its use. (I make no comment about this).

IOS 15.0SE doesn’t seem to have too many problems. The bugs look reasonably low order, and not relevant to my use case. So I’ll select the 15.0.1SE2 version with some confidence.

The EtherealMind View

You can’t protect against bugs in software in any vendor’s equipment, but Cisco seems to appreciate their customers’ time and energy in performing and testing for bugs; I’ve done enough of it over the years. I’ve often wondered if there is a moral hazard [1] effect here. Is Cisco doing enough to make quality code ? I’m not so sure and I’m quite tired of it now. If I’ve spent a large amount of money purchasing a product, do I then have to test it to check it is fit for purpose ? This process is time consuming, boring, and does little to actually quantify and address the underlying risk. At best, I can make an educated guess on whether the product is ready for mainline based on ten or fifteen years of bitter experience.

I know about the Cisco Safe Harbor programs which offer certified ‘safe’ versions. But these are two or three year old IOS releases that have been proven to have reached a near zero defect reporting level aka no one is finding any more bugs. This also means old hardware and limited features.

If you have a reseller who sells a fair bit of Cisco, then sometimes they can help you. Of course, the reseller will need to have sold and installed that specific model of hardware/software, and deployed it and managed it for some time to be able to give you an honest answer. In my experience, resellers rarely have the time to follow through on questions of reliability and, when challenged, they usually say “Well, no customers have complained” or something that makes me lose faith in the answer.

To my knowledge, neither Cisco nor any other vendor publishes any good data on software performance or hardware defects on a per product or per version basis. I’ve often wondered why this is ? The medical industry expects to receive statistical product performance information, why don’t we get something similar ?

Currently my favourite method is to use my Twitter account to ask people for their opinions. I’m lucky to have a few thousand smart and knowledgeable people who can give me good answers based on their personal experiences. I do try to retweet people who ask those questions as well.

I hope you’ve found this useful. I’m certainly interested in hearing how other people resolve this problem! Is there a service or capability from Cisco that I don’t know about ? Are there better ways to decide the best version ? Please leave a comment and share your experiences.

 


  1. The risk that the presence of a contract will affect the behavior of one or more parties. The classic example is in the insurance industry, where coverage against a loss might increase the risk-taking behavior of the insured. In this case, developers get lazy because customers will report the bugs and they can be fixed later.
About Greg Ferro

Greg Ferro is a Network Engineer/Architect, mostly focussed on Data Centre, Security Infrastructure, and recently Virtualization. He has over 20 years in IT, working as a freelance consultant for a wide range of employers including Finance, Service Providers and Online Companies. He is CCIE#6920 and has a few ideas about the world, but not enough to really count.

He is a host on the Packet Pushers Podcast, blogger at EtherealMind.com and on Twitter @etherealmind and Google Plus

You can contact Greg via the site contact page.

  • Will

    This post is outstanding!!!  I bet it becomes the benchmark upon which many readers will determine the roll out of code.  I know it will here.

    Thanks so much.

    • http://etherealmind.com Etherealmind

      Thanks Will.

  • Stefan Mititelu

    Now take this entire very well documented and written article and try to guess what would change by replacing “IOS” with “NX-OS” … well … the latter (single train, so far) is actually an ongoing set of bugs with some working code associated with them.

    • http://etherealmind.com Etherealmind

      No comment…..:)

      • Josh

        Say what you will, and make your jokes, but i’ve been running 2 n7k’s for over a year and a half and haven’t experienced a single bug…

        • Mike

          Josh I’m thinking you’re the exception, not the rule

  • Ashley Young

    So, what we need then, is a Metacritic for IOS Releases?

    • http://etherealmind.com Etherealmind

      Yeah. Except there are too many versions of IOS for that to be useful.

      There are only about 1000 movies a year. I’m guessing that IOS has about ten thousand a year. :)

  • http://majornetwork.net/ Markku Leiniö

    Greg, you may want to use the “search and replace” again :-) Apparently you first wrote the article about Cat6500 because there are so many references to 12.2SX and 15.0SY, which only exist for Cat6500, not for any low-end switches like Cat3560/3750 series.

    • http://majornetwork.net/ Markku Leiniö

       (ok “low-end” was relatively speaking here)

    • http://etherealmind.com Etherealmind

      As I pointed out in the first paragraph, I’m looking at the C3750-X switch for the purposes of this article. Hence the references to 15.0SY.

      For other models, your mileage may vary.

      • http://majornetwork.net/ Markku Leiniö

        Any of the existing SY trains (12.2SY, 15.0SY) do not support Cat3750-X so I still don’t see the reason for mentioning them in this context.

  • Tom Hill

    I was on the phone with a Cisco Sales Engineer the other day, and heard him state that Cisco was a software company. I’d have to agree with that, just wish they were a good software company.

  • JamesWGreene

    Greg,
    Excellent and spot-on article. One thing we often find to be a challenge is when we are planning to upgrade code on a particular device and when reviewing the release notes come across a particularly vague bug (“under certain conditions device will reboot”). Many times we will press the vendor to elaborate as we would like to know something about the conditions and how much they might affect us, with varying degrees of success.

    Have you run into this and if so how successful have you been at getting more information? I find speaking to our SE has been better than going to TAC, but I was curious if anyone knew of a more effective way to get answers.
