Cisco has a business strategy of shipping products early to deliver new features quickly, and this applies to both software and hardware. But this leads to a reputation for buggy code, because customers reporting bugs (and Cisco fixing them) becomes part of the testing process. Depending on your point of view, this means that you should never buy a newly released Cisco product unless you are willing to take this risk, or you jump for joy at the idea of the new features and just go ahead anyway.
This post looks at my process for analysing this risk and then selecting an IOS version by performing a bug scrub on IOS software. In this case, I’ve been asked whether the Cisco C3750-X switches are ready for live deployment and that’s what I’ll look at.
My typical starting point is to understand all of the available options for a Cisco IOS version. At this time, my favourite place is the Cisco Feature Navigator (which doesn’t need a Cisco Login).
I assume that you understand Cisco IOS Licensing and Software Packaging in 15.0 and that you have a good grasp of the Cisco IOS Software Release Model.
The first step is to work out which mainline software releases exist for that platform. For low-end Cisco switches it’s always been IOS 12.2SX in the past but today the code trains are MUCH more complicated.
So, now I know that the mainline code trains are 12.2SE and 15.0SE.
So let’s select the most recent code train, 15.0SE.
Rhetorical Question: Why does Cisco insist on those brackets in the version numbering? Why not simply use a “.” instead? Is there a reason for obfuscating the syntax like this? Seems silly to me.
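For what it’s worth, the bracketed form does at least split apart mechanically. Here is a minimal Python sketch, assuming the common layout of major.minor(maintenance) followed by a train code and an optional rebuild number, as in 15.0(1)SE2. The names `IosVersion`, `parse_ios_version` and `to_dotted` are my own for illustration, not anything Cisco publishes, and real version strings have more variants than this pattern covers.

```python
import re
from typing import NamedTuple, Optional


class IosVersion(NamedTuple):
    major: int
    minor: int
    maintenance: int
    train: str    # e.g. "SE"; empty string when no train letters are present
    rebuild: int  # 0 when no rebuild number is present


# Assumed layout: major.minor(maintenance)TRAIN[rebuild], e.g. 15.0(1)SE2
_PATTERN = re.compile(r"^(\d+)\.(\d+)\((\d+)\)([A-Z]*)(\d*)$")


def parse_ios_version(text: str) -> Optional[IosVersion]:
    """Split a bracketed IOS version string; return None if it does not fit."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    major, minor, maintenance, train, rebuild = match.groups()
    return IosVersion(int(major), int(minor), int(maintenance), train, int(rebuild or 0))


def to_dotted(version: IosVersion) -> str:
    """Render the version with a '.' in place of the brackets."""
    rebuild = str(version.rebuild) if version.rebuild else ""
    return f"{version.major}.{version.minor}.{version.maintenance}{version.train}{rebuild}"


parsed = parse_ios_version("15.0(1)SE2")
if parsed is not None:
    print(to_dotted(parsed))
```

Anything the pattern does not recognise comes back as `None` rather than a guess, which is the behaviour I would want when comparing releases within one train.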
If you understand the Cisco IOS release patterns, you’ll know that after the initial release Cisco will receive “feedback” from customers in the form of TAC cases describing bugs. Typically, every four to six weeks a new point release will ship that rolls up the fixes that have been committed to the code tree. Theoretically, the latest version will have fewer bugs, except where new bugs have been introduced (sorry, a bit cynical about that).
Of course, customers will continue to discover bugs and report them to the TAC (because Cisco’s testing of software doesn’t seem to be reliable, repeatable or comprehensive or …. something) so this will continue for the life of IOS.
So I’ve selected a feature set/license, and I’ll click on the “Release Notes” and
Gritting teeth. Cisco IOS Licensing inspires nothing but fear and loathing. It’s an inchoate mess of “feature marketing” designed to grub a better profit margin for Cisco. A program which isn’t working because customers think they have been overcharged when they bought the hardware.
Release Notes has taken me to the Product Page.
Ok, maybe the website is actually working.
Finding the Release Notes
Rethink the plan. Go to Support, Switches, select 3750-X.
Clicking, clicking, always the clicking.
Now we can click on the release notes.
The Bug Scrub
So, now I need to spend a day or two reading the Open AND Closed Caveats in the Release Notes. The Open Caveats will tell you which bugs are still known to be in the current version and haven’t been fixed. I’ll have to make a judgement as to whether they will affect my deployment.
The Closed Caveats will help me to gauge how faulty or badly coded this IOS release is. It’s my opinion that each IOS version is developed by a single team of developers, and some teams are good at testing and quality control, but many IOS trains seem to ship with a lot of easily detected problems.
I’ve often wondered whether Cisco’s policy of “ship early” means that programmers just ship the bugs and figure that customers will find them anyway. No need for them to work too hard at it. It’s a form of moral hazard that bothers me.
Reading the Caveats and researching the known bugs will inform me whether this IOS train is a train wreck. It’s my understanding that these are only the public bugs that customers have reported and don’t include the internal bugs found by TAC or SEs (these are not usually made public). Basically, I’m looking to build a level of trust in the technology on the limited information available.
Obviously, the longer a code train has been out there, the more information you will have. Don’t let the number of bugs distract you; keep focussed on the big mistakes (called “show stoppers” or Cat 1 Bugs). Also, check the date that the release notes were last updated (in this case, 23 Jan 2012, which is about six weeks ago, so there could be a lot more new bugs on a new product that has just started shipping).
If I need more detail about a Bug, then I’ll go to the Cisco Bug Toolkit and look up the complete details of that bug. Sometimes this entails reading bug reports for several hours or even days. No, it’s not very exciting.
Reading the Caveats
So the Open Caveats show some of what I call “forgot to test it before we shipped” Cisco behaviours.
- The switch might occasionally reload after experiencing a CPU overload, regardless of what process is overloading the CPU.
- There is no workaround.
- CSCtn11683 (Catalyst 3560-X and 3750-X)
- A Catalyst 3560-X or 3750-X switch port might stop forwarding traffic. The packet counters increment for sent packets, but not for received packets.
- The workaround, to bring up the port, is to save the configuration and to restart the switch.
- CSCtn46265 (Catalyst 3560-X and 3750-X)
- When you enter the copy running-config startup-config privileged EXEC command on the switch, the running configuration is not always saved to the startup configuration on the first attempt.
- There is no workaround. If you wait for a few minutes, the configuration is saved when the switch attempts it again.
In the Resolved Caveats
- When you configure a port to be in a dynamic VLAN by entering the switchport access vlan dynamic interface configuration command on it, the switch might reload when it processes ARP requests on the port.
- The workaround is to configure static VLANs for these ports.
Some of these bugs look reasonably serious and suggest that either the testing coverage was not very comprehensive or the code has a significant number of flaws inserted by developers. There is a strong risk element to be seen here but overall, not too bad.
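The first pass of a scrub like this can be roughed out in a few lines of Python. This is only a sketch of how I might rank caveats before reading them properly: the keyword list, the scoring weights and the names `Caveat`, `score` and `triage` are all my own assumptions, and the sample entries are shortened from the Open Caveats quoted above (the first one has no bug ID in my notes, so it stays empty).

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical keyword list; tune it to the features your deployment relies on.
SHOW_STOPPER_WORDS = ("reload", "crash", "stop forwarding")
NO_WORKAROUND = "no workaround"


@dataclass
class Caveat:
    bug_id: Optional[str]  # None when no ID was quoted
    description: str
    workaround: str


def score(caveat: Caveat) -> int:
    """Crude score: 2 points for show-stopper wording, 1 more if no workaround."""
    points = 0
    if any(word in caveat.description.lower() for word in SHOW_STOPPER_WORDS):
        points += 2
    if NO_WORKAROUND in caveat.workaround.lower():
        points += 1
    return points


def triage(caveats: List[Caveat]) -> List[Caveat]:
    """Return the caveats ordered from most to least worrying."""
    return sorted(caveats, key=score, reverse=True)


open_caveats = [
    Caveat(None,
           "The switch might occasionally reload after experiencing a CPU overload.",
           "There is no workaround."),
    Caveat("CSCtn11683",
           "A switch port might stop forwarding traffic.",
           "Save the configuration and restart the switch."),
    Caveat("CSCtn46265",
           "The running configuration is not always saved on the first attempt.",
           "There is no workaround."),
]

for caveat in triage(open_caveats):
    print(score(caveat), caveat.bug_id or "(no ID quoted)", caveat.description)
```

A keyword score is no substitute for reading the caveat and looking it up in the Bug Toolkit; it only decides which ones get read first.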
What about the 12.2SE software train?
All right, so let’s have a look at the 12.2SE code train. The 15.0SE release doesn’t look too bad, but it’s only been out for a couple of months, and there isn’t a lot of product out there.
Again, what I’m looking for is failures or bugs against basic features that are indicators of a bad overall code version. I’m not much interested in multicast bugs or weird bugs that only one person in the world would have.
- If you configure an EtherChannel and add new domain members, CPU usage in the switch is unusually high. High CPU usage is seen also when you configure an EtherChannel and add EnergyWise-capable endpoints with a different EnergyWise domain to an existing domain.
- The workaround is to disable the port channel where the high CPU usage is seen.
- When ports in an EtherChannel are linking up, the message EC-5-CANNOT_BUNDLE2 might appear. This condition is often self-correcting, indicated by the appearance of EC-5-COMPATIBLE message following the first message. On occasion, the issue does not self-correct, and the ports may remain unbundled.
- The workaround is to reload the switch or to restore the EtherChannel bundle by shutting down and then enabling the member ports ……..
- CSCto14414 (Catalyst 3560-E, and 3750-E switches)
- When you enable IP Address Resolution Protocol (IP ARP) inspection for selective Q-in-Q, IP ARP inspection drops all double-tagged packets even if you have enabled it on a C-VLAN or S-VLAN.
- The workaround is to disable IP ARP inspection.
- The switch crashes due to the “HACL Acl Manager” memory fragmentation when a large access control list (ACL) is modified.
- The workaround is to add or remove ACE entries in sequential order when the ACL is modified.
- A memory leak occurs in the SSH process, and user authentication is required.
- The workaround is to allow SSH connections only from trusted hosts.
In this case, it seems clear to me that the C3750-X should use the 15.0SE release train. My logic here is that the product is new, and the 12.2SE train is not likely to get a lot of support over the long-term future of the product. It’s really there for people who don’t want to upgrade for fear of 15.0 code. There are many companies that would regard IOS 15.0 as a major change in standards and will not permit its use. (I make no comment about this).
IOS 15.0SE doesn’t seem to have too many problems. The bugs look reasonably low order, and not relevant to my use case. So I’ll select the 15.0.1SE2 version with some confidence.
The EtherealMind View
You can’t protect against bugs in software in any vendor’s equipment, but Cisco seems to appreciate the time and energy its customers put into testing for bugs – I’ve done enough of it over the years. I’ve often wondered if there is a moral hazard [1] effect here. Is Cisco doing enough to make quality code? I’m not so sure and I’m quite tired of it now. If I’ve spent a large amount of money purchasing a product, do I then have to test it to check it is fit for purpose? This process is time-consuming, boring, and does little to actually quantify and address the underlying risk. At best, I can make an educated guess on whether the product is ready for mainline based on ten or fifteen years of bitter experience.
I know about the Cisco Safe Harbor programs which offer certified ‘safe’ versions. But these are two or three year old IOS releases that have been proven to have reached a near zero defect reporting level aka no one is finding any more bugs. This also means old hardware and limited features.
If you have a reseller who sells a fair bit of Cisco, then sometimes they can help you. Of course, the reseller will need to have sold and installed that specific model of hardware/software, and deployed it and managed it for some time to be able to give you an honest answer. In my experience, resellers rarely have the time to follow through on questions of reliability and, when challenged, they usually say “Well, no customers have complained” or something that makes me lose faith in the answer.
To my knowledge, neither Cisco nor any other vendor publishes any good data on software performance or hardware defects on a per-product or per-version basis. I’ve often wondered why this is. The medical industry expects to receive statistical product performance information, so why don’t we get something similar?
Currently my favourite method is to use my Twitter account to ask people for their opinions – I’m lucky to have a few thousand smart and knowledgeable people who can give me good answers based on their personal experiences. I do try to retweet people who ask those questions as well.
I hope you’ve found this useful. I’m certainly interested in hearing how other people resolve this problem! Is there a service or capability from Cisco that I don’t know about? Are there better ways to decide the best version? Please leave a comment and share your experiences.
[1] Moral hazard: the risk that the presence of a contract will affect the behavior of one or more parties. The classic example is in the insurance industry, where coverage against a loss might increase the risk-taking behavior of the insured. In this case, developers get lazy because customers will report the bugs and we can fix them later.