Cisco, Culture of Buggy Code and the Failure of the TAC

In recent months I seem to have hit a lot of bugs in Cisco software. Across the board on the main software releases of IOS, NX-OS or IOS-SX I seem to be hitting a wide range of bugs, and some of them are pretty stupid. And I’ve realised that, in recent years, it has become so commonplace, so accepted that we actually plan our projects with time to test, locate and check for bugs. And that’s become an expensive and time-consuming problem.

Why do we put up with this ?

How is testing done ?

I’m told that Cisco does very little of its own testing. It’s outsourced to a number of specialist companies in India who then perform specified test plans (under the terms of the scope of works) and hand the results back to the developers. However, IOS development consists of many different teams, e.g. IP Multicast and BGP are developed by two different teams, and tested by two different companies. Each team writes the test plans and then tests only its own code.

It appears that no-one is testing the convergence of BGP and Multicast because there is no team responsible for overall testing.

Process Failure

As a more concrete example, consider this bug report CSCts80367 Bug Details – AnyConnect 3.0 for Mac gets “Certificate Validation Failure” w/ ASA 8.4.

Symptom: AnyConnect 3.x for Mac gets “Certificate Validation Failure”

Conditions: AnyConnect 3.x for Mac connecting to ASA running 8.4 and using certificates to authenticate

Workaround: Downgrade ASA to 8.3

In other words, the certificate validation code in ASA 8.4 has not been fully tested against Cisco AnyConnect. That’s astonishing.
This is a serious failure of process to have let this bug into a shipping version of the Cisco ASA code.

Let’s consider the prerequisites, just to check if this is a unique problem:

  1. AnyConnect 3.x – common and Cisco is trying to “educate” users to move to AnyConnect.
  2. SSL VPNs – very common.
  3. ASA 8.4 – latest code with a lot of bug fixes for 8.3.
  4. Certificates for authentication.
  5. Mac OS X

These features are not some sort of unique corner condition that very few people use. This is an everyday, garden-variety use case that should be part of the standard acceptance testing. OK, so Macs aren’t common, but they are NOT rare.
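
To put a rough number on it, here is a minimal sketch in Python of how small such an acceptance matrix could be. The dimension names and the version lists are my own illustrative assumptions, not Cisco’s actual test plan:

```python
from itertools import product

# Hypothetical test dimensions. The values are illustrative assumptions only.
DIMENSIONS = {
    "client": ["AnyConnect 2.x", "AnyConnect 3.x"],
    "client_os": ["Windows", "Mac OS X", "Linux"],
    "asa_version": ["8.2", "8.3", "8.4"],
    "auth": ["password", "certificate"],
}


def acceptance_matrix(dimensions):
    """Return every combination of the given dimensions as a list of dicts."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


matrix = acceptance_matrix(DIMENSIONS)
print(len(matrix))  # 2 * 3 * 3 * 2 = 36 combinations
```

Under these assumptions the whole matrix is 36 runs, and the combination in the bug report (AnyConnect 3.x, Mac OS X, ASA 8.4, certificates) is one of them. A real plan would have more dimensions, but the point stands that this is not an exotic combination.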

That this bug exists tells me that internally, Cisco isn’t doing testing properly.

Does the TAC allow bad testing ?

In a word, yes. Cisco has rightly been lauded for the value of the TAC to customers for solving problems and building one of the best support organisations in the world. But I’m beginning to take the view that fixing problems that I should not have is not valuable. That is, the TAC has to be good because there are so many problems with the products.

In fact, if you aren’t thinking about it, you might say to yourself “Hey, Cisco fixed my problem so it’s all good”. What you should really be saying is “Another bug in Cisco software, time to lodge a complaint for shipping faulty product!”

Ten years ago, I could accept some bugs. I needed the new features, fast and furious. Today, I need stability and reliability because of the cost of managing faulty product. Consider the cost of managing MS Windows servers – just the monthly patching cost (which should be mostly unnecessary ) is enormous – it’s so bad that there are entire companies devoted to patch management. This is poor product causing this, and you are paying to fix it.

In the same way, the TAC kind of hides the fact that IOS quality is low. At least, unless you have Partner maintenance, in which case you are really going to have a bad time getting fixes.

Lessons from Apple

Once upon a time, I believed that my laptop needed several things:

  • Windows OS needed patching every month
  • Hardware replacing every nine months
  • Expect at least one, maybe two hardware failures in those nine months.
  • reinstall OS to blank formatted HDD, install Apps, restore data, and reset all defaults.

Since I moved to an Apple Mac about five years ago, I’ve not had a hardware failure, I’m on my second laptop, I’ve never had to reinstall the OS, and every upgrade has allowed me to carry my settings and data forward. Big difference.

That’s a qualitative experience where quality hardware and quality software gave me a great experience. That’s how I measure my vendors today. I demand nothing less.

What I want

  1. I want the confidence to say that it is unlikely that I will hit a bug. I accept that bugs are inevitable, but they should be an exceptional event, not something that we plan for. Can you believe that we actually expect bugs to be in the software and spend money to LOOK for them?
  2. None of us should pay for buggy software.
  3. We should not pay maintenance to fix a defective product, the vendor should make good on the promise of quality software and hardware. It should work as documented.
  4. Don’t accept second best. Lodge complaints with your account manager (if there are any left) or some other means. Blog about your bugs, and your experiences in getting them fixed.

Logically, Cisco must want customers to detect bugs for them – in which case I want free access to code updates AT LEAST. Or cheaper prices. Or both.

The EtherealMind View

In the past, Cisco has always shipped code early and told customers to ‘fill their boots’ and let them know if you find anything. I think it’s time for Cisco to take code and product quality seriously. Instead of relying on customers to find bugs and report them to TAC, Cisco needs to do their own testing.

And customers need to tell Cisco to improve their products. I can’t afford to be spending hundreds of hours testing for bugs, that’s what my vendor should do. If Cisco isn’t doing it properly, then I should be going somewhere else.

Quality matters. It really matters.

Postscript 20111116-1541

I should point out that most of this article applies equally to other vendors, but I haven’t had the privilege lately to work on other vendors’ kit. My customers are heavily into Cisco, therefore so am I. My previous experiences with Foundry (now Brocade), Juniper EX, and HP ProCurve switches, as examples, in years past have left a bitter taste and lots of ‘war stories’.

Vendors need to focus on software quality and great user experiences, and less on the rollout of ‘exciting marketing’ features. In the current economic climate, let’s talk about new features but be careful about introducing them. Or create stable code versions and unstable versions so we know when we are at risk.

I am calling out Cisco, because the TAC isn’t as good today as years gone by (budget cuts probably) and my bad experiences are certainly piling up. I’m hoping for some change.

Hoping.

Related

Ethan has blogged on similar topics at Packet Pushers about software problems at Cisco.


  • Email

    Have you heard of Juniper ;)

  • Jeff McAdams

    >Why do we put up with this ?

    Some of us aren’t. This is one of the reasons that we have been aging Cisco out of our network.

    Our choice has been Brocade/Foundry for most of our routing and switching, with a bit of HP Procurve at the edge.

    We haven’t looked back and have been extremely happy with the decision to retire Cisco.

    FWIW, we were also a Cisco VoIP shop, with Call Manager, Unity, and a boatload of 796x’s that have all been replaced (the Polycom ip670s are just better all around, and still less expensive).

    Truthfully, I don’t understand how anyone can still be an all-Cisco shop in this day and age. At the very least get some gear from another vendor in to keep Cisco honest (kinda like the advice to always make sure your Microsoft rep sees a Linux box somewhere in your office).

    • Josh Anderson

      The main problem I have with Polycom (and many other IP Phones) is that they are not aesthetically pleasing to many users. For engineers, that is rarely ever an issue, since functionality is king, but when you set a phone on the desk of someone who thinks they are important, they want it to look good. Sure, they may tell you that all they care about is dial-tone, but that is simply not the case. Sure, Cisco’s phones are expensive, but they sure look good (otherwise you wouldn’t see them on TV, right?), and for some organizations, that is more important than functionality.

      Personally, I’ve never been a huge fan of Polycom’s phones, but it’s been several years since I’ve used them, and I only used them with Broadsoft – perhaps new models in differing environments would be more appealing.

      • Jeff McAdams

        I don’t see that Cisco phones are any more aesthetically pleasing than the Polycom ip670 sitting on my desk…they’re both quite utilitarian in nature.  *shrug*

        And I don’t, for a moment, believe that you see Cisco phones on TV so much because they look good.  You see them on TV so much because Cisco pays for them to be there.

        The Polycoms have, hands down, the best audio of any phone I’ve ever heard (and we evaluated quite a few in this process), particularly when using the wideband codecs that they support, but even without the wideband codecs, they still have significantly better audio quality.  I’ve had several occasions when I’ve stuck my head in someone’s office to talk to someone that I heard in the office and only  then realized that they were on speakerphone.

        • http://etherealmind.com Etherealmind

          All desk phones are ugly. I mean, _really_ ugly.

          Designed by 60 year women who pine for the 1950s is my guess.

          I just unplug it and put it in my desk drawer.

        • Jmartinez

          Sorry but that Polycom phone is ugly! :)

        • Matt

          I think you’re looking at OLD Cisco phones, look at the 8945.

          It’s not acceptable to me that we live in a world of beta code. It’s not just Cisco, it’s everything nowadays.  The scary part to us is that it has finally trickled into a segment we never thought it would.  No wonder most VARs make 50 percent of their money now on services.

      • Will

        I can’t believe people still use ‘hard phones’.  MS OCS and Lync are all that any company should need (or a cell phone once presence is refined to something usable).

        I think Greg even wrote a post on the future of Cisco VoIP…lol.

    • Anonymous

      I don’t think you can limit this complaint against just Cisco. In other words, just swapping vendors does not resolve the software bug issues. Where I work we’ve had major software bugs with Brocade MLX devices, and as much as I love Juniper, JTAC has a latest recommended stable release which lags quite far behind the latest releases with the software features customers need/want. I think the main point is to hold our vendors, account teams, and TACs to a higher standard. We as customers deserve a better experience. These aren’t tinker-toys we’re buying but large, expensive equipment and we should not have to deal/worry/work around product deficiencies while you maintain such a large profit margin.

      • http://etherealmind.com Etherealmind

        That’s a good point. Although I think Cisco has more of a “ship it and let the customers find the bugs” culture than others, it’s time that quality of code became more important. I should not __expect__ to find bugs in latest generations of mature software such as IOS, IOS-SX or even NX-OS V5.

        Yet, that’s what it is happening. This is undermining the integrity of networking as key element of IT infrastructure.

  • http://packetpushers.net/author/ecbanks Ethan Banks

    Echoing the other comments, Cisco has been able to take their customer base for granted. Maybe it’s time for those of us stuck in the Cisco rut to get over the inertia and start looking seriously at other options.

  • http://twitter.com/avalonhawk Ed Weadon

    Similarly I’m seeing this a lot with the latest ASA code. The 8.4 train, while better than 8.3, is still buggy as hell; by comparison, 8.2 was nice and stable. I’ve hit so many bugs on the 8.4 code since I upgraded my main production pair that I’m actually considering downgrading back to 8.2. I just don’t look forward to the syntax and NAT changes I’ll have to go through.

    And yes, at this point, I’m seriously considering ripping these out and replacing them with Juniper or Palo Alto. A horde of angry Vikings might be better even… :)

    • http://etherealmind.com Etherealmind

      Yes. Very gun-shy of the Cisco ASA.

      In fact, anything from Cisco Security is suspect this year – they have produced a string of buggy code across most of their products.

  • http://twitter.com/krunalshah Krunal Shah

    Everyone has a bad experience with some product; in this case Etherealmind has had a bad experience with the ASA. I had bad experiences with Juniper support and with buggy Brocade code, especially in the early days of Multi-Service IronWare 3.9 and 4.0. Simple EtherChannels won’t come up with that code in certain situations.

    I work for a partner support organization for Cisco, Juniper and Brocade and interact with them daily. But let me tell you, I have had more bad experiences with Juniper and Brocade than with Cisco. Cisco TAC minimum experience level is 5 years in networking with a Bachelor of science university degree. Often I get a Juniper or Brocade TAC person who calls himself an engineer but does not know how to begin troubleshooting OSPF/BGP or even switching issues. In my experience, the time spent on the phone with Brocade or Juniper explaining the problem and identifying the issue is far more than the time spent with Cisco TAC. Getting an escalation TAC engineer at Cisco to work on a bug is much easier than getting the same resource at Juniper or Brocade.

    In terms of bug fixes, yes, Cisco and Juniper are faster than Brocade. In my opinion Cisco is the networking vendor that gives a lot of flexibility with its products from an administration and configuration perspective, and this is why its products are tested more heavily in the field than Juniper’s and Brocade’s.

    If someone takes the release notes of the latest IOS version and tests all the bugs against the equivalent JUNOS code, the JUNOS code will probably have fewer bug fixes. But if you take the JUNOS release notes and test them against the equivalent Cisco IOS, IOS might have more features and more bug fixes. Juniper recently improved its support and code quality with JUNOS 11.X, but it is also not fully tested in the field and proven to work.

    Vendors are always in a race to ship new feature to market with new code but that code takes time to get stable. After all customers are real test engineers for vendors to test the code.

    • http://etherealmind.com Etherealmind

      “After all customers are real test engineers for vendors to test the code.”

      No. NO. NO! We are customers, we should get stable, tested and working code. Vendors MUST do more testing and develop better processes.

      But what I object to are really DUMB bugs that should have been tested.

    • Mark Berly

      Your statement: 

      “Cisco TAC minimum experience level is 5 years in networking with a Bachelor of science university degree” 

      is incorrect. They will most definitely hire without those qualifications; many of the best TAC engineers had no formal degree, just lots of real-world experience. Equating quality of support with the amount of education somebody has is ridiculous. I have recurring nightmares of getting some newbie fresh out of college, or with a couple of years’ experience, on the phone.

      IMO…this is about poor code quality and forcing customers to be the test bed…

      • guest27

        Having a new hire fresh out of college on a TAC team in my opinion is an extremely valuable asset.  TAC is a fast paced learning environment and in a few months those kids are going to be solving issues that some people would never know where to start.  They have the best resources available to them for help, and they are motivated to learn as much as possible thus bringing forth the best customer service.  Yeah they might not know the answer off of the top of their head…but they sure as hell will do whatever it takes to find it for you and provide the best customer service possible.

        Obviously there are going to be experiences where having a new hire on the phone might not be the best situation, but these kids within a year are going to be fantastic troubleshooters and will be bringing your network back off its knees.  Don’t bite the hand that feeds!

        • Guest9999

          Sir,

          I hope you are joking, but if you are not, that explains the poor TAC performance which happens often.
          When I buy Cisco or contact Cisco I expect nothing but the best, because the price is also the highest!
          My 6 years of experience with TAC people is that I often get people who may have the knowledge but lack experience.
          If everything was about search and find, people would not be contacting TAC! (At least people who do their homework properly.)

          Regarding the buggy IOS software: this makes us real network engineers (not fresh-out-of-college, zero-experience peeps) look bad in front of other people, because people will remember that the network didn’t work, not that it was a bug caused by the fact that Cisco doesn’t test their IOS properly!

    • Guest71

      I was at Cisco.. and there were tons of features that were put out to customers without testing… the field sales had to bear the brunt of a CAP case regularly for that…

  • Will

    What a great post.  

    I’ve heard that companies (Cisco for example) sick of inane help-desk support for Windows PCs have migrated to Apple and found that tickets significantly decrease.

    I think it is impossible for a company in the networking field that provides as many features as Cisco does on its equipment to do any real testing.  There are thousands of variables that they’re bound to miss or by the time they test they’ll have lost market share or customers.

    I somehow have the feeling that the NXOS bugs are what have caused you to write this post and maybe that ASA bug put you over the top.  One of your posts from last year explained that the 7010s would require at least 2 upgrades in 2011.  I’m up to four as of last weekend.  I’m up to four on the 1000V this year as well.  I’ve also had a horrible experience with the ASA-SM shoved down my throat.

    Of course the alternative is regressing back to Cabletron.  Then again I cant ever remember upgrading Cabletron code back in the 90s. hmmmmm

    • http://etherealmind.com Etherealmind

      I feel your pain.

      Still, Cisco can and should do a better job of testing. Or provide a way of accessing code that is reliable.

      • http://twitter.com/jedelman8 Jason Edelman

        The interesting thing about this post is it’s all in relation to vertically integrated systems, i.e. hardware and OS from the same vendor, which isn’t a great thing or a bad thing.  It’s just what we know as an industry, and yes, all manufacturers have their fair share of problems.  Cisco will have more b/c they clearly have the market share in many of these areas of technology.  

        Regardless, the question I have is, and it is a semi-loaded one, with the slow evolution of software defined networking, will there be MORE or LESS bugs because we would now be dealing with SOFTWARE controllers in horizontally deployed systems, i.e. one vendor provides hardware, one provides the OS, one provides additional apps that are layered on top.  What should we expect from a stability standpoint for SDNs?  It will be interesting.

    • Ryan Malayter

      Will: “I think it is impossible for a company in the networking field that provides as many features as Cisco does on its equipment to do any real testing.  There are thousands of variables that they’re bound to miss or by the time they test they’ll have lost market share or customers.”

      There is, in fact, a solution to this problem, and that is proper automated unit and integration testing. It’s clear that some vendors just can’t do that well, or are so heavily invested in a legacy spaghetti code base that it is impossible to retrofit in a cost-effective manner. Post-release, every support case that is a bug should result in new automated unit and integration tests for future releases.

      There are far more complex and life-critical software/hardware systems out there than any Cisco OS. The only way these things can be built reliably is to be assembled from discrete components that have extensive unit and integration test coverage. I guarantee you, for example, that the software on Boeing 777 is a lot more complex than NXOS or IOS-XR, yet far more stable.

  • Bob

    It’s pretty disappointing really. If it were bugs for features not regularly used or bugs that only surface in weird uncommon configurations then maybe it wouldn’t be so bad. When it’s simple things like CDP, or spanning tree, and it’s blatantly obvious that even the bare minimum of testing hasn’t been done it gets very annoying. We were involved in EFT testing for some new code and pointed out a bunch of stuff that was very broken, and that we had been promised would be available in the production release for the EFT we were testing. They just ignored it and released the code anyway.

    Then, when we log a TAC case we end up waiting a week because the engineer has to find equipment to lab it and reproduce the problem, even after we’ve told them they can WebEx into our LAB (i.e. reproduce the problem without affecting anything in production)! TAC is nowhere near as good as it used to be. I’m currently looking at an open case where the TAC engineer requested a WebEx to see the problem on our equipment (again, it is LAB equipment on which I can reproduce the problem in about 5 min, including bootup time). This was requested a week ago and I replied immediately that anytime is OK for a WebEx. Still waiting.

    • Jeff

      Bob – can you send the SR # you are waiting for follow-up on to me at jzirker@cisco.com? I’ll make sure someone contacts you.

      • Jeff

        Bob?  Just checking in again.  You said you are waiting for a response from TAC regarding a recreate.  I can help you if you give me your SR#.  Regards.

        • Bob

          TAC have now been able to reproduce it. I appreciate your offer but I don’t see why a comment on a blog should allow me to get preferential treatment. I had already escalated this with our local SE anyway.

  • Etherealmind

    Your comparison to Apple is not valid.  I also have a MacBook Pro and a Mac Mini at home.  I have had the hard drive of my Mac Mini fail because of an ungraceful shutdown which the O/S could not recover from.  That was a painful reinstall.  I have several problems with my MacBook Pro, from the built in VPN client being buggy, to the laptop not coming out of sleep mode about 5-10% of the time, requiring a reload.  Again, not fun.  I know people who upgraded to Lion and are having wireless problems, among other things.

    There is no perfect software company.  As projects get bigger and more complex, customer demand for features continues to rise and competition pressures early release of software, you end up with this “perfect storm.”  Cisco has its problems and I hope they get fixed sooner than later, but I can tell you they are not unique and actually, the support structure often does help soften the blow.

    • http://etherealmind.com Etherealmind

      I don’t think I was claiming a bug/problem free Apple experience but it’s very close. The one time I’ve reinstalled Mac OS X from a Time Machine backup, it was much easier and more effective than a Windows restore.

      By calling out problems, I’m hoping vendors will fix them. Sometimes companies get complacent or lazy and pointing out their flaws will help them.

      I hope.

  • David Russell

    Sorry to tell you but I have used products from various vendors and they all have annoying stupid bugs and feature holes.

    A perfect example is from a recent NX-OS release where adding a PVLAN description breaks the access list.  How does that happen, or not get picked up in testing?

    What makes a difference to me is whether that company takes your report seriously and tries to identify the problem in their lab on their time instead of in your production network.

    • http://etherealmind.com Etherealmind

      See, that’s my problem. That sort of bug SHOULD never be allowed to pass testing. And if it IS being published, then the company is delivering sub-standard products.

      Fixing the product matters, but not having bugs matters much much more. Bugs cost me money & time & reputation. Working with the vendor to fix them costs me time & money. Installing the fix costs me money and time.

      Did I not pay the vendor to deliver a good product in the first place ?

      • John G.

        Sometimes these bugs are not as easy to catch as it would seem.  In many cases, the exact feature combination and specific configurations come into play.  This comes down to a balance of test time vs. how often features ship.  I would like to see a way to convince customers they have to wait for testing to be done, but many still want fast and furious and bug free.

  • LKM

    Wow, how many network devices do you deal with? You’re a CCIE that seriously thinks an operating system has anything to do with hardware failure? You honestly need everything handed to you on a silver platter? What happened to not blaming your tools? Get the job done with what you’re given. Windows / OS X / Solaris, doesn’t matter, just do your job.

  • http://www.facebook.com/profile.php?id=689232171 Art Fewell

    The problem is complexity and continuing to build on a legacy model. Fortunately the industry is solving that and I am looking forward to some slick controller software built by actual modern software gurus. There is no way, with the legacy overcomplexity in the system, that it’s not going to suck. Granted, they could do better QA, but I don’t see this getting better until I am working an SDN controller from a company with a software culture.

  • Mark Berly

    Cisco has forgotten about customer service. The TAC is just a band-aid on a much bigger problem at Cisco of code bloat, politics, ignorance and arrogance. Cisco can pull themselves out of their downward spiral but will not, as most of the executives care more about protecting their turf than doing the right thing for the company…

  • Zdrawcke

    Come on, grow up. If there was a better product with fewer bugs and for less money than Cisco’s offering, that would be a BUSINESS CASE for Cisco to have more bugs fixed and more testing in place. This market, like any other, is not about being good, it’s about being the best of the bunch. If you like Apple so much, why not replace Catalysts with Macs? Ah wait, they are not doing the same thing. Why compare Apples with oranges then?

    Apple has been around for some time with Macs, but I don’t see Bill Gates going down the drain. On the contrary, he has a bigger market share than Apple. It was not Macs that propelled Apple’s recent success, it was ipods and iphones. Guess what – Jobs did not invest in fixing bugs on Mac, making them thinner and prettier, he went ahead and introduced new features and products instead. That’s what brings food on the table – on the inside of the private jet.

    That’s how things work in business, the world of grown ups. You evaluate competition and balance your investment to make your product better in every way. Not good, just better than everyone else’s. If you can make it really good along the way, even better.

    We are all frustrated with bugs, but we live in world of ruthless competition and greedy corporations. All this is just a small part of bigger, nasty picture.

    • http://etherealmind.com Etherealmind

      As I’ve stated, I accept that SOME bugs will happen. My problem is that I’m now spending more money on testing for AND finding bugs. That leads to three problems:

      1. management doesn’t trust the network to be reliable
      2. the cost of networking is very high
      3. my vendors are producing second rate products while charging premium prices.

      You can choose to accept faulty products but I don’t. I AM changing vendors but it’s a slow process – it would be better if Cisco, Juniper etc would take more care and time over their software testing to have less bugs.

      Hopefully, the right people from Cisco are seeing this ( since they are already contacting me) and changing their plans.

    • branto

      The company best at MARKETING & SALES, not the technology wins.  Period.  Technology often only has to be “good enough” to perform a set of functions.  Same as in many fields (music, publishing, etc.).  The best at the craft in the field in no way equates to the best selling in the field.

      I would also like to argue the case that as we all progress in our careers, the issues we face tend to have the view of something that “should have been fixed” or “should not happen”.  I’ve opened up some mind-numbing TAC cases across multiple vendors.  It’s the same problem.  All vendors have it.

      I’m not sure that the distinction has been made, but Feature Velocity is mostly driven by customer requests, and I’m pretty sure that’s where most bugs are introduced.  If feature velocity slowed down some, TAC is your interface to the resolution of the bugs, but they are far-removed from their creation.  It’s really hard to find good people and keep them motivated in these front-line roles, especially when they have to eat daily bowls-of-shit from angry customers.

  • http://www.facebook.com/anyconnect Cisco AnyConnect Product Team

    Sorry that everything has not worked flawlessly for you and any others who have run in to problems.  This particular defect ended up being timing related, so while the bug description may sound like such a basic combination was simply broken (and should have been obviously caught as part of testing), the reality ended up being more complex which is how it unfortunately got missed. Either way, I am sorry that we let you down by not having everything work perfectly, as that is what is most relevant.

    All of our ASA and AnyConnect testing is performed by Cisco employees and/or automation run by Cisco employees (we don’t outsource our testing).

    Our focus is to provide a top quality product and we have tried to do so by automating a significant portion of our test plan, combined with manual testing when automation cannot be performed. We aim to fix as many critical issues as we can prior to release and when problems happen to be found post release, we try to resolve these in a timely manner by providing periodic maintenance releases.

    If you do not feel that our product is meeting your expectations or you have any questions, we make ourselves available directly to customers to communicate with us.

    Pete Davis, Product Manager – Endpoint & VPN
    Cisco Systems, Inc.
    ac-mobile-feedback AT cisco.com
    @AnyConnect:disqus

  • Cristian

    Beside buggy software, our experience with the Cisco hardware has been pretty bad recently.
    We purchased at the same time around one hundred Cat 4506 switches with Sup 6-E Lite supervisors. Once installed in production, a lot of these switches kept resetting by themselves. We called Cisco, and it turns out that a large number of switches have serious hardware problems. Around 30 of our switches have defective supervisors, and the Sups need to be replaced.
    Taking down a switch means going through the management process, and there is no fun in explaining to the business units why they need to accept an interruption. To add insult to injury, Cisco refused to replace all supervisors at the same time. We need to wait for the defective supervisors to fail, and only then will we get the replacement modules, one by one.
    We just upgraded the IOS in all the 2960 switches, and quite a few started reloading by themselves. Every time we call TAC for this problem, they send us a replacement switch. Looks like another botched batch of switches. Makes you wonder how Cisco tests their hardware before it is shipped to the customer…
    Cristian

    • Tomas Fidler

      HW testing is another case… Cisco uses companies like Foxconn to build devices from components and test the HW. Foxconn is then given rules, like no more than 0.1% returns, what tests need to be done, etc… In reality it is up to Foxconn’s managers how to fit into the “returns limit”… (every test costs money -> not all tests are necessary …usually :))

  • Bob

    I guess I should mention a really good experience we had recently with TAC for a problem we had with 7600 series routers using ES+ linecards. It wasn’t a bug so much as an issue with interaction between the RSP and the ES+ linecards. Within 15 minutes of logging the TAC case online I had an engineer on the phone discussing the case with me (configs, diagrams and sh tech etc. had been included when I logged the case). This was not a service interrupting issue as we were testing everything in our lab prior to deployment, so that response time was very good. The engineer was clearly not experienced with all the features involved and muted the call a few times, obviously to discuss with a senior engineer. After doing this a few times, the senior engineer was added to the call and we had a fix within 30 minutes of me logging the case with TAC.

    That is the sort of support I have become accustomed to over the years from TAC and sadly, it now seems to be an exception.

  • Anonymous

    Oddly, just as we could accept bugs 10 years ago, as network technology matures and grows new and unique features, so does the opportunity for new bugs to appear. If, after 15 years of IOS code, basic ACL or IPv4 routing protocol processing bugs were still present in 15.x code, then yes, I would agree with the premise of the article regarding stability and reliability, especially if ASICs and other hardware-related platform components had stayed in a vacuum. But with the advances and complexity in ASICs, processors, and the code that goes with them so that a router can implement TrustSec, voice, VRFs, bla bla (especially in the last 3 years), the bugs are present due to the number of features and may appear to be increasingly prevalent. Not to mention the shift from monolithic to non-monolithic code bases. But this may be the norm at this time in network technology history. I do agree that the vendors should do more testing to save us engineers time and give us confidence. But then again, the open source Linux distro types live and die by this environment and we seem to accept that usage. All OEMs, products, and features are not created equal, and their respective “figures of merit” are unique. Testing should be improved and increased, but these are businesses that have to get product out the door, so an all-encompassing testing schema that ensures nothing falls through the cracks is very difficult to achieve. Otherwise we would still be waiting for some features. There is always ISO certification ;)

    And wait for the OpenFlow stuff whoooo hoooo for the TAC.

  • Dan

    Yep, I can relate to this post… I bought a 3945E router last year and couldn’t run a publicly posted version of firmware that would work with all of my needed features for over a year! A firmware update for a switch recently broke RADIUS authentication and I just found a bug in ASDM for ASA.

    • http://twitter.com/IntrinsicNetwor Intrinsic Networks

      ASDM has its fair share of bugs, even if some are totally cosmetic. How often have you seen the warning pane asking if you want to apply your changes, selected “yes”, and then been told that “no changes were made”? Totally cosmetic, no impact at all, but surely somebody at Cisco must have noticed it as well?
      Barry

  • Tomas Fidler

    What I remember from software engineering:
     “A feature without a test doesn’t exist.”
     “If you don’t test it, it doesn’t work.”
    These simple rules are drummed into software programmers’ heads at college/uni.
    But proper testing can be costly (half or more of the cost of development), and if you cut the budget and want to keep up the speed of innovation… well, managers already found a solution ;)

    I’m curious what the honorarium was for this unique and smart idea: “Test in India, and please not too much”. Personally, I recommend giving that manager at least $1,000,000 :). And the other one, who is responsible for the decision to spread testing responsibility across many companies (so nobody is responsible for a failure, not even the “smart idea originator”), please give him another million :). Where to take this money from? Easy… from the sales department’s honorariums, because sales will shrink.

  • Anothermouse123

    One should note that IOS, IOS-XR and NX-OS have been run totally separately and could be from different companies as far as quality and process are concerned.  Trying to lump all 3 into the same category is really just bemoaning the focus of the industry.  However, don’t expect Cisco quality to be any better in a couple of years; they seem to be defocusing from quality and focusing on faster feature delivery.  Expect to see the ramifications from that in around 2 years.

    • http://etherealmind.com Etherealmind

      If they were really delivering features, I’d be less concerned. But fiddling with PfR, or adding new MPLS-TE functions, should not have an impact on IP Multicast or LLDP … or whatever.

      Code should be stable and well tested with regular testing regimes.

  • Barry Hesk

    Greg

    Spot on. I’ve been working with Cisco products for 20+ years. I fondly remember the days of *assuming* that everything I configure would work out of the box first time every time and rarely being disappointed. 

    My mindset has now changed. I’m now surprised when a new feature works. Generally this also involves downloading Engineering Specials from Cisco which I install. The new feature works, however some stuff which was working before then almost always becomes broken. 

    You are totally right. It’s all about desperately poor code quality and (lack of) testing. Some of the code that makes it out into production isn’t even beta quality. A few examples off the top of my head: 

    1. A version of IOS running on Catalyst 6500 VSS cluster that caused a crash and restart when a service module was reset. In our case this was applying a new signature file to an IDSM-2 – and caused the whole VSS to let go. 
    2. An issue with 15.1 trains that stops PPP working on an ADSL interface if it ever bounces following the router load. 
    3. An issue with 15.x trains that causes devices to lock up as soon as you enable debugs. How can you debug bugs if you can’t enable debug? 
    4. Numerous bugs in CUCM 6, 7 and 8 that have made my hair go white. Too many to mention including major issues even installing it. 
    5. An issue with the Cisco VPN client that doesn’t let it process 4096 bit root certificates which will never be fixed as far as I’ve been told. 
    6. Numerous bugs within ASA 8.2, 8.3 and 8.4 including crashes / reloads – often related to service policies and application inspection. 

    And this is just off the top of my head and has all happened within the last few months. 

    All of the above could and should have been spotted with any sensible test bed platform and process. My customers insist on thorough testing of all platforms (not just networking) before things get placed into production. I just wish Cisco would do the same. 

    I like Cisco products. I’m a previously loyal Cisco partner. Historically I’ve been able to justify the premium price tag on their solutions by talking about the quality of the software and the effectively “zero touch” support requirements once stuff was installed. Unfortunately, this is no more, and I’m forced to consider alternative solutions for my clients, who expect and demand my best advice.

    I hope Cisco listens. The fact that the AnyConnect Product Team have posted here is encouraging. I’d love to see similar postings from other teams. Admitting there is a problem is the first step towards a cure.

    Barry

  • Comments

    I totally understand your frustration. But in my personal experience, dealing with Cisco gear and with Cisco’s TAC has always been far less frustrating than dealing with Check Point, Juniper, HP, Brocade and Extreme. I really hate it if I find bugs within my first 10 minutes with a new product.
    That never happened to me with Cisco products. From my point of view Cisco is currently one of the best alternatives. But I think that depends on which product series and which features of these products you use.

  • Yap Chin Hoong -

    Sometimes it is not a bug that hits me, but some unknown modification to the IOS code. :-)
    http://www.itcertnotes.com/2011/04/hidden-and-undocumented-eigrp-too.html

  • Antonio Pezuela

    We are afraid whenever we have to update IOS. When they fix one bug, you think: it will fix our bug, but how many new bugs will it have now? So we prefer to keep our well-known bugs rather than discover new ones :-D

  • Guest 777

    I can relate to this thread.  I find bugs in the wireless LAN controller code all of the time.  As a matter of fact, I am running the “special engineering” code that solved some bugs just for peace of mind and stability. We will be needing some new features in the future and I can honestly say that I dread upgrading.  

  • Sam

    Another +1 here.

    I think the point is well made – yes, bugs will happen, however there are things that simply shouldn’t be making it out the door.

    My war story is on the ASR9000 (IOS-XR). Devices with version 4.0.3, no problem. Upgrade to 4.1 and discover that you can no longer add to prefix-sets. SMU to fix this had a 2 month ETA.

    Considering where Cisco target the IOS-XR based platforms, you’d think adding/removing entries from prefix-sets (in turn used by BGP etc) would be tested for sure. How confident do you think I am with XR after that little experience?

     

  • http://twitter.com/andrewjones141 Andrew Jones

    I had a classic example of this back in October 2010, when I appeared to be the first person to try to configure a stack of 2960-S switches to perform inter-VLAN routing. Turns out it would sit in the lab for a weekend without any traffic just fine, but as soon as it went into production it started to crash and reboot.

    Just a stack of 4 2960-S switches, about 4 VLAN SVIs and a static default route was enough to take it out. CSCtj37604 is the bug the TAC engineer had to create for it.

    If only I’d run some traffic through it for an hour or so before putting it into production, I would have saved a weekend of putting them in, patching everything, then pulling them out again when they failed….

  • Drpepperaddict6509

    If my client doesn’t give me permission to reload a router because of change control politics and blah blah, I just do a bug scrub to find the sweet buggy “show command” that causes a system bus crash… it has to be a good bug too, in case the router sends an AAA accounting packet back to the ACS server and logs what you did  :)

  • guest

    great blog,

    I really agree, the vendors need to test the stuff they are peddling, period. Keep on forcing pain upon your customers and over time you will experience the same fate as the Chryslers/GMs of the world: not just lost market share but loss of reputation, which is much harder to recover from. Reputations take a lifetime to build and only a moment to destroy. Stupid bugs that inflict misery on an organization are not acceptable and should not be inflicted upon the very people that your organization relies on to survive.

    What are you to do though? We are stuck between a rock and a hard place. Can we take a leap of faith on some other vendor’s kit? There is no bug-free software, and one thing TAC does bring to the table is eventual (not necessarily timely) resolution of the problem.

    From a personal perspective, after working exclusively with Cisco for the last 15 years, right now, the network I am looking after is totally Cisco free. It consists of Palo Alto firewalls, Procurve for the clients, A-Series for the DC.

    I do seem to be opening a lot fewer cases with vendor support than I am used to.

  • Boyan Biandov

    TAC has become a useless joke due to outsourcing. It used to be that you could skillfully skip India by timing your call to TAC so it ended up in San Jose or technology triangle; now 100% of calls go to India. You are stuck with an endless struggle to communicate and an unwillingness to escalate to the true experts, since if an “engineer” escalates a case one probably gets rupees deducted from one’s pay. Yet the Smart Net contracts are priced at market level? If they priced them at pennies on the dollar I would understand the level of service, but nope; some Zuckerberg wannabes up there want to play with real money but provide a substandard product.