If I Can’t Fix It, I’ll Break It. And That’s OK, It Needed Fixing Anyway.

When working in large companies, there are times when the fear of making any change means that a simple outage becomes a major problem. For example, I have had Core Switch upgrades delayed for nine months because of a refusal to accept that the network would be offline for 12 hours. Of course, when the Core melted down (as predicted) it took two hours to fix, since we just went ahead and railroaded the upgrades into place. Bit unlucky that it was the middle of the day, though.

Or the time that we spent preparing to cut over from the F5 to the NetScaler load balancers, but no one would agree until an unplanned outage happened, and then we just switched over at that point.

For moments like those, when you are sitting in front of the core switches and about to reboot, knowing that your butt is on the line, and your work colleagues are going to tease you mercilessly if you flub it, and the boss is going to blame it all on you, here is a little thought to help you through.

Etherealmind’s Action Motto : If I can’t fix it I’ll break it, in which case it needs fixing, anyway.

From time to time I get resistance to making changes to an existing network. This always strikes me as odd. No system can remain static; otherwise you end up with something that doesn’t change as the business changes. In no way am I abrogating change management and control. This is a vitally important function. But fear of making any change can mean that a simple outage can become a major problem because you don’t know how to troubleshoot.

Corollary : I won’t make it any more broken than it is now.

Remember, you learn the most from your mistakes. You don’t learn from doing things you already know.

If you can’t fix it, you’ll break it, and that’s OK, it needed fixing anyway.

About Greg Ferro

Greg Ferro is a Network Engineer/Architect, mostly focussed on Data Centre, Security Infrastructure, and recently Virtualization. He has over 20 years in IT, working as a freelance consultant for a wide range of employers including Finance, Service Providers and Online Companies. He is CCIE#6920 and has a few ideas about the world, but not enough to really count.

He is a host on the Packet Pushers Podcast, a blogger at EtherealMind.com, and is on Twitter as @etherealmind and on Google Plus.

  • Dmitri Kalintsev

    Hi Greg,

    Could you please clarify – the proposed changes to those networks, I assume they did come with clear roll-out and roll-back procedures, which were signed off by the stakeholders (who know how their applications depend on the affected network) and tested in a simulated controlled environment beforehand, didn’t they?

    • http://etherealmind.com Greg Ferro

      Yes, all of that was done. But people were reluctant to approve the change because the perceived impact was a total loss of service if something went wrong. That is, if the upgrade failed the network core would be a complete failure. Of course, this was unreasonable but sometimes there is no reasoning with Change Management and the way they analyse risk.

      • Dmitri Kalintsev

        Fair enough, but I would have thought that the roll-out/roll-back plans would have covered all major disaster scenarios, including “failed upgrade” (for example by keeping a cold stand-by from lab with the original software/configuration, or something).

        What I mean is, the plans would say: here are the possible things that can go wrong. Here’s how we dealt with each one of them and tested it in the lab (including going back) and here’s how much time it will take in each particular case.

        If change control people are not happy with the list of scenarios – fine, suggest more, we’ll go back and cater for them. If they can’t suggest any more and neither can the people who are responsible for the applications (and still they are blocking the change from going ahead) – it’s a justification to escalate the issue to the change control people’s management.

  • Dmitri Kalintsev

    Thinking a little bit more – the example you’ve shown with the core meltdown is a clear-cut case of a catastrophic failure of the very function of change control: to protect business continuity. Depending on the circumstances, it probably is a good enough reason for escalation to the company’s senior management, even after the fact.

    On the other hand, “breaking” it to get it properly fixed, even while the intentions are noble, may turn out to be a disservice to yourself and the company you’re working for in the long run.

    Anyway, just an opinion.

    • http://etherealmind.com Greg Ferro

      We knew there was a problem and put a plan in place to fix it. Change management meant that no one could make a decision to fix it (due to perceived risk) or not fix it (due to known possible failure risk).

      In the end, nothing was done. And it broke, affected SLAs, and the problem was fixed. That’s the joy of change management.

    • http://www.tolya.com Anatoly Gavrilov

      Too many business buzzwords. To be honest, Change control people don’t know anything about networks and they don’t give a shit. But you still need their approval and they can always say: “I don’t want to be blamed for your change, so can you go and get another dozen approvals from people that may use that part of the network.”

      • Dmitri Kalintsev

        > “I don’t want to be blamed for your change, so can you go and get another dozen approvals from people that may use that part of the network.”

        There usually is a set change and risk management process with an established set of stakeholders. If that’s not the case, then such a company has bigger problems than a broken network.

        • http://www.tolya.com Anatoly Gavrilov

          All companies have different problems. I’m just trying to say that Change Management is not a panacea for all ills. When you grow (I’m really talking about large companies with more than 1000 devices in the network) it becomes hard to change anything just because there’s too much bureaucracy.

          • Dmitri Kalintsev

            I understand. The point I’m trying to make is that if a company’s IT infrastructure complexity has outgrown its change control processes, there will be problems.

            I am also suggesting that it is entirely possible to create an effective change control process (or modify an existing one) which can scale to match the complexity of the growing IT infrastructure. Whether or not a particular company is prepared to make an investment to do so is a matter of their business priorities.

  • http://eatyourpets.com/ yeled

    My motto on linkedin is ‘breaking things for the better’.

    • http://etherealmind.com Greg Ferro

      Like that

  • Chris Fabri

    This reminds me (although it’s obviously not the same) of the shade tree mechanic’s tongue-in-cheek credo – fix it ’till it’s broke!

  • http://twitter.com/northlandboy Lindsay Hill

    (Holy thread revival Batman! Came across it while looking for something else)

    This comment is so true: “But fear of making any change can mean that a simple outage can become a major problem because you don’t know how to troubleshoot.”

    Years ago I worked in a telco, where we regularly patched, upgraded, changed systems. So we were used to troubleshooting, and dealing with issues. I later worked at a financial institution, where they never patched or upgraded anything unless they absolutely had to. So a simple reboot got them all up in arms. If one node in an HA pair was down, they would panic and flap for hours. Cue multi-hour conference calls. No-one had any experience dealing with that sort of thing. I’d just shrug, and get on with fixing it.

    I can understand change control. But you have to have solid technical systems, or you end up like that organisation, where the underlying technical pieces were in bad shape, but change control ruled with an iron fist, and no-one bothered fixing anything, because the paperwork was too much hassle. And no-one funded removing old kit. Projects would just keep slapping on more bits and pieces. When I did do some digging around I found multiple routers and servers doing nothing (e.g. a router with a WAN link down for > 12 months). But no-one wanted to remove them, because it was no-one’s responsibility/project, and besides, the paperwork would be too much.

    Glad I don’t have to deal with that any more.