The Ancient and Noble Art of Troubleshooting

Troubleshooting: it’s something we all do. It may be your parents’ Internet connection, the company photocopier, your kid’s car or the core network of a medium-sized bank.
I was very fortunate in that I was formally taught how to troubleshoot during my three-year electronics apprenticeship with the Royal Air Force. Every Wednesday afternoon we had ‘Workshop Practices’. Amongst other things, this taught us a method of troubleshooting that could be applied to radio transmitters, airborne radar systems or washing machines. You would expect that, after this period of formal training and thirty years of troubleshooting experience, I would be one of the world’s leading ‘troubleshooters’. However, troubleshooting is not like that. It’s more like football: years of watching and playing do not guarantee that you become a superstar. But with troubleshooting, just like football, the experience does teach you enough to have an opinion on how it should be done.

Always start with basics

In the UK we have two major roadside breakdown agencies for owners of motor vehicles, the AA and the RAC. They attend thousands of breakdowns daily and their success is based on how quickly they can fix a car and get it moving again. When they attend a breakdown and the engine won’t start, they will always ask the same first question, “Is there any fuel in it?”, and then, more importantly, regardless of your answer, they will check that you have fuel. There’s an important lesson in there somewhere…

Just give me the facts

Our Service Desk engineers deal with a lot of different customers, ranging from telephone receptionists complaining about their ARC console, to CCIE trainees struggling to get firewall rules working, to network managers learning for the first time that Spanning Tree is enabled by default for a reason. The variable factor amongst all these scenarios is the amount of information the ‘victim’ has for us. It ranges from a ‘sh tech’ on half-a-dozen network devices to simple two-word messages such as “It’s broken”, although the word ‘broken’ is often replaced by more emotive language. Contrary to popular opinion, the first thing to ascertain is not “what is it that’s broken?”. The first thing to understand is “what is it you could do then that you can’t do now?”. The reason for this is so you can gather facts and not opinion, rumour, hearsay or gossip.

There’s a reason we are all taught the 7-layer OSI model.

Is there?… Yep, it’s to give us a starting point for our troubleshooting deliberations. Start at Layer 1, physical connectivity. I have lost count of the number of times I’ve seen people go off looking at memory leaks, software bugs and device configurations when the actual problem is caused by something, somewhere, being disconnected. Even when connectivity ‘appears’ to be there, e.g. poor Internet performance, it may be that the primary link is down and everything is going via an oversubscribed backup link. Once physical connectivity is confirmed, work your way up the layers… and yes, I include the well-known unofficial eighth layer of the model.
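
As a rough illustration of working up the layers, here is a minimal Python sketch. It assumes each check is a small function that returns True when that layer looks healthy. The function name, the check descriptions and the canned results are all hypothetical; real checks would inspect link state, ping a gateway, open a TCP connection and so on.

```python
from typing import Callable, List, Optional, Tuple

# Each check is (OSI layer number, description, function returning True when healthy).
Check = Tuple[int, str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[Tuple[int, str]]:
    """Run the checks from the lowest layer upwards and stop at the first failure."""
    for layer, description, is_healthy in sorted(checks, key=lambda c: c[0]):
        if not is_healthy():
            return layer, description
    return None  # every check passed


# Hypothetical, hard-coded results for illustration only.
example_checks: List[Check] = [
    (3, "default gateway answers ping", lambda: False),
    (1, "primary link is up", lambda: False),
    (2, "switch port is in the expected VLAN", lambda: True),
]

result = first_failing_layer(example_checks)
print(result)
```

Both the Layer 1 and the Layer 3 checks fail in this example, but only the Layer 1 failure is reported, because there is little point investigating routing while the link is down.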

If you can’t fix it, then bodge it.

OK, perhaps what I actually mean is ‘implement a workaround’, but I always like to grab people’s attention when I have something important to say. A lot of people forget the ‘victims’ of the network problems we get called in to fix; they forget that what the customer wants more than anything is to have service restored. So if that involves a crafty bit of re-routing or the temporary removal of a resilience feature, then I’m sure they’ll be happy if it means their business can get back to making money. We can then return in the ‘dead of night’ to fix it properly.
Identify which are the symptoms and which is the cause. There are two good reasons for this. Firstly, a lot of time is wasted fixing symptoms while the root cause goes ignored. Secondly, only when you understand, document and fix the root cause will you be able to sleep peacefully that night.

Beware of strangers claiming to know ‘quick fixes’.

There is a group of people in every organization who offer advice such as “the IOS on that box was upgraded a week ago” or “I had a memory leak on that model of router in my last job” or, most annoyingly, pearls of wisdom about making changes to ACLs or removing lines of config to see if that will fix it. Remember, randomly changing things only leads to random chances of fixing the problem. Whilst I’m a big believer in asking ‘what was the last thing to be changed?’, I don’t automatically assume that it will provide the answer to the problem. If that were the case, then our friends in the AA and RAC would always be checking the wheels of motor cars whose engines don’t start.

Know what the network looks like when it’s not broken.

I know this is not always possible, but it’s a very lonely feeling sitting at a router console at 2am wondering if the little 2960 switch on the far edge of the network has always been the Root Bridge, or speculating if inter-site traffic is supposed to go via the 2Mb SDSL link rather than the high-speed links via head-office.
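
One way to avoid that 2am guesswork is to keep a snapshot of the healthy network and compare against it. The Python sketch below assumes the snapshot is a simple dictionary of facts recorded by hand or by a script. The key names, device names and the `diff_from_baseline` helper are all hypothetical.

```python
from typing import Dict, Tuple


def diff_from_baseline(
    baseline: Dict[str, str], current: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {fact: (baseline value, current value)} for every fact that differs."""
    missing = "<absent>"
    keys = sorted(set(baseline) | set(current))
    return {
        key: (baseline.get(key, missing), current.get(key, missing))
        for key in keys
        if baseline.get(key, missing) != current.get(key, missing)
    }


# Hypothetical facts recorded while the network was known to be healthy.
baseline = {"stp_root_bridge": "core-sw-1", "inter_site_path": "head-office"}
# Hypothetical facts gathered during the incident.
current = {"stp_root_bridge": "edge-2960", "inter_site_path": "head-office"}

changes = diff_from_baseline(baseline, current)
for fact, (was, now) in changes.items():
    print(f"{fact}: was {was}, now {now}")
```

A difference from the baseline is a fact worth investigating, not automatically the cause of the problem.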

Documentation, documentation, documentation.

There is a rule somewhere that says the longer the troubleshooting lasts, the more documentation should be produced. In the past I’ve had a problem escalated to me where, following two days of intense troubleshooting by a CCIE and a CCNP, the only documented evidence of their efforts they could produce was an ARP table with a single entry highlighted. Make sure you document the symptoms, what you do, what you eliminate, what you change, when you change it and whether the symptoms change.
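
The list above maps naturally onto a timestamped log. The following Python sketch is one possible shape for it, assuming four entry kinds taken from that list. The class names, field names and sample notes are hypothetical, and a shared text file or ticket would do the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

ALLOWED_KINDS = {"symptom", "action", "eliminated", "change"}


@dataclass
class LogEntry:
    kind: str                       # one of ALLOWED_KINDS
    note: str                       # what was seen, done, ruled out or changed
    symptoms_changed: bool = False  # did the symptoms change afterwards?
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class TroubleshootingLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, kind: str, note: str, symptoms_changed: bool = False) -> LogEntry:
        """Append a timestamped entry, rejecting unknown entry kinds."""
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown entry kind: {kind!r}")
        entry = LogEntry(kind, note, symptoms_changed)
        self.entries.append(entry)
        return entry

    def changes(self) -> List[LogEntry]:
        """Everything that was changed, i.e. everything that may need reverting."""
        return [e for e in self.entries if e.kind == "change"]


# Hypothetical incident notes for illustration.
log = TroubleshootingLog()
log.record("symptom", "Users at site B cannot reach the file server")
log.record("eliminated", "Site B access switch uplink is up and error-free")
log.record("change", "Moved site B traffic to the backup link", symptoms_changed=True)

for entry in log.entries:
    print(entry.when.isoformat(), entry.kind, entry.note)
```

Keeping changes as their own entry kind makes it easy to list exactly what has to be put back once the root cause is fixed.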

One of the biggest dangers in troubleshooting is introducing a second problem before fixing the first.

Always have a plan. Why? Can’t we just wander along in a random, haphazard way, changing things that take our fancy, rebooting things several times, not telling people and just changing some of it back if it doesn’t work?… No. If you do that, you should be working in programming.

Know when to escalate.

There will be times when you can’t fix the problem. I know that is particularly difficult to accept, especially for a CCIE, but it is possible. It’s always better to escalate earlier rather than later.

Tiger teams.

I do hate that phrase; however, with large service-impacting problems that can’t be fixed quickly, calling in experts from different teams with different disciplines is the next logical thing to do. Always, always make sure someone is appointed as the team leader and that they are capable of showing leadership. Work together as a team and develop a plan. Keep management informed and don’t forget to look up the number of the local Domino’s Pizza.

Apologies if you got to the bottom of this blog expecting to find a list of my favourite tools to troubleshoot with, or a list of Silver Bullets collected over the years to be applied to troublesome networks and their configurations. If that’s what you want, then I would suggest starting at http://packetlife.net/armory/, which has a comprehensive list of tools. But to be honest, all the troubleshooting tools in the world will be wasted if you don’t follow the basic rules. As I was taught on those Wednesday afternoons thirty years ago, it’s not what you use, it’s how you use it.

Footnote

Kevin recorded a podcast that covered some of these topics. You can find the podcast at Packet Pushers – Runt Packet No 5 – A Technical Services Manager Speaks Out

  • http://www.xdroop.com/404.html David Mackintosh

    Excellent article. I think though you mean you were “formally” taught how to troubleshoot.

    • Kevin Bovis

      yep, sorry, dead right.

  • Jack

    Does Etherealmind’s font look really bad to anyone else? Looks OK in IE and Safari but it’s awful in Firefox.

    • Kevin Bovis

      Is this a troubleshooting test? If so, I forgot to mention that other necessity of troubleshooting… delegation.. Over to you Greg.

      • http://etherealmind.com Greg Ferro

        That’s because Firefox hasn’t got good support for HTML5 and font downloads. Time to use a better browser until Firefox can catch up.

  • http://www.certsportal.com/cisco/ccnp.html Karl Taylor

    The font in Firefox is no doubt awful.
    But it’s a well-written article; in my view, troubleshooting itself is an “art”.

  • Tim Smith

    FYI, I did my own brief ‘troubleshooting’ on the ugly fonts. Disallowing Javascript from typekit.com (with the handy NoScript Firefox addon that you really should be using) returned the fonts to a regular low ugliness level.

    • http://etherealmind.com Greg Ferro

      I’m using Typekit for CSS3 based fonts that use HTML 5 and they look fantastic in a modern browser and operating system.

      You should consider upgrading your OS or Web Browser if they don’t support HTML5 standards.

  • Tim Smith

    Looks like it might be an MSTSC issue with font smoothing. Using Firefox 3.6.8, it looks fine running locally on a Windows 7 client but god-awful over RDP using the MS terminal services client.

    See http://users.on.net/~timsmith/font-broken.png

    So if your terminal is somewhat broken like mine, a quick fix is to disallow javascript from typekit.com so you don’t see the fancy CSS fonts.

  • Khan

    Views a lot better now... in IE 8!! Feeling so good!!

  • http://aconaway.com Aaron

    Good stuff, Kevin.

    My good buddy is a paramedic, and he has great stories about people not providing the right information for troubleshooting. The best one is a call about a man having severe chest pains. They rush to get the guy and prep for a heart attack case. When they show up, the man has a 4″ knife sticking out of his chest. While it’s true he is having chest pains, it would have been a little better to mention the knife.

  • SuSi

    Very good article, but one thing is missing: it can be a double fault.
    Nothing is worse than finding two faults with the same (or nearly the same) behaviour.
    A simple example: a server no longer boots via PXE because the DHCP server is down AND the VLAN config is wrong on the downlink.
    You fix the VLAN config, but it is still not working.
    NOW you change it back (because it is still not working) -> very bad idea.