Troubleshooting: it's something we all do. It may be your parents' Internet connection, the company photocopier, your kid's car or the core network of a medium-sized bank.
I was very fortunate in that I was formally taught how to troubleshoot during my three-year electronics apprenticeship with the Royal Air Force. Every Wednesday afternoon we had 'Workshop Practices'. Amongst other things, this taught us a method of troubleshooting that could be applied to radio transmitters, airborne radar systems or washing machines. You would expect that, following this period of formal training and thirty years of troubleshooting experience, I would be one of the world's leading 'troubleshooters'. However, troubleshooting is not like that. It's more like football: years of watching and playing do not guarantee to make you a superstar. With troubleshooting, though, just like football, it does teach you enough to have an opinion on how it should be done.
Always start with basics
In the UK we have two major roadside breakdown agencies for owners of motor vehicles, the AA and the RAC. They attend thousands of breakdowns daily, and their success is based on how quickly they can fix a car and get it moving again. When they attend a breakdown and the engine won't start, they will always ask the same first question, "is there any fuel in it?", and then, more importantly, regardless of your answer, they will check that you have fuel. Important lesson there somewhere...
Just give me the facts
Our Service Desk engineers deal with a lot of different customers, ranging from telephone receptionists complaining about their ARC console, to CCIE trainees struggling to get firewall rules working, to network managers learning for the first time that Spanning Tree is enabled by default for a reason. The variable factor in all these scenarios is the amount of information the 'victim' has for us. It ranges from a 'sh tech' on half-a-dozen network devices to simple two-word messages such as "It's broken", although the word 'broken' is often replaced by more emotive language. Contrary to popular opinion, the first thing to ascertain is not "what is it that's broken?". The first thing to understand is "what is it you could do then that you can't do now?". The reason for this is so you can gather the facts and not opinion, rumour, hearsay or gossip.
There's a reason we are all taught the 7-layer OSI model.
Is there? Yep, it's to give us a starting point for our troubleshooting deliberations. Layer 1: physical connectivity. I have lost count of the number of times I've seen people go off looking at memory leaks, software bugs and device configurations when the actual problem is caused by something, somewhere, being disconnected. Even when connectivity 'appears' to be fine, e.g. the complaint is poor Internet performance, it may be that the primary link is down and everything is going via an oversubscribed backup link. Once physical connectivity is confirmed, work your way up the layers... and yes, I include the well-known unofficial eighth layer of the model.
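For anyone who prefers to see the bottom-up idea written down, here is a minimal sketch in Python. The check functions are hypothetical stand-ins (simple lambdas returning True or False); in real life they might look at link state, ping a gateway or open a TCP connection. The only point being made is the order: lowest layer first, stop at the first thing that fails.

```python
from typing import Callable, List, Optional, Tuple

# Each entry is (OSI layer number, description, check function).
# The check functions here are hypothetical placeholders, not real probes.
Check = Tuple[int, str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[Tuple[int, str]]:
    """Run the checks from the lowest layer upwards; return the first failure."""
    for layer, description, check in sorted(checks, key=lambda c: c[0]):
        if not check():
            return layer, description
    return None  # every check passed


# Deliberately listed out of order: the function sorts by layer number.
example_checks: List[Check] = [
    (3, "default gateway answers ping", lambda: True),
    (1, "cable connected and link up", lambda: False),
    (7, "application responds", lambda: False),
]

print(first_failing_layer(example_checks))
```

In this made-up example the application is also failing, but the sketch reports Layer 1 first, which is where you would want to start looking.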
If you can't fix it, then bodge it.
Ok, perhaps what I actually mean is 'implement a workaround', but I always like to grab people's attention when I have something important to say. A lot of people forget the 'victims' of the network problems we get called in to fix; they forget that what the customer wants more than anything is to have service restored. So if that involves a crafty bit of re-routing or the temporary removal of a resilience feature, then I'm sure they'll be happy if it means their business can get back to making money. We can then return in the 'dead of night' to fix it properly.
Identify what are the symptoms and what is the cause.
Two good reasons for this. Firstly, a lot of time is wasted fixing symptoms while the root cause goes ignored. Secondly, only when you understand, document and fix the root cause will you be able to sleep peacefully that night.
Beware of strangers claiming to know 'quick fixes'.
There is a group of people in every organization who offer advice such as "the IOS on that box was upgraded a week ago" or "I had a memory leak on that model of router in my last job" or, most annoyingly, offer pearls of wisdom about making changes to ACLs or removing lines of config to see if that will fix it. Remember, randomly changing things only leads to random chances of fixing the problem. Whilst I'm a big believer in asking 'what was the last thing to be changed?', I don't automatically assume that it will provide the answer to the problem. If that were the case, then our friends in the AA and RAC would always be checking the wheels of cars whose engines won't start.
Know what the network looks like when it's not broken.
I know this is not always possible, but it's a very lonely feeling sitting at a router console at 2am wondering if the little 2960 switch on the far edge of the network has always been the Root Bridge, or speculating whether inter-site traffic is supposed to go via the 2Mb SDSL link rather than the high-speed links via head office.
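One way to make that 2am feeling less lonely is to write down a few known-good facts while the network is healthy and compare against them later. The Python below is only a sketch of that idea: the keys, device names and values are invented for illustration, and how you would actually collect them from your devices is left out entirely.

```python
from typing import Dict, Tuple


def diff_from_baseline(
    baseline: Dict[str, str], current: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return every item whose value differs, as {item: (baseline, current)}."""
    changes: Dict[str, Tuple[str, str]] = {}
    for key in sorted(set(baseline) | set(current)):
        before = baseline.get(key, "<missing>")
        now = current.get(key, "<missing>")
        if before != now:
            changes[key] = (before, now)
    return changes


# Hypothetical snapshot taken when everything was working.
known_good = {
    "stp_root_bridge": "core-sw-1",
    "inter_site_path": "head-office links",
}

# Hypothetical snapshot taken during the incident.
tonight = {
    "stp_root_bridge": "edge-2960",
    "inter_site_path": "head-office links",
}

for item, (before, now) in diff_from_baseline(known_good, tonight).items():
    print(f"{item}: was {before!r}, now {now!r}")
```

With a baseline like this, the question "has that switch always been the Root Bridge?" becomes a lookup rather than a guess.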
Documentation, documentation, documentation.
There is a rule somewhere that says the longer the troubleshooting lasts, the more documentation should be produced. In the past I've had a problem escalated to me where, following two days of intense troubleshooting by a CCIE and a CCNP, the only documented evidence of their efforts they could produce was an ARP table with a single entry highlighted. Make sure you document the symptoms, what you do, what you eliminate, what you change, when you change it and whether the symptoms change.
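A notebook and a pen will do the job perfectly well, but if you would rather keep the record in a script, the list above maps onto a very small data structure. This is a sketch only: the class names, the four entry kinds and the example notes are my own assumptions for illustration, not a standard of any sort.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

# The entry kinds mirror the list in the paragraph above (an assumed naming).
KINDS = ("symptom", "action", "eliminated", "change")


@dataclass
class LogEntry:
    kind: str
    note: str
    # Timestamp each entry automatically so "when you change it" is recorded.
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class TroubleshootingLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, kind: str, note: str) -> LogEntry:
        """Append a timestamped entry, rejecting unknown kinds."""
        if kind not in KINDS:
            raise ValueError(f"unknown entry kind: {kind}")
        entry = LogEntry(kind, note)
        self.entries.append(entry)
        return entry

    def changes(self) -> List[LogEntry]:
        """Everything that was changed, in order, so it can be backed out."""
        return [e for e in self.entries if e.kind == "change"]


# Hypothetical incident notes.
log = TroubleshootingLog()
log.record("symptom", "users at branch site report slow Internet")
log.record("action", "checked primary WAN link status")
log.record("eliminated", "backup link is up and passing traffic")
log.record("change", "re-seated cable on primary WAN interface")

for entry in log.changes():
    print(entry.when.isoformat(), entry.note)
```

Keeping the changes separate from everything else also gives you a ready-made list of what to undo if the symptoms get worse.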
One of the biggest dangers in troubleshooting is introducing a second problem before fixing the first.
Always have a plan. Why? Can't we just wander along in a random, haphazard way, changing things that take our fancy, rebooting things several times, not telling people and just changing some of it back if it doesn't work? No. If you do that, you should be working in programming.
Know when to escalate.
There will be times when you can't fix the problem. I know that is particularly difficult to accept, especially for a CCIE, but it is possible. It's always better to escalate early rather than late.
I do hate that phrase; however, with large service-impacting problems that can't be fixed quickly, calling in experts from different teams with different disciplines is the next logical thing to do. Always, always make sure someone is appointed as the team leader and that they are capable of showing leadership. Work together as a team and develop a plan. Keep management informed and don't forget to check out the number of the local Domino's Pizza.
Apologies if you got to the bottom of this blog expecting to find a list of my favourite tools to troubleshoot with, or a list of Silver Bullets collected over the years to be applied to troublesome networks and their configurations. If that's what you want, then I would suggest starting at http://packetlife.net/armory/ which has a comprehensive list of tools. But to be honest, all the troubleshooting tools in the world will be wasted if you don't follow the basic rules. As I was taught on those Wednesday afternoons thirty years ago, it's not what you use, it's how you use it.
Kevin recorded a podcast that covered some of these topics. You can find the podcast at Packet Pushers – Runt Packet No 5 – A Technical Services Manager Speaks Out