Over the years I’ve wondered what it is that skilled IT folks are doing to solve issues. There are groups of people who seem to have wizard-like abilities to solve issues. In the past I’ve often wondered how much I would need to learn to have this mythical gift they seemed to possess. I’ve been in IT for about 8 years or so and I’ve never felt like I’ve known enough. But something interesting did change since early in my career. I’ve definitely learned more and have the wisdom experience brings. I’m able to solve things fairly regularly and with a high degree of accuracy. But up until recently, I had no idea why exactly this was.
I’ve taken some time to think about how exactly those of us in the business are solving issues: how it is that, sometimes with very little knowledge of the technologies we work with, we’re still able to quickly diagnose and troubleshoot these issues. So, here’s a list of 7 steps that I’d like to call the 7 Steps of IT Troubleshooting.
Step 1. Get situational awareness.
Take time to understand the issue being reported. Gather information and understand how it occurs. Have workflows, diagrams, or other documentation on the system you are troubleshooting. Have asset information ready (computer names, the name of the person who reported it, etc.). This step seems pretty obvious, but it’s amazing how often it gets missed. For example, you may get a report that the internet is down, only to open up a browser and get to a website just fine. Upon investigation, you may find out that only a single person’s computer couldn’t reach the internet, or that only a specific website was unreachable.
Users don’t usually speak in technical terms and probably wouldn’t need your help if they had a technical understanding of the issue they are experiencing. Take time to ask probing questions and do your best to reproduce the problem. Avoid anecdotal evidence as much as you can. This is where good alerting and monitoring systems can remove humans from the equation. Let the robots do robot work and tell us what’s wrong when possible.
Again, this step might seem obvious but it does get missed from time to time. This is easily the most important part of the troubleshooting process and deserves as much time and attention as possible to minimize the amount of research we’ll need to do later.
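To make the “internet is down” example a bit more concrete, here’s a minimal sketch of how I might let a script do the first round of checking before I take a report at face value. The function names (tcp_probe, summarize_reachability) and the choice of port 443 are my own assumptions for illustration, not part of any particular monitoring product.

```python
import socket
from typing import Callable, Iterable


def tcp_probe(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def summarize_reachability(
    targets: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> str:
    """Probe each target once and describe how widespread the failure is."""
    results = {target: probe(target) for target in targets}
    if not results:
        raise ValueError("no targets given")
    failed = sorted(t for t, ok in results.items() if not ok)
    if not failed:
        return "all targets reachable"
    if len(failed) == len(results):
        return "no targets reachable"
    return "unreachable: " + ", ".join(failed)
```

Running this from the affected computer and again from a known-good one against the same handful of sites tells you quickly whether you are looking at one website, one machine, or something bigger. The probe is passed in as a parameter so it can be swapped for a different kind of check.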
Step 2. Try a quick fix.
Admittedly, this step is one of those things that you will really only be successful at with experience. The most common example of the quick fix would be something like “reboot your computer”. However, I do want to caution that quick fixes are not always low impact fixes, and we need to take time to understand the impact of anything we do in our environments. I could easily write a simple script that reboots all of the company’s servers to correct an application issue, but obviously this would have a massive impact on the people who depend on these systems. Even rebooting a user’s computer might seem trivial, but if this is your CEO’s computer and important unsaved company documents are currently open, you may have just created a resume-generating event. Steer towards low impact quick fixes. Now, if the quick fix did not work, we move on to…
Step 3. Diagnostics.
If this is a Windows computer or server, check the event logs. If nothing relevant is in the event logs, keep in mind that applications will sometimes create a separate log file on the disk of the system you are troubleshooting. If you have application health monitoring systems, check those. The point of this step is to gather as much high level health information as possible to make an educated guess on what the issue might be. Something important is worth pointing out here, though: logs don’t happen magically. Applications are not magically aware of problematic output or unusual behaviour. Almost every log you have ever seen was developed by a person who correctly predicted you would run into the issue you are currently experiencing. That means a logged issue was common enough to merit development time by the application developer, and there are always going to be issues that they did not expect and therefore did not log. If by the end of this step you still have not found a solution, then off we go to…
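For those separate application log files, a small script can do the first pass of skimming for you. This is only a sketch under my own assumptions: the function name suspicious_lines is made up, and the keyword list is a guess at common severity labels, so adjust it to whatever your application actually writes.

```python
import re
from collections import deque
from typing import Iterable, List

# Assumed severity keywords; real applications may use different labels.
SEVERITY = re.compile(r"\b(ERROR|CRITICAL|FATAL|WARN(?:ING)?)\b", re.IGNORECASE)


def suspicious_lines(lines: Iterable[str], limit: int = 20) -> List[str]:
    """Return up to the last `limit` lines that mention a severity keyword."""
    hits = (line.rstrip("\n") for line in lines if SEVERITY.search(line))
    # A bounded deque keeps only the most recent matches.
    return list(deque(hits, maxlen=limit))
```

Since an open file is an iterable of lines, something like suspicious_lines(open(path, encoding="utf-8", errors="replace")) should work on a plain text log. It only finds what the developer chose to log, which is exactly the limitation described above.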
Step 4. Comparisons.
So at this point, we have a good understanding of the issue. We’ve tried something simple and couldn’t find relevant logs with information about the issue we are troubleshooting. So what do we do? Well, all hope is not lost. Start running some tests. Does the issue occur at the same time every day? Does it matter who is executing the workflow, or does the issue affect everyone? If we swap between desktops, is the issue still there? If you have similar environments (like a test or development environment) or other comparable working systems, the trick here is to simply find what’s different about the problematic system. However, if you still cannot find a solution, then we go to the dreaded step 5.
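The “find what’s different” trick is easy to sketch in code. Assuming you can collect the same set of facts from both a working system and the broken one (the setting names below are invented for illustration), a comparison can be as simple as this:

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # placeholder for a setting present on only one system


def diff_settings(
    working: Dict[str, Any], broken: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing setting to its (working, broken) pair of values."""
    diffs = {}
    for key in sorted(working.keys() | broken.keys()):
        good = working.get(key, MISSING)
        bad = broken.get(key, MISSING)
        if good != bad:
            diffs[key] = (good, bad)
    return diffs


# Hypothetical facts gathered from two desktops.
working_pc = {"os_build": "19045", "dns": "10.0.0.1"}
broken_pc = {"os_build": "19045", "dns": "10.0.0.9", "proxy": "enabled"}
print(diff_settings(working_pc, broken_pc))
```

Everything that matches drops out, and what’s left is a short list of suspects to test one at a time.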
Step 5. Research.
This step is usually an exercise in your Google-fu. Being able to effectively find information is going to be a very important skill at this step. Funnily enough, if you google “how to google”, you can find some pretty useful infographics and articles on effective search engine strategies. Just to be clear, you can also use technical books, documentation, or in-house knowledge bases. I would also be very mindful of how much time you spend at this step. Keep in mind that you are not alone, and if you are still struggling to find solutions at this point, we move on to step 6.
Step 6. Escalate.
Let’s face it, we couldn’t figure it out. But that’s OK! We don’t always have the answers to everything, but we have to know when to let things go. I think there’s an inherent need for all of us to be the heroes of the department. To be the ones who can say “I have figured this out!” There’s nothing wrong with reaching out for help though. Reach out to higher tier support if available. If not, reach out to the appropriate vendors for support. Sometimes even just talking the issue out can help in finding those “Aha!” moments.
Step 7. Document and/or Automate.
We did it! We figured it out and solved the issue! But there’s still one more step. We’ve got to document how we resolved the issue. Let’s do what we can to avoid solving the same issues over and over again. We also want to spread the knowledge to other teams to assist them in solving these issues in the future.
Automate where you can as well. This might mean writing some scripts for support staff to speed up fixing issues, updating base images, or creating new GPOs to prevent these issues.