Saturday, February 12, 2022

Troubleshooting



Last week I faced a moment of frustration. And not the good "why isn't it working" kind of frustration. That kind I'm used to, and I normally end up learning something. This time I was frustrated with the people I work with. The story is quite simple - we had (yet another) problem in our systems, and they set out to investigate it. I was busy with a super-urgent-clock-is-ticking kind of task, so I wasn't available to help with this investigation. I did try to ask some guiding questions, such as "what error do you see in the log?" or "can you reproduce it on your system?", but other than that I was doing my best not to get involved.

After they had been struggling with the problem for a while, my manager asked me to time-box 20 minutes for it, as it was blocking most of the team. After checking that the urgent task could wait that long, I took a look. Then I got upset. Two people had been looking at this, and the best they could come up with was to quote a partial error and the step that was failing. No guess of what could have happened, no narrowing down of the problem, just a simple "it fails with this message". Yet, when I took a look, it took less than 30 seconds to figure out what was wrong, then perhaps 15 more minutes to find a workaround that actually worked.

I reflected a bit on this negative feeling - somewhere between disappointment and annoyance - and figured out why I was so upset, which helped me notice something I hadn't seen before.

I was upset because I always assume that the people I'm working with are smart and capable people who are doing the best they can, and any contrary example touches a raw nerve for me and makes me wonder why I bother investing so much time and effort trying to collaborate with them instead of working individually. Then, after processing it a bit more and recalling the fundamental attribution error, I could say that it's probably not that the people who failed at a task I found trivial aren't smart or don't try their best; it's more likely that there are some other factors I'm not aware of that make this behavior reasonable. Both of them had other tasks putting pressure on them, and both are fairly inexperienced - between them they have less than 18 months of experience. In addition, it reminded me that troubleshooting is a skill that needs practice and learning, which prompted this post - I want to share the way I approach troubleshooting, hoping it might help people.

The first thing worth noticing about troubleshooting is that almost anyone involved in software development needs to do it quite a lot - programmers, testers, CS, Operations, and whatever you call the team managing your pipelines. The second thing worth noticing is that it looks a lot like bug investigation, so being a better troubleshooter will make you a better tester as well. In fact, the main difference between troubleshooting and bug investigation is the goal we have: troubleshooting is about making a problem go away, or at least finding a way to do our task around it, whereas bug investigation is more about understanding the cause and potential impact of the problem, so if a bug just flickers away we'll still hunt it down.

So, how do I troubleshoot? Here's a short guide:

  1. Is it obvious? Sometimes the errors I get or the symptoms I experience are detailed enough that no investigation is actually needed. I can easily tell what has happened and skip directly to fixing or finding a workaround.
  2. Can I reproduce it? Does it happen again? If not - great, problem gone. It might come back later, in which case I might try a bit harder to reproduce it or trace its cause, but usually a first-time problem that isn't easily reproducible doesn't really need shooting. I skip to "done for now" and continue with whatever it is that needs doing.
  3. Create a mental model - what was supposed to happen? Which components are talking with which systems? What relevant environmental factors should be considered? What has recently changed?
  4. Investigation loop:
    1. Gather information. Google the error message or symptom, gain visibility into the relevant flow, ask around whether anyone has seen such a problem, etc.
    2. Hypothesize. Guess what might be causing the problem. 
    3. Try to disprove the hypothesis: 
      1. Create a minimal reproduction of the problem
      2. Find contrary evidence in log files, side effects, etc.
    4. Tweak. Based on my current guesses, try to work around or mitigate the cause of failure. I suspect a code change? I'll revert to a previous state. Server can't be reached? I'll tweak in order to gain more information: I might check ping, or DNS resolution.
    5. Check to see if the problem has gone away. If so - update the theory of what happened and finish.
    6. Update and narrow the model. Using the information I gained, zoom in on the relevant part of the model and elaborate on it. For example, a model starting with "I can't install the product" might narrow to "I have remnants from a faulty uninstall that are preventing a critical operation" or to "the installation requires an active internet connection and the computer has been unplugged". It can be more complicated than that.
    7. If I can't narrow down the model, or can't come up with a next theory of what might be wrong, I still have two options:
      1. Hail Mary - I'll change a random thing that I don't expect to help but is related in some way. For instance, I might follow instructions on the internet to find a relevant configuration change, or reboot the system. Who knows? I might be lucky and gain more information, or even make the problem go away for a while. 
      2. Ask for help. Find someone who might have more knowledge than me, or just a fresh perspective, and share my model, failed attempts and guesses I couldn't or didn't act upon, and we'll look at the problem with that person's knowledge and tools. 
  5. Now we know what's wrong, or at least we're confident enough that we know, so it's time to shoot down the problem. Find a way to configure the tool that was causing problems, change the way we operate, sanitize our environment, or whatever will work to our satisfaction.
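
For readers who think better in code, the investigation loop in step 4 can be roughly sketched as below. This is only an illustration of the flow, not a tool I actually use: the names `Hypothesis` and `troubleshoot`, and the toy "system" in the usage part, are all made up for this example, and real checks would read logs, ping a server, or revert a commit instead of flipping a dictionary value.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    """A guess about the cause (hypothetical structure, for illustration)."""
    description: str
    # Returns True when contrary evidence is found, i.e. the guess is disproved.
    disproved: Callable[[], bool]
    # Applies a tweak or workaround aimed at the suspected cause.
    tweak: Callable[[], None]


def troubleshoot(
    problem_persists: Callable[[], bool],
    hypotheses: List[Hypothesis],
) -> Optional[str]:
    """Hypothesize, try to disprove, tweak, then check whether the problem is gone.

    Returns the surviving theory once the problem goes away, or None when
    the guesses run out (time for a Hail Mary or for asking for help).
    """
    if not problem_persists():
        return "not reproducible, done for now"
    for guess in hypotheses:
        if guess.disproved():
            continue  # contrary evidence found, move on to the next guess
        guess.tweak()
        if not problem_persists():
            return guess.description
    return None


# Toy usage: the "server can't be reached" because the name does not resolve.
system = {"dns_fixed": False}
guesses = [
    Hypothesis(
        "a recent code change broke the call",
        disproved=lambda: True,  # pretend the log shows the failure predates the change
        tweak=lambda: None,
    ),
    Hypothesis(
        "the server name does not resolve",
        disproved=lambda: False,
        tweak=lambda: system.update(dns_fixed=True),
    ),
]
result = troubleshoot(lambda: not system["dns_fixed"], guesses)
print(result)
```

Getting `None` back is the interesting case: it means the list of guesses is exhausted without the problem going away, which is exactly where the Hail Mary and asking for help come in.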

That's it. I hope this flow will be helpful, at least to some extent. If you have additional tips on troubleshooting - I'd be happy to hear about them. 
