
Approaching Debugging

This is my approach to debugging. There are many like it, but this one is mine 1.

Approaching debugging

Breathe

The first and most important aspect of tackling a thorny problem is to attack it with the right mindset.

I like to approach bugs calmly, with curiosity, and (possibly undue) optimism. Frustration, panic, and pessimism induce tunnel vision and make the problem much harder to fix.

It's just code. You can fix it.

Establish boundaries

Understanding the problem is obviously important, but there's a cost/benefit analysis to do first: how important is this problem, and how much time and how many engineers are you willing to spend to fix it?

Is there a threshold at which it would be better to patch over the problem instead? Or leave it unfixed 2?

Write everything down

Debugging often involves switching between deep depth-first and wide breadth-first searches while diagnosing the issue: writing things down acts as an additional form of memory and quickly brings additional people up to speed.

Most importantly, it can also help highlight what hasn't been tried yet.

Written notes also come in handy after the problem has been solved: for retrospectives, spreading knowledge, and building a collection of war stories.

Define the problem

Make sure you clearly understand and can describe the issue: what's the actual behavior and what's the expected behavior?

There have been times where I've misunderstood complex interactions between many systems to be a bug instead of an acceptable outcome.

Mitigate

All through the process, you must constantly evaluate if / when you need to mitigate the problem, even if you don't quite know the right solution yet.

A common solution is to revert to a known-good state, patch around the problem, or disable a specific feature while the behavior is fixed.

You should consider your confidence in the mitigation, the ease and speed of deployment, the urgency of the problem and the likelihood of finding a "better" fix while making this decision.
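
For example, one common shape for a mitigation is a kill switch that turns off the suspect code path without a full deploy. Here's a minimal Python sketch, assuming a small JSON config file you can push quickly; the flag name, config mechanism, and functions are illustrative, not a specific library's API.

    # Hypothetical sketch: a kill switch that disables a suspect feature
    # without shipping a code change. The flag name, config file, and
    # functions are made up for illustration.
    import json
    import os

    def feature_enabled(name, config_path="flags.json"):
        """Return False if the feature has been switched off in config."""
        if not os.path.exists(config_path):
            return True  # no overrides: default to enabled
        with open(config_path) as f:
            return json.load(f).get(name, True)

    def render_feed(items):
        if feature_enabled("fancy_ranking"):
            # Suspect code path: easy to turn off while debugging.
            return sorted(items, key=lambda item: item["score"], reverse=True)
        # Known-good fallback: leave items in their original order.
        return items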

Find clues

Bugs can have several distinctive characteristics: quickly identifying these can speed up debugging significantly.

Some common tactics to apply:

  • Are there any hints in logs or telemetry?
  • Where does it happen: specific devices, only in production, only in the development environment, everywhere?
  • Who does it occur for? Do the affected users (or whatever unit identifies the bug) share any specific characteristics? Instrumentation can be very valuable here.
  • Does it reproduce consistently? Is it (shudder) a race condition?
  • When did it start happening? What changed around that time?
  • Visualize any available data to look for patterns.

Ideally you have enough instrumentation, telemetry and samples to be able to answer these questions quickly. Alternatively, collecting a small set of examples can significantly reduce the search space.
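
As a concrete example of slicing whatever data you do have, here's a minimal Python sketch that tallies error reports by a few attributes to see whether the bug clusters on a particular device, version, or environment; the file name and record fields are assumptions for illustration.

    # Hypothetical sketch: tally error reports by a few attributes to see
    # whether the bug clusters on a device, app version, or environment.
    # The file name and record fields are assumptions for illustration.
    import json
    from collections import Counter

    def tally(reports, keys=("device", "app_version", "environment")):
        counts = {key: Counter() for key in keys}
        for report in reports:
            for key in keys:
                counts[key][report.get(key, "unknown")] += 1
        return counts

    with open("error_reports.jsonl") as f:
        reports = [json.loads(line) for line in f if line.strip()]

    for key, counter in tally(reports).items():
        # A single device model or version dominating is a strong clue.
        print(key, counter.most_common(3))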

Experiment

Based on the clues you've carefully written down, you might have some intuition of where the problem is potentially happening: this step relies on your understanding of the system.

Another way to build this intuition is to look for the first point at which reality diverges from your mental model of the system: understand why.

Make a hypothesis first, predict what will happen, and go and apply the change. Change one thing at a time to make sure you can reason clearly around cause and effect.

Figure out how to test changes cheaply and have a tight feedback loop. Consistently look for reasons to disprove your hypothesis.
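
One cheap way to get that feedback loop is to capture the hypothesis and its prediction as a tiny reproduction you can rerun after every single change. A minimal sketch, with a made-up parser standing in for the code under suspicion:

    # Hypothetical sketch: the hypothesis and its prediction captured as a
    # tiny, fast reproduction. parse_total and the sample row are made-up
    # stand-ins for the code under suspicion.
    def parse_total(csv_row):
        return sum(int(field) for field in csv_row.split(","))

    def test_trailing_comma():
        # Hypothesis: rows ending in a trailing comma crash the parser.
        # Prediction: this raises ValueError today; after the fix it returns 6.
        assert parse_total("1,2,3,") == 6

    if __name__ == "__main__":
        test_trailing_comma()
        print("trailing commas handled")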

There will be times when multiple tiny bugs come together in wonderfully imaginative ways, and you'll need to apply multiple fixes. Remember to maintain a record of everything that's been tried!

Think laterally

Sometimes, things just don't fit together, or there's just no time to play "scientist" and experiment. You can always try to brute force the solution by treating the system like a black box and looking outside it instead.

By looking at when the problem started happening, you can understand what changed. Look through the commit history; look at the changes in the surrounding systems; look through updates to hosts and related applications. Perhaps something interesting happened in the world, causing load on the service to spike and servers to fall over.

If nothing in change logs or release notes looks out of the ordinary, fall back to binary search. Simply revert the system to a known-good state and walk forward till it starts failing; alternatively, go backwards till it starts working.
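
The bisection itself is mechanical, which is part of why it works even when you don't understand the system. A rough Python sketch of the idea, with check_version standing in for "build or deploy this version and run the reproduction":

    # Hypothetical sketch: bisect over an ordered list of versions (commits,
    # releases, config pushes) to find the first bad one. check_version is a
    # stand-in for "build/deploy this version and run the reproduction".
    def first_bad_version(versions, check_version):
        """Assumes versions[0] is known good and versions[-1] is known bad."""
        good, bad = 0, len(versions) - 1
        while bad - good > 1:
            mid = (good + bad) // 2
            if check_version(versions[mid]):
                good = mid  # still works: the breakage came later
            else:
                bad = mid   # already broken: the breakage is here or earlier
        return versions[bad]

For git history specifically, git bisect run automates exactly this loop given a script that exits non-zero on the bad behavior.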

Step away

If nothing else works, taking a break from the problem and letting your subconscious deal with it can produce really good results.

Distract yourself with other, preferably nontechnical things: go for a walk, move away from the computer, doodle, and recover.

Fix, verify and celebrate!

Once you have an acceptable fix, you need to release it: something I'll let you deal with. Make sure the actual and expected behavior match up again.

Celebrate! It's time to savor victory: recover from the stress and take a break.

Things are right with the world again.

Learn from it!

You aren't done yet.

There's a lot to learn from a complex bug: and a lot to do to make it simpler to deal with the next one.

Build your own knowledge

Learning and becoming more familiar with tools that could have helped debug the problem sooner can speed you up significantly the next time around.

Write a postmortem

For meaningful investigations, writing up what happened and sharing it helps teach others, and can potentially speed them up significantly.

Make it impossible to happen ever again

Add the right regression tests to prevent it from ever happening again. Add the right unit tests to exercise the responsible piece of code carefully.

Alternatively, simply delete the outdated and unused abstraction if possible.
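
As a sketch of what such a regression test can look like, here's a hypothetical pytest case that pins the exact input from the original report against the (equally hypothetical) fixed parser from the earlier sketch:

    # Hypothetical sketch: a regression test pinning the exact input that
    # triggered the bug, so it can never silently return. parse_total is the
    # fixed version of the made-up parser from the earlier sketch.
    import pytest  # assumes the project already runs pytest

    def parse_total(csv_row):
        return sum(int(field) for field in csv_row.split(",") if field.strip())

    @pytest.mark.parametrize("row, total", [
        ("1,2,3,", 6),  # the exact payload from the (hypothetical) original report
        ("", 0),        # edge case discovered while writing the fix
    ])
    def test_parse_total_trailing_comma_regression(row, total):
        assert parse_total(row) == total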

Make it very easy to debug if it does happen again

Some clues might have been much more helpful than others. If you had to debug this again, what signal would have made this trivial to understand and fix?

Add the telemetry, logs, alarms or dashboards that would make this bug a trivial nuisance if it dares to show itself again. 3
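
For instance, here's a minimal sketch of the kind of structured log line that turns the next investigation into a query rather than an archaeology project; the event and field names are assumptions for illustration.

    # Hypothetical sketch: one structured log line carrying the fields that
    # would have made this bug trivial to spot. The event and field names
    # are assumptions for illustration.
    import json
    import logging

    logger = logging.getLogger("feed")

    def log_render_failure(user_id, device, app_version, item_count, error):
        logger.error(json.dumps({
            "event": "feed_render_failed",
            "user_id": user_id,
            "device": device,
            "app_version": app_version,
            "item_count": item_count,
            "error": repr(error),
        }))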


War stories

I've worked on reliability across both facebook.com's web stack and Facebook for Android, chased battery-consuming bugs across the whole suite of Facebook's Android applications, and maintained my share of small services that went bump in the night.

My worst night was probably when I was too exhausted to debug, didn't have anyone to fall back on, and simply set up a watch -n 600 pkill $SERVICE to keep a semi-broken pipeline running through the night till I recovered.

My worst couple of days were spent on a graphics driver bug on an HTC device that wouldn't recycle bitmaps correctly; I never really found a solution, but I was happy enough to prove that it was broken specifically on that device and not a more general issue.

My worst couple of months were spent debugging metrics drops while improving the efficiency of a task scheduler on Facebook for Android. I ended up writing a significant amount of software, including a custom test harness to visualize the scheduler's behavior, before I could figure out what was happening.

I like to believe I've earned my software debugging scars. I'm sure yours are more impressive – and I'd love to hear about strange issues you solved, what you do differently from my approach, and any impressive tricks you used to figure things out: email, Twitter 4.

History

2020-05-30: Published.

1

I've written about this as a single player game, but making it multi-player – particularly with experienced players – can be significantly better for your time to resolution and sanity.

2

Leaving it unfixed is not a decision to be taken lightly: bugs, particularly heisenbugs, compound rapidly and can render a codebase impossible to work with or improve. Incurring technical debt should be a very explicit, conscious decision.

3

I actually use this as an exercise every time I add logging to a system: I imagine debugging an issue in the surrounding code and consider what information would help me. Then I add the logs.

4

I'm too lazy to build a comment system for the blog, but I'll happily copy/paste interesting comments in here.
