#+TITLE: Approaching Debugging
This is my approach to debugging. There are many like it, but this one is mine.
* Approaching debugging
The first and most important aspect of tackling a thorny problem is to
attack it with the right mindset.
I like to approach bugs calmly, with curiosity, and (possibly
undue) optimism. Frustration, panic, and pessimism induce tunnel
vision and make the problem much harder to fix.
It's just code. You can fix it.
** Establish boundaries
Understanding the problem is obviously important, but there's a
cost/benefit analysis to do first: how important is this problem, and
how much time and how many engineers are you willing to spend to fix it?
Is there a threshold at which it would be better to patch over the
problem instead? Or leave it unfixed?
** Write everything down
Debugging often involves switching between deep depth-first and wide
breadth-first searches while diagnosing the issue: writing things down
acts as an additional form of memory and quickly brings additional
people up to speed.
Most importantly, it can also help highlight what hasn't been tried yet.
Written notes also come in handy after the problem has been
solved: for retrospectives, spreading knowledge, and building a
collection of war stories.
** Define the problem
Make sure you clearly understand and can describe the issue:
what's the actual behavior and what's the expected behavior?
There have been times where I've misunderstood complex interactions
between many systems to be a bug instead of an acceptable outcome.
All through the process, you must constantly evaluate if / when you
need to mitigate the problem, even if you don't quite know the right fix yet.
A common solution is to revert to a known-good state, patch around the
problem, or disable a specific feature while the behavior is fixed.
You should consider your confidence in the mitigation, the ease and
speed of deployment, the urgency of the problem and the likelihood of
finding a "better" fix while making this decision.
** Find clues
Bugs can have several distinctive characteristics: quickly identifying
these can speed up debugging significantly.
Some common tactics to apply:
- Are there any hints in logs or telemetry?
- Where does it happen: specific devices, only in production, only in
the development environment, everywhere?
- Who does it occur for? Are there any characteristics specific to the
affected users (or whatever unit identifies the bug)? Instrumentation can
be very valuable here.
- Does it reproduce consistently? Is it (shudder) a race condition?
- When did it start happening? What changed around that time?
- Visualize any available data to look for patterns.
Ideally you have enough instrumentation, telemetry and samples to be
able to answer these questions quickly. Alternatively, collecting a
small set of examples can significantly reduce the search space.
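As a minimal sketch of narrowing the search space with a handful of samples, the snippet below counts failures per attribute value. The sample records, the field names (=device=, =env=, =failed=) and the helper =failure_rate_by= are all hypothetical; in practice the samples would come from whatever logs or telemetry you have.

```python
from collections import Counter


def failure_rate_by(samples, key):
    """Return {value: (failures, total)} for one attribute of the samples."""
    totals, failures = Counter(), Counter()
    for sample in samples:
        value = sample[key]
        totals[value] += 1
        if sample["failed"]:
            failures[value] += 1
    return {value: (failures[value], totals[value]) for value in totals}


# Hypothetical samples, standing in for rows pulled from logs or telemetry.
samples = [
    {"device": "phone-a", "env": "production", "failed": True},
    {"device": "phone-a", "env": "development", "failed": True},
    {"device": "phone-b", "env": "production", "failed": False},
    {"device": "phone-b", "env": "development", "failed": False},
    {"device": "phone-a", "env": "production", "failed": True},
]

by_device = failure_rate_by(samples, "device")
by_env = failure_rate_by(samples, "env")

for name, table in (("device", by_device), ("env", by_env)):
    for value, (failed, total) in sorted(table.items()):
        print(f"{name}={value}: {failed}/{total} failed")
```

In this made-up data every failure is on one device while both environments are affected, which would point the investigation at the device rather than the environment.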
Based on the clues you've carefully written down, you might have some
intuition of where the problem is potentially happening: this step
relies on your understanding of the system.
Another way to build this intuition is to look for the first point
at which reality diverges from your mental model of the system.
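One way to make that search mechanical, sketched under the assumption that the system can be broken into stages you can run one at a time: write down what you expect after each stage, then report the first stage where the actual value disagrees. The pipeline, the expected values and the helper =first_divergence= are invented for illustration.

```python
def first_divergence(stages, value, expected):
    """Run value through each stage in order.

    Returns (stage_name, expected_value, actual_value) for the first stage
    whose output differs from the prediction, or None if all of them match.
    """
    for (name, stage), want in zip(stages, expected):
        value = stage(value)
        if value != want:
            return name, want, value
    return None


# Hypothetical three-stage pipeline.
stages = [
    ("strip", str.strip),
    ("lower", str.lower),
    ("split", lambda s: s.split(",")),
]

# What my mental model predicts after each stage.
expected = ["A, B", "a, b", ["a", "b"]]

result = first_divergence(stages, "  A, B ", expected)
print(result)
```

Here the first two stages behave as predicted and the mismatch shows up at =split=, which keeps the space after the comma. The bug is either in that stage or in the mental model, and either answer is progress.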
Make a hypothesis first, predict what will happen, and go and apply
the change. Change one thing at a time to make sure you can reason
clearly around cause and effect.
Figure out how to test changes cheaply and have a tight feedback
loop. Consistently look for reasons to disprove your hypothesis.
There will be times when it'll be multiple tiny bugs coming together
in wonderfully imaginative ways, where you'll need to apply multiple
fixes. Remember to maintain a record of everything that's been tried!
** Think laterally
Sometimes, things just don't fit together, or there's just no time to play
"scientist" and experiment. You can always try to brute force the
solution by treating the system like a black box and looking outside it.
By looking at when the problem started happening, you can understand
what changed. Look through the commit history; look at the changes in the
surrounding systems; look through updates to hosts and related
applications. Perhaps something interesting happened in the world,
causing load on the service to spike and servers to fall over.
If nothing in change logs or release notes looks out of the ordinary, fall
back to binary search. Simply revert the system to a known good state,
and walk forward till it starts failing; alternatively, go backwards
till it starts working.
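Walking forward one version at a time works, but if the history is ordered and the system stays broken once it breaks, halving the range each time gets there in far fewer checks. A minimal sketch, where the revision names and the =is_good= check are placeholders for whatever "deploy this version and see if it fails" means for your system:

```python
def first_bad(revisions, is_good):
    """Return the index of the first failing revision.

    Assumes revisions[0] is known good, revisions[-1] is known bad, and
    that every revision after the first bad one is also bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(revisions[mid]):
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical history in which "r6" introduced the bug.
revisions = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"]
checked = []


def is_good(revision):
    # Stand-in for actually building, deploying and testing the revision.
    checked.append(revision)
    return revisions.index(revision) < revisions.index("r6")


culprit = revisions[first_bad(revisions, is_good)]
print(culprit, "found after checking", checked)
```

With eight revisions this needs three checks instead of up to six. For commit history specifically, =git bisect= automates the same search.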
** Step away
If nothing else works, taking a break from the problem and letting
your subconscious deal with it can produce really good results.
Distract yourself with other, preferably nontechnical things: go for a
walk, move away from the computer, doodle, and recover.
** Fix, verify and celebrate!
Once we have an acceptable fix, we need to release it, which is
something I'll let you deal with. Make sure the actual and expected
behavior match up again.
Celebrate! It's time to savor victory: recover from the stress and
take a break.
Things are right with the world again.
** Learn from it!
You aren't done yet.
There's a lot to learn from a complex bug, and a lot to do to make
it simpler to deal with the next one.
*** Build your own knowledge
Learning and becoming more familiar with tools that could help debug
the problem sooner can help speed you up significantly the next time around.
*** Write a postmortem
For meaningful investigations, writing it up and sharing it helps
teach others and potentially speed them up significantly.
*** Make it impossible to happen ever again
Add the right regression tests to prevent it from ever happening
again. Add the right unit tests to exercise the responsible piece of code.
Alternatively, simply delete the outdated and unused abstraction if possible.
*** Make it very easy to debug if it does happen again
Some clues might have been much more helpful than others. If you had
to debug this again, what signal would have made this trivial to
understand and fix?
Add the telemetry, logs, alarms or dashboards that would make this bug
a trivial nuisance if it dares to show itself again.
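As one possible shape for this, here is a small sketch using Python's standard =logging= module: log the identifying details on the way in, and log the same details together with the traceback on failure. The logger name, the =run_task= wrapper and the fields chosen are assumptions for illustration; the right fields are whichever ones you wished you had during the last investigation.

```python
import logging

logger = logging.getLogger("scheduler")  # hypothetical component name


def run_task(task_id, payload, handler):
    """Run one task, logging enough context to debug a failure later."""
    logger.info("task start id=%s size=%d", task_id, len(payload))
    try:
        result = handler(payload)
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("task failed id=%s size=%d", task_id, len(payload))
        raise
    logger.info("task done id=%s", task_id)
    return result
```

The exception is re-raised so the wrapper only adds information and doesn't change behavior.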
* War stories
I've worked on reliability across both facebook.com's web stack and
Facebook for Android, on battery-consuming bugs across the whole suite of
Facebook's Android applications, and maintained my share of small
services that went bump in the night.
My worst night was probably when I was too exhausted to debug, didn't
have anyone to fall back on and simply set a =watch -n 600 pkill $SERVICE=
to keep a semi-broken pipeline running through the night
till I recovered.
My worst couple of days were a graphics driver bug on an HTC device
that wouldn't recycle bitmaps correctly; I never really found a
solution but I was happy enough to prove that it was broken
specifically on that device and not a more general issue.
My worst couple of months were debugging metrics drops while improving
the efficiency of a task scheduler on Facebook for Android. I ended up
writing a significant amount of software, including a custom test
harness to visualize what was happening, before I could figure out what
was wrong.
I like to believe I've earned my software debugging scars. I'm sure
yours are more impressive -- and I'd love to hear about strange issues
you solved, what you do differently from my approach, and any
impressive tricks you used to figure things out: email, Twitter.
** 2020-05-30: Published.
I've written about this as a single player game, but making it
multi-player -- particularly with experienced players -- can be
significantly better for your time to resolution and sanity.
Leaving it unfixed is not a decision to be taken lightly: bugs,
particularly heisenbugs, compound rapidly and can render a codebase
impossible to work with or improve. Incurring technical debt should be
a very explicit, conscious decision.
I actually use this as an exercise every time I add logging to
a system: I hypothesize debugging an issue in the surrounding code,
and consider what information would help me. Then add the logs.
I'm too lazy to build a comment system for the blog, but
I'll happily copy/paste interesting comments in here.