
A Notebook Style Guide

A few months after writing this note I gave a talk at JupyterCon that includes the latest version of the style guide. You can find the talk on YouTube, and the corresponding notebook on GitHub.

After writing and reading several Python notebooks, I've observed patterns that make notebooks significantly easier to write, read and iterate on. This style guide is an attempt at formalizing those patterns.

Like any good style guide, this one contains inherent contradictions, best addressed by a liberal dose of context and a pinch of taste.

The Style Guide

Carefully choose global state, and transform it with pure functions.

Handling this correctly automatically forces an excellent structure on the notebook.

The global, shared state identifies the core of the notebook. Ideally, the state is never mutated in place: instead, each transformation produces a copy held in a new variable.

Pure – side effect free – functions can work on the data, and automatically structure the notebook into stages. Purity also makes it trivial to write a quick test for each function right next to it, using toy inputs.

Retaining the original data and the lack of side effects makes it significantly easier to re-run sections of the notebook confidently.

One common example is the core dataset being analyzed in a notebook: it can be very valuable to keep the original dataframe available, with every transformation creating a copy.
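A minimal sketch of this pattern, using a list of dictionaries as a stand-in for a dataframe so that it runs without extra packages. The names `RAW_ORDERS` and `with_totals`, and the field names, are hypothetical and chosen only for illustration.

```python
from copy import deepcopy

# Global state: the original data, loaded once and never mutated.
# (Hypothetical toy records standing in for the notebook's core dataset.)
RAW_ORDERS = [
    {"item": "pen", "price": 2.0, "quantity": 3},
    {"item": "ink", "price": 5.0, "quantity": 1},
]


def with_totals(orders):
    """Return a new list of records with a 'total' field added.

    Pure: the input list and its records are left untouched.
    """
    result = deepcopy(orders)
    for record in result:
        record["total"] = record["price"] * record["quantity"]
    return result


# Each stage of the notebook gets its own variable.
orders_with_totals = with_totals(RAW_ORDERS)

# Quick test with toy inputs, right next to the function.
assert with_totals([{"item": "x", "price": 1.5, "quantity": 2}])[0]["total"] == 3.0
# The original data is still available, unchanged.
assert "total" not in RAW_ORDERS[0]
```

Because `RAW_ORDERS` is never modified, the cell that defines `orders_with_totals` can be re-run any number of times with the same result.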

Each cell should be responsible for one thing.

A cell can define a function, a class, or a snippet of code to be executed. Alternatively, it can be one paragraph or section of text in the notebook.

Maintaining tight, one-idea cells makes for cleaner diffs and clearer histories for notebooks maintained in source control.

Liberally include assertions and tests through the notebook.

A quick assertion or simple unit test at the end of any function or class definition can prove invaluable in debugging and extending notebooks.

Assertions also allow for quick iteration: while editing a cell, Ctrl + Enter re-runs it and sanity checks its contents.
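As a sketch, a cell might define a function and check it immediately. `normalize` is a hypothetical helper used only for illustration.

```python
def normalize(values):
    """Scale a list of numbers so that they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


# Assertions at the end of the cell run every time the cell is executed.
assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]
assert abs(sum(normalize([3, 4, 5])) - 1.0) < 1e-9
```

If an edit to `normalize` breaks either expectation, the cell fails right away instead of surfacing as a confusing result further down the notebook.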

Notebooks must be written like prose.

As true as this statement is for code, it's even more true for a Notebook. A good notebook must be written keeping the audience in mind: emphasizing code and prose appropriately.

Accordingly, style guides for writing apply perfectly: I strongly recommend On Writing Well.

Structure the notebook clearly with well-defined headings.

Use headings liberally to structure the notebook into digestible pieces.

Most reasonable renderers will also generate a table of contents from the headings, which makes them even more valuable for navigating the document and getting a quick overview.

Notebooks should follow best practices for programming.

Code within notebooks should be carefully structured to stand well by itself as a program.

The standards we've adopted for good design don't disappear because it's an interactive environment:

  • abstract well, and have consistent levels of abstraction.
  • balance coupling and cohesion.
  • trade off YAGNI and DRY as appropriate.

An interactive environment gives even more opportunities to get it right and refactor quickly; though tooling support for refactoring in most notebook clients tends to be non-existent.

Simple rules of software engineering also apply: stick to the PEPs, avoid lint errors and maintain conventions.

Notebooks should be reproducible.

Reproducibility depends on the nature of the notebook: it doesn't necessarily mean that re-running a notebook should produce exactly the same outputs, but the central thesis of the notebook should stand.

The underlying data – or the random values feeding the notebook – should be allowed to update without breaking the notebook.
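One way to make this concrete, sketched here with hypothetical names (`sample_mean`, the seeds, and the tolerance are all assumptions): assert on the claim the notebook makes rather than on exact values, so the check survives a change of seed or data.

```python
import random


def sample_mean(seed, n=10_000):
    """Mean of n uniform draws from [0, 1) using a seeded generator."""
    rng = random.Random(seed)  # local generator: no hidden global state
    return sum(rng.random() for _ in range(n)) / n


# The thesis ("the mean is close to 0.5") should hold for any seed,
# even though the exact value differs from seed to seed.
for seed in (0, 1, 2):
    assert abs(sample_mean(seed) - 0.5) < 0.05
```

Passing the seed explicitly keeps the function pure: the same seed always gives the same value, while a different seed still supports the same conclusion.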

While it may not be feasible to snapshot and include all the data used within a notebook, where and how to access it should be clearly documented.

Similarly, there should be a clear description of the packages, libraries and potentially even hardware required to re-run the notebook.
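A small sketch of how a notebook might record its own environment in its first cell, using only the standard library. `describe_environment` is a hypothetical helper, and the package names passed to it are placeholders for whatever the notebook actually imports.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


# Hypothetical package list; replace with what the notebook imports.
for key, value in describe_environment(["numpy", "pandas"]).items():
    print(f"{key}: {value}")
```

Leaving this output visible in the published notebook tells a reader which versions produced the results, without requiring them to run anything.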

Notebooks should be executable directly with a "run-all".

Few things signal a sloppy notebook more than one which fails to execute with "Run all cells".

Ensure that functions and variables are available in the right order.

One simple sanity check is to execute "Run All" successfully as a precursor to publishing a notebook.
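As a complement to actually running everything, the saved notebook file can be inspected: after a fresh kernel and a successful "Run All", the code cells carry execution counts 1, 2, 3, and so on. The sketch below checks for that using the `cells`, `cell_type`, `source`, and `execution_count` fields of the notebook JSON. `executed_in_order` and `load_notebook` are hypothetical helpers, and the check is only a heuristic: it does not re-run anything.

```python
import json


def executed_in_order(notebook):
    """True if every non-empty code cell ran, in order, starting from 1."""
    counts = [
        cell.get("execution_count")
        for cell in notebook["cells"]
        if cell["cell_type"] == "code" and "".join(cell["source"]).strip()
    ]
    return counts == list(range(1, len(counts) + 1))


def load_notebook(path):
    """Read an .ipynb file (which is plain JSON) into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Toy notebooks, written inline so the example is self-contained.
fresh = {"cells": [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 1},
    {"cell_type": "code", "source": ["x + 1"], "execution_count": 2},
]}
stale = {"cells": [
    {"cell_type": "code", "source": ["x = 1"], "execution_count": 3},
    {"cell_type": "code", "source": ["x + 1"], "execution_count": 2},
]}

assert executed_in_order(fresh)
assert not executed_in_order(stale)
```

A real file would be checked with `executed_in_order(load_notebook(path))`, where the path is whatever notebook is about to be published.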

Minimize noise from unintentional output.

Libraries and function calls can be noisy, generating query progress messages, incremental logs, progress bars or other unnecessary output.

Eliminate these to minimize visual noise in the notebook.

At the same time, be very intentional about retaining all potentially useful information for anyone simply reading the notebook.

For example, %%capture in a Jupyter notebook can help suppress unnecessary output.
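`%%capture` is an IPython cell magic, so it only works inside a notebook cell. As a rough plain-Python analogue (a sketch, not a replacement for the magic), `contextlib.redirect_stdout` from the standard library silences a single call while keeping its return value. `noisy_square` is a hypothetical stand-in for a chatty library call.

```python
import contextlib
import io


def noisy_square(x):
    """Hypothetical stand-in for a chatty library call."""
    print("connecting... fetching... done")  # progress noise
    return x * x


# Swallow stdout for the duration of the call, but keep the result.
captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    result = noisy_square(4)

print(result)
```

The suppressed text remains available through `captured.getvalue()`, should it turn out to be useful after all.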

References

Books, papers, etc.

  • Donald E. Knuth. 1984. Literate Programming. The Computer Journal. British Computer Society 27 (2): 97–111. 10.1093/comjnl/27.2.97