Cleaning Jupyter notebooks in a Git repository

# Context

Jupyter notebooks contain the code to execute, the outputs, and metadata related to the notebook. Among this metadata is, for example, the number of times (“execution_count”) a given cell has been run.

Below is an extract of a notebook file. Fields such as execution_count and outputs are the data that will produce noisy commits.

"cells": [
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
"data": {
"text/plain": [
"metadata": {},
"output_type": "display_data"
"source": [
"R_INSTALL_DIR=paste(\"../build\", \"wrp/sdrCore/R\", \"\", sep=\"/\" )\n",
"dyn.load(paste(R_INSTALL_DIR, \"sidres\", .Platform$dynlib.ext, sep=\"\"))\n",
"source(paste(R_INSTALL_DIR, \"sidres.R\", sep=\"\" ) )\n",

Keeping the outputs and metadata such as execution_count in a git repository can be annoying: each time someone runs a notebook, this data changes, even if the code itself has not changed. This produces git commits containing irrelevant changes and makes the real changes harder to read.

The nbstripout script strips the outputs and metadata from a notebook, so that git only sees the code parts when checking whether the notebook was modified.
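To make the idea concrete, here is a minimal Python sketch of this kind of cleaning. It is not nbstripout's actual implementation: the function name strip_notebook and the sample notebook are hypothetical, and the sketch only clears the outputs and execution_count of code cells, whereas the real tool handles more cases.

```python
import copy
import json


def strip_notebook(notebook: dict) -> dict:
    """Return a copy of the notebook without outputs or execution counts.

    Illustrative only: this is an assumption about what "cleaning" means,
    not the code used by nbstripout.
    """
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop everything the run produced
            cell["execution_count"] = None  # forget how many times it was run
    return cleaned


# Hypothetical notebook: one code cell that has been executed once.
first_run = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["42\n"]}
            ],
            "source": ["print(42)\n"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Same code, run again later: only the execution counter differs.
second_run = copy.deepcopy(first_run)
second_run["cells"][0]["execution_count"] = 7

# The raw files differ, so git would report a change.
print(first_run == second_run)

# Once stripped, both versions are identical: nothing to commit.
print(strip_notebook(first_run) == strip_notebook(second_run))

# What git would store for the cell after cleaning.
print(json.dumps(strip_notebook(second_run)["cells"][0], indent=1))
```

The function works on a copy, so the notebook in the working directory keeps its outputs. This mirrors the way a git clean filter only changes what is stored in the repository.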

# Installation of nbstripout with conda

The setup has three parts:

  • nbstripout is installed in the base environment of conda, to make it available in all projects:
conda install -c conda-forge nbstripout
  • ${HOME}/.gitattributes contains:
*.ipynb    filter=nbstripout
*.ipynb    diff=ipynb
  • ${HOME}/.gitconfig contains the following sections:
[filter "nbstripout"]
 clean = nbstripout
 smudge = cat 
 required = true
[diff "ipynb"]
 textconv = nbstripout -t

The ${HOME}/.gitconfig file can be edited either with your favourite editor (which should be vim, anyway) or with the following git commands:

git config --global filter.nbstripout.clean nbstripout
git config --global filter.nbstripout.smudge cat
git config --global filter.nbstripout.required true
git config --global diff.ipynb.textconv "nbstripout -t"
