It Works On My Machine

Ensuring Python apps work everywhere, forever

November 30, 2022

tl;dr

Ensuring Python reproducibility is a rabbit hole. But the good news is that you can make your code way more reproducible with minimal effort.

Simply make a virtual environment for your project, configure the environment as usual, and then use this gist to lock in your environment.
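
In other words, something like this (a minimal sketch; the package list is a placeholder, and get_package_versions.py is the gist shown under Level 1):

# Make a project-specific environment and lock it in
python3 -m venv env
source env/bin/activate
python3 -m pip install <your packages>           # configure the environment as usual
./get_package_versions.py >> requirements.txt    # append pinned versions to requirements.txt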

Continue reading to venture down the rabbit hole.

Intro

This blog post is for anyone who writes Python code and wants their code to reliably work on other machines. I’ll discuss a series of ways reproducibility can be achieved, each with increasing levels of reliability and effort, and each with diminishing returns.

Achieving Python reproducibility is definitely an instance where the Pareto principle (80-20 rule) should be applied. Unfortunately, most Python developers put very little effort (<10%) into this issue, and thus their code isn’t very reproducible (<10%). My aim for this blog post is to show you how with very little effort (20%) you can make your Python code a lot more reproducible (80%).

I will assume you already have a list of packages on which your code depends (e.g., in a requirements.txt file), which I will refer to as Level 0 reproducibility.

# requirements for this website
feedgen
lxml
pip
python-dateutil
pytz
PyYAML
setuptools
six

Level 0 (<10% effort, <10% reproducibility)

You’re already here, and your code works, so what can go wrong? The most likely way the code will break is if one of your dependencies introduces a breaking change: when the packages are installed on another machine (or at a later date), the versions installed may differ from the ones your code was written against.

Do you even know the version of every package you have installed? Of course you don’t.
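
If you want a quick answer to that question, pip can tell you what is actually installed in the current environment (standard pip commands):

# See exactly what's installed right now
python3 -m pip list     # human-readable table of package names and versions
python3 -m pip freeze   # the same information in requirements.txt format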

Level 1: Pin to minor version (~10% effort, ~50% reproducibility)

We can level up our requirements.txt by pinning the packages to the versions that are currently installed. This ensures that the packages installed on another machine match the ones your code is known to work with.

feedgen==0.9.*
lxml==4.9.*
pip==22.3.*
python-dateutil==2.8.*
pytz==2022.6.*
PyYAML==6.0.*
setuptools==63.2.*
six==1.16.*

Notice here that I’ve pinned these packages to the minor version, but left a wildcard in the bug-fix version. Provided that the package maintainers are adhering to semantic versioning practices, this will allow newer versions to be installed if they fix bugs, but these versions should not introduce any breaking changes.
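
For example, pip understands these wildcard specifiers directly (lxml here is just one of the packages from the list above):

# Installs the newest 4.9.x release (e.g. 4.9.2), but never 4.10.0 or later
python3 -m pip install "lxml==4.9.*"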

You can use this gist to automatically generate a requirements.txt file with pinned packages from your live environment. Running this script is a near-zero effort action you can take to help your code run on other machines.

#!/usr/bin/python3
# Generate requirements.txt automatically from the live environment
# Run like this: ./get_package_versions.py >> requirements.txt
# Requires pip>=22.* (if this fails, try: python3 -m pip install --upgrade pip)
from pip._internal.metadata import get_default_environment


def wildcard_bugfix(version: str) -> str:
    """Keep major.minor, wildcard the bug-fix part (e.g. 4.9.1 -> 4.9.*)."""
    return ".".join(version.split(".")[:2]) + ".*"


packages = sorted(
    get_default_environment().iter_installed_distributions(),
    key=lambda dist: str(dist).lower(),
)
print("\n".join(
    f"{dist.metadata_dict['name']}=={wildcard_bugfix(dist.metadata_dict['version'])}"
    for dist in packages
))

Level 2: Use virtual environments (~20% effort, ~80% reproducibility)

This is the level that offers the most reproducibility for the least effort.

When you do pip install new_package, pip’s dependency resolver prioritises new_package’s dependencies, and may upgrade or replace packages that your previously installed packages depend on, breaking them. Now, because you’ve already got a requirements.txt file with pinned versions, you shouldn’t ever need to run pip install new_package. But you might do this inadvertently if you’re using a system-wide Python environment.
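
As an aside, if you suspect an ad-hoc install has already broken something, pip can report inconsistent dependencies:

# Reports installed packages whose declared dependencies are missing or incompatible
python3 -m pip check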

This issue can be resolved by using virtual environments. Simply run:

python3 -m venv env
source env/bin/activate
python3 -m pip install -r requirements.txt
python3 my_code.py
deactivate

Now you have an isolated environment that you’re not likely to tamper with, and it’s also a fresh environment that other people can start from to run your code.

Level 3: Use conda (~50% effort, ~90% reproducibility)

From this level onwards, diminishing returns really take hold.

One issue with pip + venv is that it doesn’t manage system-level/non-Python dependencies. conda does, and it also manages environments and installs a consistent set of packages. If your code depends heavily on non-Python dependencies (e.g., torch requires cuda), then conda will automatically install the correct versions of these dependencies.

While using conda comes with additional reproducibility benefits, it also has other issues. For example, not all pip-installable packages are conda-installable packages. If you rely on a package that isn’t conda-installable, then you might have to use a combination of pip and conda. This type of environment is particularly precarious, since conda cannot manage the pip packages in the environment. conda also requires additional effort to configure and is less well-known than pip.
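
If you do end up needing pip-only packages, conda environment files can at least declare them under a pip: entry, so everything stays recorded in one place. A minimal sketch (the full env.yml format is shown below; the package name here is just a placeholder):

name: my_env
channels:
  - conda-forge
dependencies:
  - python=3.10.*
  - pip
  - pip:
      - some-pip-only-package==1.2.*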

A conda environment can be managed similarly to a venv virtual environment, and can be specified in an env.yml file, analogous to a requirements.txt file. See the examples below:

# An example of a `env.yml` file
name: my_env
channels:
  - nvidia
  - pytorch
  - conda-forge
dependencies:
  - black=21.*
  - cython=0.29.*
  - fastparquet=0.8.*
  - graph-tool=2.44.*
  - gudhi=3.5.*
  - hdbscan=0.8.*
  - ipywidgets=7.6.*
  - networkx=2.6.*
  - numpy=1.22.*
  - pandas=1.3.*
  - pandas-profiling=3.1.*
  - papermill=2.3.*
  - ploomber=0.21.*
  - python=3.10.*
  - pyyaml=6.0.*
  - ripser=0.6.*
  - sacred=0.8.*
  - scikit-learn=1.0.*
  - scipy=1.7.*
  - seaborn=0.11.*
  - tqdm=4.62.*
  - typer=0.4.*
  - umap-learn=0.5.*
  - widgetsnbextension=3.5.*
  # torch
  - pytorch::pytorch=1.12
  - pytorch::torchvision
  - pytorch::torchaudio
  - nvidia::cudatoolkit=10.2
# An example of how to create a conda environment
conda env create --name my_env --file=env.yml
conda activate my_env
python3 my_code.py
conda deactivate
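
If you want to record the exact versions conda actually solved for (rather than the fuzzy pins in env.yml), you can also export the live environment; a small extra step in the same spirit as the pip gist above:

# Write out the fully-resolved environment for stricter lock-in
# (--no-builds omits platform-specific build strings, which helps portability)
conda env export --name my_env --no-builds > env.lock.yml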

Level 4: Docker (~80% effort, ~95% reproducibility)

The next level of reproducibility is achieved with the use of docker. docker allows you to specify essentially the entire software stack so that an environment can be reproduced exactly.

Here is an example of a Dockerfile that uses conda:

# This image comes with conda pre-installed
# As you've probably guessed, it's good practice to specify the versions of everything, not just python packages
FROM jupyter/minimal-notebook:2022-11-28
# Install dependencies
COPY env.yml env.yml
RUN conda env create --file env.yml
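
To actually use it, build the image and run your code inside the environment it defines. A sketch, assuming the Dockerfile above, where my_image is a placeholder tag and my_code.py is mounted in from the host rather than baked into the image:

# Build the image from the Dockerfile in the current directory
docker build -t my_image .
# Run your code inside the conda environment defined by env.yml
docker run --rm -v "$PWD":/work -w /work my_image conda run -n my_env python3 my_code.py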

One way that this method can fail is if the relied-upon packages (be they from docker, pip, or conda) are no longer available. This is not a common problem, but it has bitten me a few times.

Level 5: A bundled docker image (~100% effort, ~99% reproducibility)

While a Dockerfile is a recipe for a software environment, a docker image is a software environment. Docker images can be distributed via registries (e.g., hub.docker.com), or by saving them directly to a file with docker save, as shown below.
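
For example, the full round trip looks something like this (my_image is a placeholder tag):

# On the machine that built the image: write it to a tarball
docker save -o my_image.tar my_image
# On any other machine: load the tarball and the image is ready to run
docker load -i my_image.tar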

This will ensure that you have all the software needed to run your code. As far as limitations go, obviously firmware/drivers (i.e., software that interfaces directly with hardware) are not managed by docker, and so could vary from system to system. But this blog post has to end somewhere, so I think I’ll leave it at that.