It Works On My Machine
Ensuring Python apps work everywhere, forever
November 30, 2022
tl;dr
Ensuring Python reproducibility is a rabbit hole. But, the good news is you can make your code way more reproducible with minimal effort.
Simply make a virtual environment for your project, configure the environment as usual, and then use this gist to lock in your environment.
Continue reading to venture down the rabbit hole.
Intro
This blog post is for anyone who writes Python code and wants their code to reliably work on other machines. I’ll discuss a series of ways reproducibility can be achieved, each with increasing levels of reliability and effort, and each with diminishing returns.
Achieving Python reproducibility is definitely an instance where the Pareto principle (80-20 rule) should be applied. Unfortunately, most Python developers put very little effort (<10%) into this issue, and thus their code isn’t very reproducible (<10%). My aim for this blog post is to show you how with very little effort (20%) you can make your Python code a lot more reproducible (80%).
I will assume you already have a list of packages on which your code depends (e.g., in a requirements.txt file), which I will refer to as Level 0 reproducibility.
# requirements for this website
feedgen
lxml
pip
python-dateutil
pytz
PyYAML
setuptools
six
Level 0 (<10% effort, <10% reproducibility)
You’re already here, and your code works, so what can go wrong? The most likely way the code will break is if one of your dependencies introduces a breaking change. When you install the packages on another machine, the versions that get installed may differ from the ones that work with your code.
Do you even know the version of every package you have installed? Of course you don’t.
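If you want to check, pip can tell you (assuming pip is available in the environment your code runs in):
# List every installed package and its version
python3 -m pip list
# Or print exact pins, one per line (e.g., feedgen==0.9.0)
python3 -m pip freeze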
Level 1: Pin to minor version (~10% effort, ~50% reproducibility)
We can level up our requirements.txt by pinning each package to the version that is currently installed. This ensures that the packages installed on another machine match the ones your code already works with.
feedgen==0.9.*
lxml==4.9.*
pip==22.3.*
python-dateutil==2.8.*
pytz==2022.6.*
PyYAML==6.0.*
setuptools==63.2.*
six==1.16.*
Notice here that I’ve pinned these packages to the minor version, but left a wildcard in the bug-fix (patch) position. Provided that the package maintainers adhere to semantic versioning practices, this allows newer versions to be installed if they fix bugs, but those versions should not introduce any breaking changes.
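For example, a wildcard pin accepts any bug-fix release within the pinned minor version (using lxml purely as an illustration):
# Resolves to the newest 4.9.x release, but never 4.10 or 5.0
python3 -m pip install "lxml==4.9.*"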
You can use this gist to automatically generate a requirements.txt file with pinned packages from your live environment. Running this script is a near-zero-effort action you can take to help your code run on other machines.
#!/usr/bin/python3
# Generate requirements.txt automatically from the live environment
# Run like this: ./get_package_versions.py > requirements.txt
# Requires pip>=22.* (if it fails, try: python3 -m pip install --upgrade pip)
from pip._internal.metadata import get_default_environment


def wildcard_bugfix(version: str) -> str:
    """Pin to the minor version, wildcarding the bug-fix component."""
    return ".".join(version.split(".")[:2]) + ".*"


packages = sorted(
    get_default_environment().iter_installed_distributions(),
    key=lambda package: str(package).lower(),
)
print("\n".join(
    f"{package.metadata_dict['name']}=={wildcard_bugfix(package.metadata_dict['version'])}"
    for package in packages
))
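Typical usage, assuming you’ve saved the gist as get_package_versions.py:
# Pin the live environment, then reinstall from the pinned file elsewhere
./get_package_versions.py > requirements.txt
python3 -m pip install -r requirements.txt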
Level 2: Use virtual environments (~20% effort, ~80% reproducibility)
This is the level that offers the most reproducibility for the least effort.
When you do pip install new_package, pip’s dependency resolver prioritises new_package’s dependencies, and can potentially break the dependencies of previously installed packages. Now, because you’ve already got a requirements.txt file with pinned versions, you shouldn’t ever need to run pip install new_package. But you might do this inadvertently if you’re using a system-wide Python environment.
This issue can be resolved by using virtual environments. Simply run:
python3 -m venv env
source env/bin/activate
python3 -m pip install -r requirements.txt
python3 my_code.py
deactivate
Now you have an isolated environment that you’re not likely to tamper with, and it’s also a fresh environment that other people can start from to run your code.
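If you want to double-check that you’re inside the virtual environment rather than the system one:
# Should print a path ending in env/bin/python3
which python3
# Should list little more than pip, setuptools, and your requirements
python3 -m pip list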
Level 3: Use conda (~50% effort, ~90% reproducibility)
From this step onwards, diminishing returns really take hold.
One issue with pip + venv is that it doesn’t manage system/non-Python dependencies. conda does this, and it also manages environments and installs a consistent set of packages. If your code is highly dependent on system/non-Python dependencies (e.g., torch requires CUDA), then conda will automatically install the correct versions of these dependencies.
While using conda comes with additional reproducibility benefits, it also has its own issues. For example, not all pip-installable packages are conda-installable. If you rely on a package that isn’t conda-installable, then you might have to use a combination of pip and conda. This type of environment is particularly precarious, since conda cannot manage the pip packages in the environment.
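If you do end up mixing the two, you can at least declare the pip packages inside the same env.yml so everything is installed in one step (the package name below is a placeholder):
# Fragment of an env.yml mixing conda and pip dependencies
dependencies:
- python=3.10.*
- pip=22.*
- pip:
  - some-pip-only-package==1.2.*  # placeholder: any pip-only dependency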
conda also requires additional effort to configure and is less well-known than pip.
A conda environment can be managed similarly to a venv virtual environment, and can be specified in an env.yml file, much like a requirements.txt file. See the examples below:
# An example of a `env.yml` file
name: my_env
channels:
- nvidia
- pytorch
- conda-forge
dependencies:
- black=21.*
- cython=0.29.*
- fastparquet=0.8.*
- graph-tool=2.44.*
- gudhi=3.5.*
- hdbscan=0.8.*
- ipywidgets=7.6.*
- networkx=2.6.*
- numpy=1.22.*
- pandas=1.3.*
- pandas-profiling=3.1.*
- papermill=2.3.*
- ploomber=0.21.*
- python=3.10.*
- pyyaml=6.0.*
- ripser=0.6.*
- sacred=0.8.*
- scikit-learn=1.0.*
- scipy=1.7.*
- seaborn=0.11.*
- tqdm=4.62.*
- typer=0.4.*
- umap-learn=0.5.*
- widgetsnbextension=3.5.*
# torch
- pytorch::pytorch=1.12
- pytorch::torchvision
- pytorch::torchaudio
- nvidia::cudatoolkit=10.2
# An example of how to create a conda environment
conda env create --name my_env --file=env.yml
conda activate my_env
python3 my_code.py
conda deactivate
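Analogous to the pip gist above, conda can generate this file from a live environment (--no-builds omits platform-specific build strings, keeping the file more portable):
# Snapshot the active environment, including all transitive dependencies
conda env export --no-builds > env.yml
# Or record only the packages you explicitly asked for
conda env export --from-history > env.yml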
Level 4: Docker (~80% effort, ~95% reproducibility)
The final level of reproducibility is achieved with the use of docker. docker allows you to essentially specify the entire software stack, so that an environment can be exactly reproduced.
Here is an example of a Dockerfile that uses conda:
# This image comes with conda pre-installed.
# As you've probably guessed, it's good practice to specify the versions
# of everything, not just Python packages.
FROM jupyter/minimal-notebook:2022-11-28

# Install dependencies
COPY env.yml env.yml
RUN conda env create --file env.yml
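Building and testing the image looks like this (the my_image tag is arbitrary; to run your own code you’d also COPY it into the image):
# Build the image from the directory containing the Dockerfile and env.yml
docker build --tag my_image .
# Run a command inside the conda environment defined by env.yml
docker run --rm my_image conda run -n my_env python3 --version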
One way that this method can fail is if the relied-upon packages (be they from docker, pip, or conda) are no longer available. This is not a common problem, but it has bitten me a few times.
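One partial mitigation, at least for the base image, is to pin it by digest rather than by tag, since tags can be re-pointed or deleted while a digest names one exact build (the digest below is a placeholder; docker images --digests shows real ones):
# A digest identifies a single immutable image build
FROM jupyter/minimal-notebook@sha256:<placeholder-digest>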
Level 5: A bundled docker image (~100% effort, ~99% reproducibility)
While a Dockerfile is a recipe for a software environment, a docker image is a software environment. Docker images can be distributed via registries (e.g., hub.docker.com), or by saving them directly (e.g., docker save -o my_image.tar centos:7).
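On the receiving machine, the matching command restores the bundled image:
# Load the image tarball into the local docker daemon
docker load -i my_image.tar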
This will ensure that you have all the software needed to run your code. As far as limitations go, firmware/drivers (i.e., software that interfaces directly with hardware) are obviously not managed by docker, and so could vary from system to system. But this blog post has to end somewhere, and so I think I’ll leave it at that.