Open Reproducible Data Science Workflow


“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.” — J. B. Buckheit and D. L. Donoho, 1995


Key themes

  • The role of transparency and reproducible workflows in scientific research.

  • The reproducibility problem.

  • Plausible sources of the reproducibility problem:

    • Biases in data analysis, including p-hacking, HARKing (Hypothesizing After Results are Known), and publication bias

    • Lack of transparency of computer code, data, and materials.

  • Rules for open and reproducible workflows

  • Open-source tools for reproducible research, including Jupyter and Colab notebooks, Markdown, dependency management via pip, and version control.


Learning resources

Concepts

Regina Nuzzo. How scientists fool themselves – and how they can stop. Nature.

Christie Aschwanden. Science Isn’t Broken—It’s just a hell of a lot harder than we give it credit for. FiveThirtyEight.

Marcus Munafò et al. (2017) A manifesto for reproducible science. Nature Human Behaviour.

Jeffrey Perkel (2018) Why Jupyter is data scientists’ computational notebook of choice. Nature.

Tom Hardwicke et al. (2019) Calibrating the scientific ecosystem through meta-research. Annual Review of Statistics and Its Application.

Garret Christensen, Jeremy Freese, Edward Miguel (2019) Chapter 11: Reproducible Workflow. In Transparent and Reproducible Social Science Research: How to Do Open Science. University of California Press.

Reproducibility: The Basics. (With Brian Nosek)

Researcher degrees of freedom, P-hacking, and P-curve. (By Berkeley Initiative for Transparency in the Social Sciences, BITSS)

Tutorials

Reproducible Data Analysis in Jupyter. Jake Vanderplas.

Getting Started With the Open Science Framework (OSF). Center for Open Science.

Data Science Productivity Tools: Creating a GitHub Repository, Using git at the Command Line, Git and GitHub. Rafael Irizarry.

Introduction to Open Reproducible Science Workflows. Earth Lab CU Boulder.

Welcome To Colaboratory. Google Colaboratory.

Markdown in Jupyter Notebook. DataCamp.

Markdown Guide. Google Colaboratory.

The Markdown Guide. Matt Cone and collaborators.


What is computational reproducibility and why does it matter?

The terms reproducibility (or computational reproducibility) and replicability are sometimes used interchangeably, but they differ.

Reproducibility means “obtaining consistent results using the same input data, computational steps, methods, and conditions of analysis”; it is synonymous with computational reproducibility.

Replicability “means obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.”

Why do reproducibility and replicability matter? “Reproducibility and replicability are often cited as hallmarks of good science. Being able to reproduce the computational results of another researcher starting with the same data and replicating a previous study to test its results facilitate the self-correcting nature of science.”

Why do we study computational reproducibility right now? “Computational reproducibility is more prominent now than ever because of the growth in reliance on computing across all of science. When a researcher reports a study and makes the underlying data and code available, those results should be computationally reproducible by another researcher.”

Source: Reproducibility and Replicability in Science (2019), National Academies of Sciences, Engineering, and Medicine

Our focus is on the reproducibility of our data analysis workflow, and on computational reproducibility in particular.

Open Reproducible Workflow in Jupyter/Colab Notebooks

An important aspect of reproducible research is the integration of various components, including data gathering, data manipulation, data analysis and outputs in an open research workflow.

Jupyter and Colab notebooks are open-source web applications that allow you to create and share documents containing code, equations, visualisations and narrative text. While a popular tool for data exploration, the notebook can also support your reproducible research workflow by integrating executable code, data inputs, results, and documentation within a single notebook, along with images, HTML, LaTeX, videos and more.

In the previous session, you learned how to use the Jupyter/Colab notebook to:

  • run code interactively using the Python programming language and

  • document your code and outputs using Markdown, an open and easy-to-use markup language for creating formatted text.
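For instance, a Text cell might use Markdown headings, emphasis, and a numbered list to document an analysis step (the write-up below is hypothetical):

```markdown
## Data cleaning

We remove rows with *missing* income values because:

1. they account for a small share of the sample;
2. their missingness appears unrelated to the outcome.
```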

Writing your Python code and Markdown documentation within Jupyter/Colab notebooks facilitates an open and reproducible science workflow through the integration of various components — data inputs, code for data manipulation, analysis, visualisation, and results — within a single file that can be openly shared and communicated with others.

The Jupyter/Colab notebooks support reproducibility but do not guarantee it. In fact, a recent study of 10 million Jupyter notebooks hosted on GitHub found that 36 per cent of the notebooks could not be reproduced because their code cells had not originally been executed in linear order. The tool alone is not sufficient for reproducible data analysis. We also need a reproducible research workflow that helps us transition from “nonlinear, interactive, trial-and-error style of exploration to a more linear and reproducible analysis based on organized, packaged, and tested code” (Jake VanderPlas, 2017).
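To see why execution order matters, consider a minimal, hypothetical pair of cells executed out of order in an interactive session:

```python
# Cell 2 (saved BELOW Cell 1 in the notebook) was executed FIRST:
x = 41

# Cell 1 (saved ABOVE Cell 2) was executed SECOND, so x was already defined:
y = x + 1
print(y)  # prints 42 in the interactive session

# "Restart and run all" replays the cells in their saved top-to-bottom order,
# so y = x + 1 would run before x exists and raise a NameError: the notebook
# that "worked" interactively cannot be reproduced linearly.
```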

Figure 1. Open and reproducible scientific workflow using Jupyter notebook and related open-source tools. Source: Juliette Taka and Nicolas M. Thiery. Publishing reproducible logbooks explainer comic strip. Zenodo. DOI: 10.5281/zenodo.4421040 (2018).

Rules for Reproducible Workflow

Let’s consider a few simple rules for a reproducible research workflow with Jupyter/Colab notebooks (Rule et al., 2019):

Document the process, not just the results — “… make sure to document all your explorations, even (or perhaps especially) those that led to dead ends. These comments will help you remember what you did and … why you chose a particular parameter value, where you copied a block of code from, or what you found interesting about an intermediate result.”

Use cell divisions to make steps clear — “… try to make each cell in your notebook perform one meaningful step of the analysis that is easy to understand from the code in the cell or the surrounding markdown description. Modularize your code by cells and label the cells with markdown above the cell. Think of each cell as being one paragraph, having one function, or accomplishing one task (for example, create a plot).”

Modularize code — “It is always good practice to avoid duplicate code, but in notebooks, it is especially easy to copy a cell, tweak a few lines, paste the resulting code into a new cell or another notebook, and run it again. This form of experimentation is expedient but makes notebooks difficult to read and nearly impossible to maintain if you want to change the functionality of or fix a bug in the copied code. Instead, wrap code you are about to copy and reuse in a function, which you can then call from as many cells as desired.”
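A minimal sketch of this rule in Python (the function and data are illustrative): define the shared logic once, then call it from as many cells as needed.

```python
def summarize(values):
    """Return the mean and range of a list of numbers.

    Defined once, the function replaces copy-pasted, slightly tweaked
    versions of the same code scattered across cells.
    """
    mean = sum(values) / len(values)
    return mean, max(values) - min(values)

# Cell A: summarise one variable
print(summarize([2, 4, 6]))      # (4.0, 4)

# Cell B: reuse the same function for another variable, without copy-pasting
print(summarize([10, 20, 60]))   # (30.0, 50)
```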

Record dependencies — “Rerunning your analysis in the future will require accessing not only your code but also any module or library that your code relied on.”

Use version control — “Version control is a critical adjunct to notebook use because the interactive nature of notebooks makes it easy to accidentally change or delete important content. Furthermore, since notebooks contain code and code inevitably contains bugs, being able to determine the history of when a given bug you have discovered was introduced to the code versus when it was fixed—and thus what analyses it may have affected—is a key capability in scientific computation.”
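As a minimal sketch with git (file names are hypothetical, and the sketch works in a throwaway directory; in practice you would run these commands in your project folder):

```shell
set -e
cd "$(mktemp -d)"                          # throwaway directory for this sketch
git init -q
git config user.email "you@example.com"    # only needed if git is unconfigured
git config user.name "Your Name"

touch analysis.ipynb requirements.txt      # stand-ins for notebook and dependencies

# Record a snapshot of the notebook and its dependency list
git add analysis.ipynb requirements.txt
git commit -q -m "Add initial analysis notebook and dependency list"

# Later: trace when a change (or a bug) entered the notebook
git log --oneline -- analysis.ipynb
```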

Design your notebooks to be read, run, and reused — “… store your notebooks in a public code repository with a clear README file and a liberal open source license granting permission to reuse your code.”

Share and explain your data — “Having access to a clearly annotated notebook is of little use to those wanting to reproduce or extend your results if the underlying data are locked away. Strive to make your data or a sample of your data publicly available along with the notebook.”

  • Data repositories for medium- to large-sized anonymized data include figshare, Zenodo, and Dryad.

Reproducible research report

To create a reproducible research report, use the following throughout the notebook:

  • Python code in Code cells

  • The hash symbol # in Code cells to introduce comment lines describing your Python code. Code commenting is an important part of computational data analysis.

  • Markdown language in Text cells to write up your methods, results, and interpretation.
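For example, a Code cell pairing a descriptive comment with the code it documents (the data are hypothetical):

```python
# Compute the mean commute time in minutes for a hypothetical sample
commute_minutes = [22, 35, 41, 18, 27]
mean_commute = sum(commute_minutes) / len(commute_minutes)
print(mean_commute)
```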

At the end of a session, rerun your notebook from top to bottom using Restart and run all (under Runtime in the Colab menu bar) to ensure computational reproducibility.

Recording dependencies

Throughout the textbook, we will refer to these rules for a reproducible research workflow. We illustrate below a few simple rules (e.g., comment your code; use cell divisions to make your steps clear) with a focus on recording software dependencies, a key prerequisite for computational reproducibility.

Reproducing your data analysis in the future will require reusing not only your data and code but also the modules and libraries (and their specific versions) that your code employed. It is good practice to record those dependencies so that others, or your future self (i.e., you in a month’s time), can recreate the environment underlying your analysis.

Let’s first determine your Python version. You can check it in a Code cell by typing !python --version.

# Check Python version
!python --version
Python 3.8.5

To install and manage Python modules and libraries, you can use a package-management system such as pip. Using pip, you can download and install a specific version of a module or library you plan to use in your data analysis. For example, you can install version 0.6 of the causal inference library DoWhy, released on 3 March 2021, by typing !pip install dowhy==0.6. Note that we use two consecutive equals signs (==) instead of a single equals sign (=), which is the assignment operator in Python.

# Install and import a Python library.
# The flag -q is short for --quiet and is used to hide output/warnings
!pip install -q dowhy==0.6
import dowhy

Many modules and libraries come preinstalled in environments such as Colab; you can use pip to record the specific versions of those modules and libraries. For example, the command !pip freeze lists the installed modules and libraries in alphabetical order.

# List installed packages - listed are both packages available with Colab
# and packages we have installed in this session (e.g. the library DoWhy)
!pip freeze
alabaster==0.7.12
anaconda-client==1.7.2
anaconda-navigator==1.10.0
anaconda-project==0.8.3
anyio==3.1.0
...
dowhy==0.6
...
zict==2.0.0
zipp @ file:///tmp/build/80754af9/zipp_1604001098328/work
zope.event==4.5.0
zope.interface @ file:///opt/concourse/worker/volumes/live/de428e3b-00ba-4161-442e-b9e5d25e4219/volume/zope.interface_1602002489816/work
(output truncated; several hundred additional packages omitted)

You can add the redirection character > after the !pip freeze command to save the list of Python packages to a requirements file named requirements.txt (details about the requirements.txt file are provided at the end of the notebook). The > character redirects a command’s output to a file.

!pip freeze > requirements.txt

You can retrieve the version information for a particular package (e.g., numpy or pandas) using the grep command. grep is a command-line tool that searches for a pattern (in our case, the word numpy and the word pandas) and prints each line that matches the pattern.

# Return installed libraries that contain
# the text 'numpy' or 'pandas' in their name
!pip freeze | grep numpy
!pip freeze | grep pandas
numpy @ file:///opt/concourse/worker/volumes/live/5572694e-967a-4c0c-52cf-b53d43e72de9/volume/numpy_and_numpy_base_1603491881791/work
numpydoc @ file:///tmp/build/80754af9/numpydoc_1605117425582/work
geopandas==0.9.0
pandas @ file:///opt/concourse/worker/volumes/live/f14cf8c4-c564-4eff-4b17-158e90dbf88a/volume/pandas_1602088128240/work

As of April 2021, Colab uses numpy 1.19.5 and pandas 1.2.0. It is possible to install and import a different version, for example the latest version of pandas at the time of writing (1.2.4, released on 12 April 2021), by executing the following command:

# Install a specific version of pandas
!pip install pandas==1.2.4
# Note: in Colab, you may need to restart the runtime
# before the newly installed version can be imported
import pandas as pd

For the pip approach to recording package dependencies to work, we need to keep track of every package we use in a notebook. This may not always happen, leaving open the possibility of undocumented dependencies (when we fail to record one or more packages on which our data analysis depends). To address this, you can use a notebook extension such as watermark, which prints a list of the dependencies explicitly imported in your notebook.

# Install the watermark extension
!pip install -q watermark

# Load the watermark extension
%load_ext watermark
# Show packages that were imported
%watermark --iversions
dowhy: 0.6

Once you have identified the versions of all dependencies in your notebook, it is good practice to list them at the bottom of the notebook. In addition, the information is often stored in a requirements.txt file in a GitHub repository. The file simply lists all of your package dependencies and their versions in the following format:

IPython==5.5.0
pandas==1.2.0
dowhy==0.6

You can use the package pipreqsnb to automatically save a list of all Python packages (and their versions) used in your current notebook to a file named requirements.txt. The file requirements.txt will be created in your working directory after you execute the command below.

# Install the pipreqsnb package
!pip install -q pipreqsnb

# Run pipreqsnb after specifying the path to the notebook
!pipreqsnb ../notebooks/03_open_reproducible_workflows.ipynb
pipreqs  --savepath ../notebooks/requirements.txt .//__temp_pipreqsnb_folder/
INFO: Successfully saved requirements file in ../notebooks/requirements.txt

The requirements file (requirements.txt) can then be used by tools like Binder to build a Docker container that recreates the computational environment (including packages and package versions) you used in your data analysis, making your code reproducible by others irrespective of their computational environment (e.g., operating system or software versions). Recreating your computational environment is a key precondition for others to be able to reproduce your data analysis.
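For instance, a minimal Binder-ready repository might be laid out as follows (a hypothetical layout; Binder detects the requirements.txt file and installs the listed packages when building the container):

```
my-analysis/
├── README.md
├── requirements.txt
└── analysis.ipynb
```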