Open Reproducible Data Science Workflow
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”
The role of transparency and reproducible workflow in scientific research.
The reproducibility problem.
Plausible sources of the reproducibility problem:
Biases in data analysis, including p-hacking, HARKing (Hypothesizing After Results are Known), and publication bias
Lack of transparency in computer code, data, and materials.
Rules for open and reproducible workflows
Open-source tools for reproducible research, including Markdown, dependency management via pip, and version control.
Regina Nuzzo. How scientists fool themselves – and how they can stop. Nature.
Christie Aschwanden. Science Isn’t Broken—It’s just a hell of a lot harder than we give it credit for. FiveThirtyEight.
Marcus Munafò et al. (2017) A manifesto for reproducible science. Nature Human Behaviour.
Jeffrey Perkel (2018) Why Jupyter is data scientists’ computational notebook of choice. Nature.
Tom Hardwicke et al. (2019) Calibrating the scientific ecosystem through meta-research. Annual Review of Statistics and Its Application.
Garret Christensen, Jeremy Freese, Edward Miguel (2019) Chapter 11: Reproducible Workflow. In Transparent and Reproducible Social Science Research: How to Do Open Science. University of California Press.
Reproducibility: The Basics. (With Brian Nosek)
Reproducible Data Analysis in Jupyter. Jake VanderPlas.
Getting Started With the Open Science Framework (OSF). Center for Open Science.
Introduction to Open Reproducible Science Workflows. Earth Lab CU Boulder.
Welcome To Colaboratory. Google Colaboratory.
Markdown in Jupyter Notebook. DataCamp.
Markdown Guide. Google Colaboratory.
The Markdown Guide. Matt Cone and collaborators.
What is computational reproducibility and why does it matter?
The terms reproducibility (or computational reproducibility) and replicability are sometimes used interchangeably, but they differ.
Reproducibility means “obtaining consistent results using the same input data, computational steps, methods, and conditions of analysis”; it is synonymous with computational reproducibility.
Replicability “means obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.”
Why do reproducibility and replicability matter? “Reproducibility and replicability are often cited as hallmarks of good science. Being able to reproduce the computational results of another researcher starting with the same data and replicating a previous study to test its results facilitate the self-correcting nature of science.”
Why do we study computational reproducibility right now? “Computational reproducibility is more prominent now than ever because of the growth in reliance on computing across all of science. When a researcher reports a study and makes the underlying data and code available, those results should be computationally reproducible by another researcher.”
Our focus is on the reproducibility of our data analysis workflow, and on computational reproducibility in particular.
Open Reproducible Workflow in Jupyter/Colab Notebooks
An important aspect of reproducible research is the integration of various components, including data gathering, data manipulation, data analysis and outputs in an open research workflow.
Jupyter Notebook is an open-source web application, and Colab is a hosted notebook service built on it. Both allow you to create and share documents that contain code, equations, visualisations and narrative text. While a popular tool for data exploration, the notebook can also support your reproducible research workflow by integrating executable code, data inputs, results, and documentation within a single notebook, along with images, HTML, LaTeX, videos and more.
In the previous session, you learned how to use the Jupyter/Colab notebook to:
run code interactively using the Python programming language, and
document your code and outputs using Markdown, an open and easy-to-use markup language for creating formatted text.
Python code and Markdown documentation within Jupyter/Colab notebooks facilitate an open and reproducible science workflow by integrating various components (data inputs, code for data manipulation, analysis and visualisation, and results) within a single file that can be openly shared and communicated with others.
Jupyter/Colab notebooks support reproducibility but do not guarantee it. In fact, a recent study of 10 million Jupyter notebooks hosted on GitHub found that 36 per cent of the notebooks could not be reproduced because the code cells were not originally executed in a linear order. The tool alone is not sufficient for reproducible data analysis. We also need a reproducible research workflow that helps us transition from a “nonlinear, interactive, trial-and-error style of exploration to a more linear and reproducible analysis based on organized, packaged, and tested code” (Jake VanderPlas, 2017).
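One way to check for the linear-order issue described above is to inspect the execution counters stored in a notebook file. The sketch below assumes the standard .ipynb JSON layout, in which a top-level `cells` list holds code cells that carry an `execution_count`. The function name `executed_in_linear_order` and the in-memory example notebook are hypothetical, introduced only for illustration.

```python
import json


def executed_in_linear_order(notebook: dict) -> bool:
    """Return True if the executed code cells have strictly increasing execution counts."""
    counts = [
        cell.get("execution_count")
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]
    # Cells that were never run have no count; ignore them.
    executed = [count for count in counts if count is not None]
    return all(a < b for a, b in zip(executed, executed[1:]))


# Hypothetical notebook whose second code cell was run before the first.
example_notebook = json.loads("""
{
  "cells": [
    {"cell_type": "markdown", "source": ["# Analysis"]},
    {"cell_type": "code", "execution_count": 2, "source": ["df.head()"]},
    {"cell_type": "code", "execution_count": 1, "source": ["import pandas as pd"]}
  ]
}
""")

print(executed_in_linear_order(example_notebook))  # prints False
```

For a saved notebook, you could load the file with `json.load` and pass the result to the function. A notebook that has just been rerun from top to bottom should pass this check, although passing it does not by itself prove that the analysis is reproducible.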
Figure 1. Open and reproducible scientific workflow using Jupyter notebook and related open-source tools. Source: Juliette Taka and Nicolas M. Thiery. Publishing reproducible logbooks explainer comic strip. Zenodo. DOI: 10.5281/zenodo.4421040 (2018).
Rules for Reproducible Workflow
Document the process, not just the results — “… make sure to document all your explorations, even (or perhaps especially) those that led to dead ends. These comments will help you remember what you did and … why you chose a particular parameter value, where you copied a block of code from, or what you found interesting about an intermediate result.”
For an example, see Section.
Use cell divisions to make steps clear — “… try to make each cell in your notebook perform one meaningful step of the analysis that is easy to understand from the code in the cell or the surrounding markdown description. Modularize your code by cells and label the cells with markdown above the cell. Think of each cell as being one paragraph, having one function, or accomplishing one task (for example, create a plot).”
For an example, see Section.
Modularize code — “It is always good practice to avoid duplicate code, but in notebooks, it is especially easy to copy a cell, tweak a few lines, paste the resulting code into a new cell or another notebook, and run it again. This form of experimentation is expedient but makes notebooks difficult to read and nearly impossible to maintain if you want to change the functionality of or fix a bug in the copied code. Instead, wrap code you are about to copy and reuse in a function, which you can then call from as many cells as desired.”
For an example, see Section.
Record dependencies — “Rerunning your analysis in the future will require accessing not only your code but also any module or library that your code relied on.”
For an example, see Section.
Use version control — “Version control is a critical adjunct to notebook use because the interactive nature of notebooks makes it easy to accidentally change or delete important content. Furthermore, since notebooks contain code and code inevitably contains bugs, being able to determine the history of when a given bug you have discovered was introduced to the code versus when it was fixed—and thus what analyses it may have affected—is a key capability in scientific computation.”
Git and GitHub are widely used tools for version control (see Ten Simple Rules for Taking Advantage of Git and GitHub; Perez-Riverol et al., 2016) but are known to have a steep learning curve. For version control, we will use the built-in Colab functionality Revision history, which can be accessed from
Design your notebooks to be read, run, and reused — “… store your notebooks in a public code repository with a clear README file and a liberal open source license granting permission to reuse your code.”
Share and explain your data — “Having access to a clearly annotated notebook is of little use to those wanting to reproduce or extend your results if the underlying data are locked away. Strive to make your data or a sample of your data publicly available along with the notebook.”
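The rule on modularizing code can be illustrated with a small sketch. Instead of copying a cell and tweaking a value, the repeated step is wrapped in a function and called from as many cells as needed. The function name `summarize` and the sample values are hypothetical and serve only as an illustration.

```python
from statistics import mean, median


def summarize(values, digits=2):
    """Return rounded summary statistics for a sequence of numbers."""
    return {
        "n": len(values),
        "mean": round(mean(values), digits),
        "median": round(median(values), digits),
    }


# Reuse the same function for different inputs instead of duplicating the code.
group_a = [2.5, 3.0, 4.5]
group_b = [1.0, 1.5, 2.0, 9.5]

print(summarize(group_a))
print(summarize(group_b))
```

If the summary later needs another statistic or a bug fix, the change is made once in `summarize` rather than in every copied cell.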
Reproducible research report
To create a reproducible research report, use throughout the notebook:
Python code in Code cells
# in Code cells to introduce a comment line describing your Python code. Code commenting is a very important part of computational data analysis.
Markdown language in Text cells to write up your methods, results, and interpretation.
At the end of a session, rerun your notebook from top to bottom using Restart and run all (under Runtime in the Colab menu bar) to ensure computational reproducibility.
Throughout the textbook, we will refer to the rules for reproducible research workflow. We illustrate below a few simple rules (e.g., comment your code; use cell division to make your steps clear) with a focus on recording software dependencies, which is a key prerequisite for computational reproducibility.
Reproducing your data analysis in the future will require reusing not only your data and code but also any module and library as well as their respective versions that you employed in your code. It is a good practice to record those dependencies so that others or your future self (i.e., you in a month’s time) can recreate the environment underlying your analysis.
Let’s first determine your Python version. You can check your Python version in a Code cell by typing !python --version. Note the exclamation mark ! in front of the command. Any command that follows the exclamation mark on that line is executed by the underlying operating system rather than by the Python environment. Because this is how you interact with your operating system from the command line, you can think of the exclamation mark ! as introducing the command-line interface. (You can learn more about the command-line interface from this tutorial by The Carpentries.)
# Check Python version
!python --version
As of April 2021, Colab supports Python 3.7.10.
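The same information can also be read from within Python, which reports the interpreter that is actually running the notebook. The sketch below uses only the standard library. The `MINIMUM_VERSION` guard is a hypothetical addition for illustration, not something the notebook environment requires.

```python
import platform
import sys

# Version of the interpreter that is running this notebook, e.g. "3.7.10".
print(platform.python_version())

# The same information as a tuple, which is convenient for comparisons.
print(sys.version_info[:3])

# Hypothetical guard: stop early if the interpreter is older than the analysis expects.
MINIMUM_VERSION = (3, 7)  # assumption for illustration
assert sys.version_info[:2] >= MINIMUM_VERSION, "Python version is older than expected"
```

Recording this output at the top or bottom of a notebook gives readers a quick way to see which Python version produced the results.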
To install and manage Python software modules or libraries, you can use a package-management system such as pip. With pip, you can download and install a specific version of a module or library you plan to use in your data analysis. For example, you can install DoWhy, a library for causal inference, at version 0.6 (released on 03 March 2021) by typing !pip install dowhy==0.6. Note the two consecutive equals signs ==, which tell pip to install exactly that version. In Python code, == is the equality operator, whereas a single equals sign = is the assignment operator.
# Install and import a Python library
!pip install dowhy==0.6
import dowhy
Requirement already satisfied: dowhy==0.6 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (0.6) Requirement already satisfied: scikit-learn in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (0.24.2) Requirement already satisfied: pydot>=1.4 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.4.2) Requirement already satisfied: numpy>=1.15 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.19.2) Requirement already satisfied: networkx>=2.0 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (2.5) Requirement already satisfied: scipy in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.5.2) Requirement already satisfied: pandas>=0.24 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.1.3) Requirement already satisfied: sympy>=1.4 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.6.2) Requirement already satisfied: statsmodels in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (0.12.1)
Requirement already satisfied: decorator>=4.3.0 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from networkx>=2.0->dowhy==0.6) (5.0.9) Requirement already satisfied: numpy>=1.15 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.19.2) Requirement already satisfied: python-dateutil>=2.7.3 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from pandas>=0.24->dowhy==0.6) (2.8.1) Requirement already satisfied: pytz>=2017.2 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from pandas>=0.24->dowhy==0.6) (2020.4) Requirement already satisfied: pyparsing>=2.1.4 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from pydot>=1.4->dowhy==0.6) (2.4.7) Requirement already satisfied: six>=1.5 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas>=0.24->dowhy==0.6) (1.15.0) Requirement already satisfied: joblib>=0.11 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from scikit-learn->dowhy==0.6) (0.17.0) Requirement already satisfied: numpy>=1.15 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.19.2) Requirement already satisfied: scipy in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.5.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from scikit-learn->dowhy==0.6) (2.1.0)
Requirement already satisfied: numpy>=1.15 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.19.2) Requirement already satisfied: scipy in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.5.2) Requirement already satisfied: patsy>=0.5 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from statsmodels->dowhy==0.6) (0.5.1) Requirement already satisfied: pandas>=0.24 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.1.3) Requirement already satisfied: numpy>=1.15 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.19.2) Requirement already satisfied: six>=1.5 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas>=0.24->dowhy==0.6) (1.15.0) Requirement already satisfied: numpy>=1.15 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from dowhy==0.6) (1.19.2) Requirement already satisfied: mpmath>=0.19 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from sympy>=1.4->dowhy==0.6) (1.1.0)
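After installing a pinned version, it can be useful to confirm from Python which version is actually present. The sketch below uses `importlib.metadata` from the standard library, which assumes Python 3.8 or later. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "dowhy" is the library installed above; the result depends on your environment.
print(installed_version("dowhy"))
```

If the printed version differs from the one you pinned, the install step did not take effect in the environment that runs the notebook.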
If many modules and libraries are already preinstalled, as in Colab, you can use pip to record the specific version of those modules and libraries. For example, the command !pip freeze returns installed modules and libraries listed in alphabetical order.
# List installed packages - listed are both packages available with Colab and packages we have installed in this session (e.g. the library DoWhy)
!pip freeze
alabaster==0.7.12 anaconda-client==1.7.2 anaconda-navigator==1.10.0 anaconda-project==0.8.3 anyio==3.1.0 applaunchservices==0.2.1 appnope @ file:///opt/concourse/worker/volumes/live/5f13e5b3-5355-4541-5fc3-f08850c73cf9/volume/appnope_1606859448618/work appscript @ file:///opt/concourse/worker/volumes/live/50ca4c96-3090-40bb-6981-3a6114ed0af4/volume/appscript_1594840187551/work argh==0.26.2 argon2-cffi @ file:///opt/concourse/worker/volumes/live/59af29ac-4890-416e-7ab7-794f8d6f7ecd/volume/argon2-cffi_1596828548321/work asn1crypto @ file:///tmp/build/80754af9/asn1crypto_1596577642040/work astroid @ file:///opt/concourse/worker/volumes/live/21fd14a9-2a7e-484b-7394-5a9912cdcf80/volume/astroid_1592498459180/work astropy @ file:///opt/concourse/worker/volumes/live/37fbd2b2-7bed-485f-777f-7939107df919/volume/astropy_1606922928626/work async-generator==1.10 atomicwrites==1.4.0 attrs @ file:///tmp/build/80754af9/attrs_1604765588209/work autopep8 @ file:///tmp/build/80754af9/autopep8_1596578164842/work Babel @ file:///tmp/build/80754af9/babel_1605108370292/work backcall==0.2.0 backports.functools-lru-cache @ file:///tmp/build/80754af9/backports.functools_lru_cache_1605305165209/work backports.shutil-get-terminal-size==1.0.0 backports.tempfile==1.0 backports.weakref==1.0.post1 beautifulsoup4 @ file:///tmp/build/80754af9/beautifulsoup4_1601924105527/work bio==0.0.6 biopython==1.78 bitarray @ file:///opt/concourse/worker/volumes/live/fdfca23e-4dd8-48f7-512d-c4f3db552eeb/volume/bitarray_1605065128338/work bkcharts==0.2 bleach @ file:///tmp/build/80754af9/bleach_1600439572647/work bokeh @ file:///opt/concourse/worker/volumes/live/b2253281-9b72-4dcb-624e-e22924b50435/volume/bokeh_1603297849453/work boto==2.49.0 Bottleneck==1.3.2 brotlipy==0.7.0 certifi==2020.12.5 cffi @ file:///opt/concourse/worker/volumes/live/730e9a28-66f9-4e03-51ad-252ec8e40d81/volume/cffi_1606255126408/work chardet @ 
file:///opt/concourse/worker/volumes/live/a5a7c56b-23cb-471d-7404-f4130b8aed33/volume/chardet_1605303184529/work click==7.1.2 cloudpickle @ file:///tmp/build/80754af9/cloudpickle_1598884132938/work clyent==1.2.2 colorama @ file:///tmp/build/80754af9/colorama_1603211150991/work conda==4.9.2 conda-build==3.20.5 conda-package-handling @ file:///opt/concourse/worker/volumes/live/a7e34989-4c54-4cb6-4156-4e58ee270730/volume/conda-package-handling_1603018121300/work conda-verify==3.4.2 contextlib2==0.6.0.post1 cryptography @ file:///opt/concourse/worker/volumes/live/62f9527a-361c-47e5-5709-227a4a523add/volume/cryptography_1605544472602/work cycler==0.10.0 Cython @ file:///opt/concourse/worker/volumes/live/c7485e3f-2096-4fd2-7e22-acdb1fbaa2c6/volume/cython_1605457627467/work cytoolz==0.11.0 dask @ file:///tmp/build/80754af9/dask-core_1602083700509/work decorator==5.0.9 defusedxml==0.6.0 delayed==0.11.0b1 diff-match-patch @ file:///tmp/build/80754af9/diff-match-patch_1594828741838/work distributed @ file:///opt/concourse/worker/volumes/live/bd66aa48-5cf5-4b60-6ed4-f204fff153f6/volume/distributed_1605066538557/work docopt==0.6.2 docutils==0.16 dowhy==0.6 entrypoints==0.3 et-xmlfile==1.0.1 fastcache==1.1.0 filelock==3.0.12 flake8 @ file:///tmp/build/80754af9/flake8_1601911421857/work Flask==1.1.2 fsspec @ file:///tmp/build/80754af9/fsspec_1602684995936/work funcy==1.15 future==0.18.2 gensim==3.8.3 gevent @ file:///opt/concourse/worker/volumes/live/e6b243ce-c4b8-40bb-4934-ef3bf1c512f2/volume/gevent_1601397552921/work ghp-import==2.0.1 gitdb==4.0.7 GitPython==3.1.14 glob2==0.7 gmpy2==2.0.8 greenlet @ file:///opt/concourse/worker/volumes/live/02d5d57d-1f11-4cf9-580a-19e679c78dc9/volume/greenlet_1600874049903/work h5py==2.10.0 HeapDict==1.0.1 hiredis==2.0.0 html5lib @ file:///tmp/build/80754af9/html5lib_1593446221756/work idna @ file:///tmp/build/80754af9/idna_1593446292537/work imageio @ file:///tmp/build/80754af9/imageio_1594161405741/work imagesize==1.2.0 
imbalanced-learn==0.8.0 imblearn==0.0 importlib-metadata @ file:///tmp/build/80754af9/importlib-metadata_1602276842396/work iniconfig @ file:///tmp/build/80754af9/iniconfig_1602780191262/work intervaltree @ file:///tmp/build/80754af9/intervaltree_1598376443606/work ipykernel @ file:///opt/concourse/worker/volumes/live/88f541d3-5a27-498f-7391-f2e50ca36560/volume/ipykernel_1596206680118/work/dist/ipykernel-5.3.4-py3-none-any.whl ipython==7.23.1 ipython-genutils @ file:///tmp/build/80754af9/ipython_genutils_1606773439826/work ipywidgets @ file:///tmp/build/80754af9/ipywidgets_1601490159889/work isort @ file:///tmp/build/80754af9/isort_1602603989581/work itsdangerous==1.1.0 jdcal==1.4.1 jedi==0.18.0 Jinja2==2.11.2 joblib @ file:///tmp/build/80754af9/joblib_1601912903842/work json5==0.9.5 jsonschema @ file:///tmp/build/80754af9/jsonschema_1602607155483/work jupyter==1.0.0 jupyter-book==0.10.2 jupyter-cache==0.4.2 jupyter-client @ file:///tmp/build/80754af9/jupyter_client_1601311786391/work jupyter-console @ file:///tmp/build/80754af9/jupyter_console_1598884538475/work jupyter-contrib-core==0.3.3 jupyter-contrib-nbextensions==0.5.1 jupyter-core @ file:///opt/concourse/worker/volumes/live/e8302867-5dbe-440b-7a37-f61bacc38ed8/volume/jupyter_core_1606148993907/work jupyter-highlight-selected-word==0.2.0 jupyter-latex-envs==1.4.6 jupyter-nbextensions-configurator==0.4.1 jupyter-server==1.8.0 jupyter-sphinx==0.3.1 jupyterbook-latex==0.2.0 jupyterlab==3.0.0 jupyterlab-pygments @ file:///tmp/build/80754af9/jupyterlab_pygments_1601490720602/work jupyterlab-server==2.5.2 jupyterlab-spellchecker==0.5.2 jupytext==1.10.3 keyring @ file:///opt/concourse/worker/volumes/live/54fc3ec2-338b-44f5-5e13-d62afa6b5820/volume/keyring_1601490916376/work kiwisolver @ file:///opt/concourse/worker/volumes/live/b8936fa6-0e4b-47e7-4fb4-e02dbd4505ee/volume/kiwisolver_1604014598721/work latexcodec==2.0.1 lazy-object-proxy==1.4.3 libarchive-c==2.9 linkify-it-py==1.0.1 llvmlite==0.34.0 locket==0.2.0 
lxml @ file:///opt/concourse/worker/volumes/live/0c49af63-83fd-4e70-550a-65ad2757eabb/volume/lxml_1606516849441/work markdown-it-py==0.6.2 MarkupSafe @ file:///opt/concourse/worker/volumes/live/cb778296-98db-45ad-411e-6f726e102dc3/volume/markupsafe_1594371638608/work matplotlib @ file:///opt/concourse/worker/volumes/live/f7797860-f8aa-410c-4a56-72315954816b/volume/matplotlib-base_1603378002957/work matplotlib-inline==0.1.2 mccabe==0.6.1 mdit-py-plugins==0.2.6 mistune @ file:///opt/concourse/worker/volumes/live/95802d64-d39c-491b-74ce-b9326880ca54/volume/mistune_1594373201816/work mkl-fft==1.2.0 mkl-random==1.1.1 mkl-service==2.3.0 mock==4.0.2 more-itertools @ file:///tmp/build/80754af9/more-itertools_1605111547926/work mpmath==1.1.0 msgpack==1.0.0 multipledispatch==0.6.0 myst-nb==0.12.0 myst-parser==0.13.6 navigator-updater==0.2.1 nbclassic==0.3.1 nbclient @ file:///tmp/build/80754af9/nbclient_1602783176460/work nbconvert==5.6.1 nbdime==2.1.0 nbformat @ file:///tmp/build/80754af9/nbformat_1602783287752/work nest-asyncio @ file:///tmp/build/80754af9/nest-asyncio_1606153767164/work nested-lookup==0.2.22 networkx @ file:///tmp/build/80754af9/networkx_1598376031484/work netwulf==0.1.5 nltk @ file:///tmp/build/80754af9/nltk_1592496090529/work nose @ file:///tmp/build/80754af9/nose_1606773131901/work notebook @ file:///opt/concourse/worker/volumes/live/be0f3504-189d-4bae-4e57-c5d6da73ffcd/volume/notebook_1601501605350/work numba @ file:///opt/concourse/worker/volumes/live/ae24c1ca-d916-4043-5919-a843fa33e451/volume/numba_1600084276085/work numexpr==2.7.1 numpy @ file:///opt/concourse/worker/volumes/live/5572694e-967a-4c0c-52cf-b53d43e72de9/volume/numpy_and_numpy_base_1603491881791/work numpydoc @ file:///tmp/build/80754af9/numpydoc_1605117425582/work olefile==0.46 openpyxl @ file:///tmp/build/80754af9/openpyxl_1598113097404/work packaging @ file:///tmp/build/80754af9/packaging_1606930849755/work pandas @ 
file:///opt/concourse/worker/volumes/live/f14cf8c4-c564-4eff-4b17-158e90dbf88a/volume/pandas_1602088128240/work pandocfilters @ file:///opt/concourse/worker/volumes/live/c330e404-216d-466b-5327-8ce8fe854d3a/volume/pandocfilters_1605120442288/work parso==0.8.2 partd==1.1.0 path @ file:///opt/concourse/worker/volumes/live/fcdf620c-46d6-4284-4c1e-5b8c3bc6c5c6/volume/path_1596907417277/work pathlib2 @ file:///opt/concourse/worker/volumes/live/de518564-0d9f-405e-472b-38136f0c2169/volume/pathlib2_1594381084269/work pathtools==0.1.2 patsy==0.5.1 pep8==1.7.1 pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work pigar==0.10.0 Pillow @ file:///opt/concourse/worker/volumes/live/991b9a87-3372-4acd-45f9-eaa52701f03c/volume/pillow_1603822262543/work pipreqs==0.4.10 pipreqsnb==0.2.3 pkginfo==1.6.1 plac==1.2.0 pluggy==0.13.1 ply==3.11 prometheus-client @ file:///tmp/build/80754af9/prometheus_client_1606344362066/work prompt-toolkit==3.0.18 psutil @ file:///opt/concourse/worker/volumes/live/ff72f822-991c-4030-4f3a-8c41d3ac4e4f/volume/psutil_1598370232375/work ptyprocess==0.7.0 py @ file:///tmp/build/80754af9/py_1593446248552/work pybtex==0.24.0 pybtex-docutils==1.0.0 pycairo==1.20.0 pycodestyle==2.6.0 pycosat==0.6.3 pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work pycurl==18.104.22.168 pydata-sphinx-theme==0.4.3 pydocstyle @ file:///tmp/build/80754af9/pydocstyle_1598885001695/work pydot==1.4.2 pyerf==1.0.1 pyerfa @ file:///opt/concourse/worker/volumes/live/5caffc18-53e2-4c2a-5220-6f94c6152218/volume/pyerfa_1606860213217/work pyflakes==2.2.0 pygal==2.4.0 pygal-maps-world==1.0.2 Pygments==2.9.0 pyLDAvis==2.1.2 pylint @ file:///opt/concourse/worker/volumes/live/ed0164b6-bcc7-4f6b-7dd4-ad89660b5dcb/volume/pylint_1598624018129/work pyodbc===4.0.0-unsupported pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1606517880428/work pyparsing==2.4.7 pyrsistent @ 
file:///opt/concourse/worker/volumes/live/ff11f3f0-615b-4508-471d-4d9f19fa6657/volume/pyrsistent_1600141727281/work PySocks @ file:///opt/concourse/worker/volumes/live/85a5b906-0e08-41d9-6f59-084cee4e9492/volume/pysocks_1594394636991/work pytest==0.0.0 python-dateutil==2.8.1 python-igraph==0.7.1.post7 python-jsonrpc-server @ file:///tmp/build/80754af9/python-jsonrpc-server_1594397536060/work python-language-server @ file:///opt/concourse/worker/volumes/live/5f0313b4-ff69-4d9d-5c31-e01b223dabd6/volume/python-language-server_1594161914367/work python-louvain==0.14 pytz @ file:///tmp/build/80754af9/pytz_1606604771399/work pywaffle==0.6.1 PyWavelets @ file:///opt/concourse/worker/volumes/live/ea36e10f-66e8-43ae-511e-c4092764493f/volume/pywavelets_1601658378672/work PyYAML==5.3.1 pyzmq==20.0.0 QDarkStyle==2.8.1 QtAwesome @ file:///tmp/build/80754af9/qtawesome_1602272867890/work qtconsole @ file:///tmp/build/80754af9/qtconsole_1600870028330/work QtPy==1.9.0 redis==3.5.3 regex @ file:///opt/concourse/worker/volumes/live/c84a6349-3315-46e4-634b-b5582dea058b/volume/regex_1606691109605/work requests @ file:///tmp/build/80754af9/requests_1606691187061/work retrying==1.3.3 rope @ file:///tmp/build/80754af9/rope_1602264064449/work rpy2==3.3.6 Rtree==0.9.4 ruamel-yaml==0.15.87 scikit-image==0.17.2 scikit-learn==0.24.2 scikits.bootstrap==1.0.1 scipy @ file:///opt/concourse/worker/volumes/live/851446f6-a052-41c4-4243-67bb78999b49/volume/scipy_1604596178167/work seaborn @ file:///tmp/build/80754af9/seaborn_1600553570093/work Send2Trash==1.5.0 simplegeneric==0.8.1 simplejson==3.17.2 singledispatch @ file:///tmp/build/80754af9/singledispatch_1602523705405/work six @ file:///opt/concourse/worker/volumes/live/5b31cb27-1e37-4ca5-6e9f-86246eb206d2/volume/six_1605205320872/work smart-open==4.1.2 smmap==4.0.0 sniffio==1.2.0 snowballstemmer==2.0.0 sortedcollections==1.2.1 sortedcontainers @ file:///tmp/build/80754af9/sortedcontainers_1606865132123/work soupsieve==2.0.1 Sphinx @ 
file:///tmp/build/80754af9/sphinx_1597428793432/work sphinx-book-theme==0.0.42 sphinx-comments==0.0.3 sphinx-copybutton==0.3.1 sphinx-panels==0.5.2 sphinx-thebe==0.0.8 sphinx-togglebutton==0.2.3 sphinxcontrib-applehelp==1.0.2 sphinxcontrib-bibtex==2.1.4 sphinxcontrib-devhelp==1.0.2 sphinxcontrib-htmlhelp==1.0.3 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.3 sphinxcontrib-serializinghtml==1.1.4 sphinxcontrib-websupport @ file:///tmp/build/80754af9/sphinxcontrib-websupport_1597081412696/work spyder @ file:///opt/concourse/worker/volumes/live/93f52c11-6bc0-49a8-541e-aa5e1de1eadc/volume/spyder_1599056974853/work spyder-kernels @ file:///opt/concourse/worker/volumes/live/b4ec5b57-5b3c-42d0-7731-c0691f88ee81/volume/spyder-kernels_1599056790993/work SQLAlchemy @ file:///opt/concourse/worker/volumes/live/0214475e-3c0a-49a9-6cb8-ab2d5c945bef/volume/sqlalchemy_1603812264100/work stargazer==0.0.5 statsmodels @ file:///opt/concourse/worker/volumes/live/8cc21252-fe82-4d91-6eab-9ca11d929cbf/volume/statsmodels_1606865746867/work sympy @ file:///opt/concourse/worker/volumes/live/d5d0b33b-5c2f-493b-5b67-8149e5531868/volume/sympy_1605119535834/work tables==3.6.1 tblib @ file:///tmp/build/80754af9/tblib_1597928476713/work termcolor==1.1.0 terminado==0.9.1 testpath==0.4.4 threadpoolctl @ file:///tmp/tmp9twdgx9k/threadpoolctl-2.1.0-py3-none-any.whl tifffile==2020.10.1 toml @ file:///tmp/build/80754af9/toml_1592853716807/work toolz @ file:///tmp/build/80754af9/toolz_1601054250827/work tornado==6.1 tqdm @ file:///tmp/build/80754af9/tqdm_1605303662894/work traitlets @ file:///tmp/build/80754af9/traitlets_1602787416690/work typing-extensions @ file:///tmp/build/80754af9/typing_extensions_1598376058250/work tzlocal==2.1 uc-micro-py==1.0.1 ujson==1.35 unicodecsv==0.14.1 urllib3 @ file:///tmp/build/80754af9/urllib3_1603305693037/work watchdog @ file:///opt/concourse/worker/volumes/live/cc0ee7bb-1065-44c4-5867-0fd5d13729e0/volume/watchdog_1593447373245/work watermark==2.2.0 wcwidth @ 
file:///tmp/build/80754af9/wcwidth_1593447189090/work webencodings==0.5.1 websocket-client==1.0.1 webweb==0.0.37 Werkzeug==1.0.1 widgetsnbextension==3.5.1 wrapt==1.11.2 wurlitzer @ file:///opt/concourse/worker/volumes/live/01a17f3d-eafe-4806-57a1-4b9ef5d1815f/volume/wurlitzer_1594753845129/work xlrd==1.2.0 XlsxWriter @ file:///tmp/build/80754af9/xlsxwriter_1602692860603/work xlwings==0.20.8 xlwt==1.3.0 xmltodict==0.12.0 yapf @ file:///tmp/build/80754af9/yapf_1593528177422/work yarg==0.1.9 zict==2.0.0 zipp @ file:///tmp/build/80754af9/zipp_1604001098328/work zope.event==4.5.0 zope.interface @ file:///opt/concourse/worker/volumes/live/de428e3b-00ba-4161-442e-b9e5d25e4219/volume/zope.interface_1602002489816/work
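A similar list can be assembled from within Python. The sketch below is not a replacement for pip freeze and may differ from its output. It assumes Python 3.8 or later and uses `importlib.metadata.distributions` to collect name and version pairs. The function name `frozen_requirements` is hypothetical.

```python
from importlib.metadata import distributions


def frozen_requirements():
    """Return 'name==version' strings for installed packages, sorted by name."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        package_version = dist.metadata.get("Version")
        if name and package_version:  # skip entries with incomplete metadata
            pins.add(f"{name}=={package_version}")
    return sorted(pins, key=str.lower)


requirements = frozen_requirements()
print(len(requirements), "packages found")

# To save the list for later, uncomment the next two lines
# (the filename follows the common convention).
# with open("requirements.txt", "w") as handle:
#     handle.write("\n".join(requirements) + "\n")
```

Unlike the output above, this sketch always reports a version number rather than a local file path, which can make the list easier to reuse on another machine.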
You can access the version information about a particular package (e.g., numpy or pandas) by passing the output of pip freeze to grep. grep is a command-line tool that searches for a pattern (in our case, the word numpy and the word pandas) and prints each line that matches the pattern.
# Return installed libraries that contain the text 'numpy' or 'pandas' in their name
!pip freeze | grep numpy
!pip freeze | grep pandas
numpy @ file:///opt/concourse/worker/volumes/live/5572694e-967a-4c0c-52cf-b53d43e72de9/volume/numpy_and_numpy_base_1603491881791/work numpydoc @ file:///tmp/build/80754af9/numpydoc_1605117425582/work
pandas @ file:///opt/concourse/worker/volumes/live/f14cf8c4-c564-4eff-4b17-158e90dbf88a/volume/pandas_1602088128240/work
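The filtering that grep performs for a plain-text pattern can be sketched in a few lines of Python. The function below is a simplified stand-in, not grep itself, and the sample lines are illustrative. In particular, the numpydoc version is made up for the example.

```python
def grep(lines, pattern):
    """Return the lines that contain the given text, as grep does for a plain pattern."""
    return [line for line in lines if pattern in line]


# Illustrative 'pip freeze'-style lines; the numpydoc version is made up for the example.
freeze_output = [
    "dowhy==0.6",
    "numpy==1.19.5",
    "numpydoc==1.1.0",
    "pandas==1.2.0",
]

print(grep(freeze_output, "numpy"))   # prints ['numpy==1.19.5', 'numpydoc==1.1.0']
print(grep(freeze_output, "pandas"))  # prints ['pandas==1.2.0']
```

As in the output above, searching for numpy also returns numpydoc, because the pattern matches any line that contains the text.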
As of April 2021, Colab uses numpy 1.19.5 and pandas 1.2.0. It is possible to install and import a different version, for example pandas 1.2.4 (the latest version at the time of writing, released on 12 April 2021), by executing the following command:
!pip install pandas==1.2.4
import pandas as pd
For the pip approach to recording package dependencies to work, we need to keep track of every package we use in a notebook. This may not always be the case, leaving open the possibility of undocumented dependencies in our notebook (when we fail to document one or more packages on which our data analysis depends). To address this, you can use a notebook extension such as watermark, which prints the versions of the packages that are explicitly imported in your notebook.
# Install the watermark extension
!pip install watermark
# Load the watermark extension
%load_ext watermark
# Show packages that were imported
%watermark --iversions
Requirement already satisfied: watermark in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (2.2.0) Requirement already satisfied: ipython in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from watermark) (7.23.1) Requirement already satisfied: jedi>=0.16 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (0.18.0) Requirement already satisfied: traitlets>=4.2 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (5.0.5) Requirement already satisfied: pygments in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (2.9.0) Requirement already satisfied: setuptools>=18.5 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (57.0.0) Requirement already satisfied: pickleshare in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (0.7.5) Requirement already satisfied: matplotlib-inline in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (0.1.2) Requirement already satisfied: appnope in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (0.1.2)
Requirement already satisfied: pexpect>4.3 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (4.8.0) Requirement already satisfied: backcall in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (0.2.0) Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (3.0.18) Requirement already satisfied: decorator in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (5.0.9) Requirement already satisfied: parso<0.9.0,>=0.8.0 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from jedi>=0.16->ipython->watermark) (0.8.2) Requirement already satisfied: traitlets>=4.2 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (5.0.5) Requirement already satisfied: ptyprocess>=0.5 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from pexpect>4.3->ipython->watermark) (0.7.0) Requirement already satisfied: wcwidth in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython->watermark) (0.2.5) Requirement already satisfied: ipython-genutils in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from traitlets>=4.2->ipython->watermark) (0.2.0)
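To give a rough idea of what listing imported packages involves, the sketch below scans a namespace for imported modules and collects their `__version__` attributes. This is not how watermark is implemented; it is a simplified illustration of similar logic. The function name `imported_versions` and the stand-in module `fake_library` are hypothetical.

```python
import types


def imported_versions(namespace):
    """Map imported module names to their __version__ strings, where available."""
    versions = {}
    for value in list(namespace.values()):
        if isinstance(value, types.ModuleType):
            module_version = getattr(value, "__version__", None)
            if isinstance(module_version, str):
                versions[value.__name__] = module_version
    return versions


# Hypothetical stand-in for an imported library, so the sketch runs without extra installs.
fake_library = types.ModuleType("fake_library")
fake_library.__version__ = "1.2.3"

print(imported_versions({"fl": fake_library, "x": 42}))  # prints {'fake_library': '1.2.3'}
```

In a notebook, you could call `imported_versions(globals())` after your import cells. Modules that do not define `__version__` are skipped, which is one reason a dedicated tool is more reliable than this sketch.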
Once you identify the versions of all dependencies in your notebook, it is good practice to list them at the bottom of your notebook. In addition, the information is often stored in a requirements.txt file in a GitHub repository. The file simply lists all of your package dependencies and their versions in the following format:
IPython==5.5.0
pandas==1.2.0
dowhy==0.6
The dependency file can then be used by tools like Binder to build a Docker container that bundles the same packages and versions you used in your data analysis, making your code immediately reproducible by others irrespective of their computational environment (e.g., operating system or software versions). Reproducing your computational environment is a precondition for others to be able to reproduce your data analysis.
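The pinned format of the dependency file is simple enough to read back with a few lines of Python, for instance to compare the recorded versions with those installed in a new environment. The sketch below handles only exact `name==version` pins and ignores any other requirement syntax. The function name `parse_pins` is hypothetical.

```python
def parse_pins(text):
    """Parse 'name==version' lines into a dict, ignoring blanks and comments."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        name, separator, pinned_version = line.partition("==")
        if separator:  # keep only exactly pinned requirements
            pins[name.strip().lower()] = pinned_version.strip()
    return pins


# The example dependency file shown above.
requirements_text = """\
IPython==5.5.0
pandas==1.2.0
dowhy==0.6
"""

pins = parse_pins(requirements_text)
print(pins)  # prints {'ipython': '5.5.0', 'pandas': '1.2.0', 'dowhy': '0.6'}
```

Package names are lower-cased here because pip treats them as case-insensitive, so IPython and ipython refer to the same package.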