The textbook provides an accessible hands-on introduction to data science techniques, skills, and workflows necessary to perform open, reproducible, and ethical data analysis. In the textbook, you will study research problems of real-world relevance, such as vaccine hesitancy and the impact of COVID-19 lockdown measures on human mobility. You will use real-world social data, including large-scale anonymised mobility data from digital sources and recent COVID-19 survey data.
You will engage with open, reproducible data science workflows using free and open-source computational tools, including the Python programming language, Jupyter notebooks in the cloud, Markdown, version control, and the Open Science Framework. No software installation or setup is required as you will use cloud computing. Specifically, you will use Jupyter notebooks on your laptop or tablet (or even smartphone) via free cloud environments such as Google Colaboratory (Colab) or JupyterHub. Fortunately, the Python data science community has developed an open-source ecosystem of libraries for data analysis, statistical modelling, machine learning, and network analysis, including NetworkX. These libraries will allow us to analyse, visualise, and model data with ease and with minimal programming requirements, while focusing on reproducible analysis of social data.
Most of all, this textbook aims to contribute to inclusive, diverse, and supportive learning.
Throughout the textbook, you will learn, in an accessible way, practical data science skills, including data wrangling, clustering, resampling, and visualisation of various data sources as well as applications of techniques from machine learning (e.g., cross-validation to mitigate the risk of overfitting), causal inference (e.g., causal graphs to detect confounding), and network analysis (e.g., community detection to discover tightly knit communities). The content is organised around four foundational data science tasks: (i) data preprocessing, (ii) description (and exploratory data analysis and visualisation), (iii) prediction, and (iv) causal inference (which includes counterfactuals). Attention is given to model evaluation and problems of overfitting, selection bias, confounding, and computational reproducibility. Issues of data ethics, privacy, and fairness of data science models are discussed throughout the textbook.
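To make the idea of cross-validation concrete, here is a minimal sketch using NumPy on synthetic data. The data, the helper name `kfold_mse`, and the choice of polynomial degrees are assumptions made for illustration and are not taken from the textbook's notebooks. The sketch fits a simple model and a flexible model, and compares the error on the data used for fitting with the error on held-out folds.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is reproducible

# Synthetic data (made up for illustration): a linear trend plus noise.
x = np.linspace(0, 1, 30)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)


def kfold_mse(x, y, degree, k=5, seed=0):
    """Return (mean training MSE, mean validation MSE) over k folds."""
    indices = np.random.default_rng(seed).permutation(x.size)
    folds = np.array_split(indices, k)
    train_errors, valid_errors = [], []
    for fold in folds:
        train = np.setdiff1d(indices, fold)  # all observations outside this fold
        coefs = np.polyfit(x[train], y[train], deg=degree)
        train_errors.append(np.mean((np.polyval(coefs, x[train]) - y[train]) ** 2))
        valid_errors.append(np.mean((np.polyval(coefs, x[fold]) - y[fold]) ** 2))
    return float(np.mean(train_errors)), float(np.mean(valid_errors))


# Compare a simple model (degree 1) with a flexible one (degree 9).
results = {degree: kfold_mse(x, y, degree) for degree in (1, 9)}
for degree, (train_mse, valid_mse) in results.items():
    print(f"degree {degree}: training MSE = {train_mse:.3f}, validation MSE = {valid_mse:.3f}")
```

The flexible model always fits the training folds at least as closely as the simple one, so training error alone cannot tell us which model to prefer. The validation error, computed on data the model has not seen, is the quantity to compare when guarding against overfitting.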
The textbook teaches you how to critically evaluate data and biases intrinsic to real-world data and real-time data (many of the COVID-19 data sets we use are updated daily). Instead of looking for ‘positive’ results and ‘statistically significant’ relationships as a way of finding order in often disorderly data, you will learn an open and reproducible workflow. In this workflow, you will describe your steps throughout the research process (not just your final results), make transparent choices of parameter selection, and document in your notebooks the results you have obtained, however ‘useful’ or ‘(un)expected’ they may seem.
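One way to put this workflow into practice is to keep every analytic choice in a single place and to save it alongside the results. The sketch below uses only the Python standard library. The parameter names, their values, and the helper `run_analysis` are hypothetical and serve only to illustrate the idea.

```python
import json
import platform
import random
import statistics

# Every choice that affects the result is recorded here and saved with the output.
# These parameter names and values are illustrative, not taken from the textbook.
params = {"seed": 2021, "sample_size": 100, "trim_fraction": 0.1}


def run_analysis(params):
    """Draw a synthetic sample and summarise it using the recorded parameters."""
    rng = random.Random(params["seed"])  # local generator, no hidden global state
    sample = sorted(rng.gauss(0, 1) for _ in range(params["sample_size"]))
    cut = int(params["trim_fraction"] * len(sample))
    trimmed = sample[cut:len(sample) - cut]
    return {"mean": statistics.mean(sample), "trimmed_mean": statistics.mean(trimmed)}


record = {
    "python_version": platform.python_version(),
    "params": params,
    "results": run_analysis(params),  # reported whatever the results look like
}
print(json.dumps(record, indent=2))
```

Because the random seed is part of `params`, running the analysis again with the same record reproduces the same numbers, and changing a parameter leaves a visible trace in the saved output.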
Prior knowledge of programming is not required as coding for data analysis will be taught from first principles. A background in mathematics or statistics is not required beyond basic algebra and descriptive statistics.
Who this textbook is for
The textbook is ideal for students in the social sciences, public health, and related fields who want to study real-world problems using diverse data sets but lack data science knowledge and coding skills.
What is Reproducible Data Science?
The textbook is designed around an understanding of data science as the use of coding to draw conclusions from diverse data sets by solving five classes of tasks (see Ani Adhikari and John DeNero, 2020; Hernán et al. 2019):
Data preprocessing — preparing data for analysis using techniques for data cleaning, data wrangling, and data transformation.
Description — discovering patterns in data using exploratory data analysis, visualisation, and automated discovery techniques.
Prediction — using information about outcomes we know to make informed guesses about unknown outcomes by applying techniques from simple regression to (supervised) machine learning.
Causal data analysis — studying cause-and-effect questions via the application of causal graphs, counterfactuals, and causal inference techniques.
Inference — quantifying our degree of certainty to determine whether what we find in our data will hold among different scenarios using resampling methods and related techniques.
This textbook does not cover a single data science task in detail but introduces you to these tasks with a focus on real-world data and applications, hands-on computation, and reproducible data analysis.
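As a small illustration of the inference task listed above, the following sketch uses resampling (the bootstrap) to quantify uncertainty about a sample mean. It relies only on the Python standard library. The observations are made up, and the helper name `bootstrap_means` is an assumption for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility

# Made-up observations for illustration (e.g., scores from a small survey).
observed = [3, 5, 4, 6, 2, 5, 7, 4, 5, 3, 6, 4]


def bootstrap_means(data, n_resamples=2000):
    """Resample the data with replacement and return the mean of each resample."""
    return [statistics.mean(rng.choices(data, k=len(data))) for _ in range(n_resamples)]


means = sorted(bootstrap_means(observed))
# Percentile interval: the middle 95% of the bootstrap means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]
print(f"sample mean = {statistics.mean(observed):.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

The width of the interval gives a rough sense of how much the mean could vary if we were to collect a new sample of the same size from the same population.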
In a typical data science lifecycle, we begin with a research question, then select our data set(s), preprocess the data, perform descriptive analysis to explore basic features of our data, and finally model our data to predict an outcome or estimate a causal effect. Throughout the data science lifecycle, transparency of the research process and computational reproducibility are essential.
This understanding of data science is inspired by many, including UC Berkeley’s data science program, particularly the courses Data 8: Foundations of Data Science and Data 100: Principles and Techniques of Data Science, and the associated textbooks, open educational resources, and communities; Matthew Salganik’s book Bit by Bit: Social Research in the Digital Age; the Summer Institutes in Computational Social Science by Chris Bail and Matt Salganik and associated learning resources; Ramesh Johari’s course at Stanford University, Fundamentals of Data Science; and The Turing Way Community and their Turing Way: A Handbook for Reproducible Data Science. Finally, UC Berkeley’s 2020 Workshop on Data Science Education was instrumental in building both my vision and toolkit for democratising data science.
By the end of the module, you will be able to:
Freely use computational tools — Python, Jupyter, Markdown — in the cloud to perform and report basic data analysis.
Wrangle, explore, visualize, and model tabular and network data using Python libraries.
Build a transparent and reproducible research workflow ranging from data loading to research report.
Perform, critically interpret, and openly communicate research process and results from analysis using basic models for machine learning and causal inference.
Identify and deal with issues of overfitting, selection bias, and confounding.
Articulate and address issues of data ethics and fairness of data science models in the social domain.
Write clean, reusable, and reproducible code in Python.
Share your work and collaborate on research projects with others.
How to get the most out of these materials?
You can read the materials on this website and learn about various topics in reproducible data science. If you prefer more hands-on coding, you can access the interactive versions of the Python Jupyter notebooks via Binder and Colab.
Binder is a free open online service that lets you open and execute Jupyter notebooks and work with the code interactively.
Colab is a freely available service that allows you to run your Python Jupyter notebook on the Google Cloud, making it possible to interactively write and execute Python code in your browser. Colab differs from Jupyter in some respects (e.g., keyboard shortcuts). The code on this website should work in Colab as early versions of many of the notebooks were designed in Colab (some components were designed for Jupyter Book and may appear differently in Colab). For each topic, you can open the corresponding notebook and work with it interactively.
To accommodate students’ different styles of learning, I have assembled a range of resources, from books and articles to short video lectures and tutorials on coding and data analysis. You are welcome to focus on the learning materials you personally find the most helpful. Listed below are some of the key readings and learning resources that can help you get started, and throughout the textbook I point to particular learning resources that are directly relevant to the lab’s topic.
We distinguish four categories of learning resources: articles, books, videos, and tutorials. The categories are not mutually exclusive.
Foster, I., Ghani, R., Jarmin, R.S., Kreuter, F., Lane, J. 2020. Big Data and Social Science: Data Science Methods and Tools for Research and Practice (2nd edition). Chapman and Hall/CRC. [Online version freely available]
Book’s Jupyter notebooks with data, code, and practical programming exercises are freely available through Binder and GitHub.
McKinney, W. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd edition). O’Reilly.
Book’s materials and Jupyter notebooks are freely available.
Daniel Chen. 2018. Pandas for Everyone: Python Data Analysis. Addison-Wesley Professional.
Freely available Jupyter notebooks on Pandas.
Introduction to Data Processing in Python with Pandas, SciPy 2019 tutorial by Daniel Chen.
Aurélien Géron. 2019. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. O’Reilly.
Freely available Jupyter notebooks.
Reproducible Data Analysis in Jupyter by Jake VanderPlas.
Machine Learning with Scikit Learn by Jake VanderPlas.
The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia Herterich, Rosie Higman, … Kirstie Whitaker. (2019, March 25). The Turing Way: A Handbook for Reproducible Data Science (Version v0.0.4). Zenodo. http://doi.org/10.5281/zenodo.3233986 [Freely available online guide]
Wasser, L. and Palomino, J. (Updated: September 03, 2020) Introduction to Earth Data Science. [Freely available online textbook]
Matthew Salganik. 2017. Bit by Bit: Social Research in the Digital Age. Princeton University Press. [Online version freely available]
Morgan, S. L. and Winship, C. 2014. Counterfactuals and Causal Inference (2nd edition). Cambridge University Press.
Cunningham, S. 2021. Causal Inference: The Mixtape. Yale University Press. [Online version, including Python code, freely available]
Kelleher, J. and Tierney, B. 2018. Data Science. MIT Press.
Pedro Saleiro, Kit T. Rodolfa, Rayid Ghani. Dealing with Bias and Fairness in Data Science Systems: A Practical Hands-on Tutorial.
Corresponding video tutorial, KDD 2020 tutorial.
The textbook uses Python. Python is an open-source, freely available, and accessible general-purpose programming language. A great feature of Python (and other open-source programming languages) is the collaborative communities that have developed a diverse ecosystem of powerful libraries and tools for doing data science. Those open-source tools allow us to perform computational data analysis with ease while focusing on the understanding of our results and on their evaluation and implications. The open-source tools for data analysis we will use the most include
pandas for data loading, wrangling, and exploratory data analysis;
Matplotlib for data visualisation;
scikit-learn for prediction, pattern discovery and other machine learning tasks; and
statsmodels for statistical modelling.
Many of these tools are built on top of NumPy and SciPy, two foundational libraries for scientific computing in Python.
We write Python code in Jupyter Notebook. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
We run our Jupyter notebooks in the cloud. The cloud computing infrastructure gives students access to pre-configured data science computational environments, data, and learning resources without the need for software installation and configuration. Depending on the number of students, available resources, and other factors, students will access their Jupyter notebooks via either JupyterHub or Google Colaboratory. JupyterHub is part of Project Jupyter and gives many users access to Jupyter notebooks at a public URL. Google Colaboratory, or Colab for short, is a free Jupyter notebook environment that runs on the Google Cloud and requires no setup.
Below is a list of the main data sets we use in the textbook.
We believe in open science and open data, and, therefore, the majority of the textbook uses open data sets. We also believe in responsible use of fine-resolution social data for data science education with the understanding that such fine-resolution data may be safeguarded due to privacy restrictions and risk of disclosure. The textbook, therefore, uses a mixture of (mostly) open data and safeguarded data.
1. Mobility Data from Digital Sources [Open]
COVID-19 Google Community Mobility Reports
Aggregated and anonymised mobility trends data that protect individual privacy.
Displays human mobility trends by country and region across categories of places, including retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.
Enables an exploration of changes in mobility trends as a response to non-pharmaceutical public health interventions (e.g., lockdowns, school closure) designed to reduce the spread of COVID-19.
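The kind of exploration described above can be sketched with pandas as follows. The data frame is a tiny made-up extract, not real mobility data. The column names are assumed to mirror the layout of the published reports and should be checked against the actual file, and the cut-off date is an assumption chosen for illustration.

```python
import pandas as pd

# A tiny made-up extract: one row per region and date, one column per place
# category. Column names and values are assumptions for illustration only.
mobility = pd.DataFrame(
    {
        "country_region": ["United Kingdom"] * 4,
        "date": ["2020-03-16", "2020-03-23", "2020-03-30", "2020-04-06"],
        "workplaces_percent_change_from_baseline": [-10, -45, -65, -68],
        "residential_percent_change_from_baseline": [3, 15, 24, 25],
    }
)
mobility["date"] = pd.to_datetime(mobility["date"])

# Intervention date assumed for illustration; it splits the series in two.
lockdown_start = pd.Timestamp("2020-03-23")
mobility["period"] = (mobility["date"] >= lockdown_start).map({False: "before", True: "after"})

# Average change from baseline in each place category, before and after.
category_columns = [c for c in mobility.columns if c.endswith("_percent_change_from_baseline")]
summary = mobility.groupby("period")[category_columns].mean()
print(summary)
```

A before-and-after comparison like this is purely descriptive. Attributing the change to the intervention requires the causal inference tools discussed later in the textbook.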
2. Survey Data [Safeguarded]
UK survey asking participants about their experiences during the COVID-19 pandemic.
The data are safeguarded and can be accessed via the UK Data Service.
3. Administrative Data on COVID-19 [Open]
Our World in Data (OWID) data on COVID-19 confirmed cases, deaths, hospitalizations, testing, and vaccinations reported by governments and international organizations.
In addition to the above three data sets, we explore in exercises various other data sets related to COVID-19, including the World Health Organisation (WHO) COVID-19 Global Data and Apple’s COVID-19 Mobility Trends Reports.
I am grateful to the many people and communities who helped with discussions, advice, and open-source teaching materials (all errors, of course, are mine):
Mason A Porter
Data Science Education at Berkeley
The Turing Way Community
Earth Lab CU-Boulder
DataCamp for Classrooms
Berkeley Initiative for Transparency in the Social Sciences (BITSS)
Meta-Research Innovation Center at Stanford (METRICS)
I used elements of this textbook to teach third-year students at the Department of Sociology, University of Essex, in Spring 2021, and I would like to thank the GTA Kirils Makarovs and all the students for their hard work and kindness. This feedback from a student captures it well: “i was genuinely terrified when the term started and i saw coding and python but this has been great thank you!”
All materials, code, and data included in this textbook as well as this textbook website are available as a public GitHub repository at https://github.com/valdanchev/reproducible-data-science-python
Reproducible Data Science: Accessible Data Analysis with Open Source Python Tools and Real-World Data by Valentin Danchev is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.