Textbook outline¶
The textbook provides an accessible hands-on introduction to data science techniques, skills, and workflows necessary to perform open, reproducible, and ethical data analysis. In the textbook, you will study research problems of real-world relevance, such as vaccine hesitancy and the impact of COVID-19 lockdown measures on human mobility. You will use real-world social data, including large-scale anonymised mobility data from digital sources and recent COVID-19 survey data.
You will engage with open reproducible data science workflows using open-source and free computational tools, including the Python programming language, Jupyter notebooks in the cloud, Markdown, version control, and the Open Science Framework. No software installation or setup is required as you will use cloud computing. Specifically, you will use Jupyter notebooks on your laptop or tablet (or even smartphone) via free cloud environments such as Colaboratory (Colab for short) or MyBinder. Fortunately, the Python data science community has developed an open-source ecosystem of libraries for data analysis, statistical modelling, machine learning, and network analysis, including pandas, seaborn, scikit-learn, statsmodels, and NetworkX. This suite of libraries will allow us to analyse, visualise, and model data with ease and with minimal programming requirements, while focusing on reproducible analysis of social data.
Most of all, this textbook aims to contribute to inclusive, diverse, and supportive learning.
Key themes¶
Throughout the textbook, you will learn, in an accessible way, practical data science skills, including data wrangling, clustering, resampling, and visualisation of various data sources as well as applications of techniques from machine learning (e.g., cross-validation to mitigate the risk of overfitting), causal inference (e.g., causal graphs to detect confounding), and network analysis (e.g., community detection to discover tightly knit communities). The content is organised around four foundational data science tasks: (i) data preprocessing, (ii) description (including exploratory data analysis and visualisation), (iii) prediction, and (iv) causal inference (which includes counterfactuals). Attention is given to model evaluation and problems of overfitting, selection bias, confounding, and computational reproducibility. Issues of data ethics, privacy, and fairness of data science models are discussed throughout the textbook.
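To make the cross-validation idea concrete, here is a minimal sketch using scikit-learn. It is not taken from the textbook's notebooks: the data are synthetic and the choice of a decision tree is an assumption made purely for illustration. The point is only that a score computed on the training data can look far better than a score computed on held-out folds.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up for illustration): a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# An unconstrained tree is flexible enough to memorise the training data,
# so its R^2 on the data it was fitted on is misleadingly high.
deep_tree = DecisionTreeRegressor(random_state=0)
deep_tree.fit(X, y)
train_r2 = deep_tree.score(X, y)

# 5-fold cross-validation: each fold is scored by a model fitted on the
# other four folds, so the score reflects performance on unseen data.
cv_r2 = cross_val_score(
    DecisionTreeRegressor(random_state=0), X, y, cv=5
).mean()

# A shallower (less flexible) tree, evaluated in the same way for comparison.
shallow_cv_r2 = cross_val_score(
    DecisionTreeRegressor(max_depth=3, random_state=0), X, y, cv=5
).mean()

print(f"Training R^2 (deep tree):        {train_r2:.2f}")
print(f"Cross-validated R^2 (deep tree): {cv_r2:.2f}")
print(f"Cross-validated R^2 (depth 3):   {shallow_cv_r2:.2f}")
```

The gap between the training score and the cross-validated score is one symptom of overfitting; comparing cross-validated scores across models is a common way to choose between them.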
The textbook teaches you how to critically evaluate data and the biases intrinsic to real-world and real-time data (many of the COVID-19 data sets we use are updated daily). Instead of looking for ‘positive’ results and ‘statistically significant’ relationships as a way of finding order in often disorderly data, you will learn an open and reproducible workflow. In this workflow, you will describe your steps throughout the research process (not just your final results), make your parameter choices transparent, and document in your notebooks the results you have obtained, however ‘useful’ or ‘(un)expected’ they may seem.
Prerequisites¶
Prior knowledge of programming is not required as coding for data analysis will be taught from first principles. A background in mathematics or statistics is not required beyond basic algebra and descriptive statistics.
Who this textbook is for¶
The textbook is intended for students in the social sciences, public health, and related fields who want to study real-world problems using diverse data sets but lack data science knowledge and coding skills.
What is Reproducible Data Science?¶
The textbook is designed around an understanding of data science as the use of coding to draw conclusions from diverse data sets by solving five classes of tasks (see Ani Adhikari and John DeNero, 2020; Hernán et al. 2019):
Data preprocessing — preparing data for analysis using techniques for data cleaning, data wrangling, and data transformation.
Description — discovering patterns in data using exploratory data analysis, visualisation, and automated discovery techniques.
Prediction — using information about outcomes we know to make informed guesses about unknown outcomes by applying techniques from simple regression to (supervised) machine learning.
Causal data analysis — studying cause-and-effect questions via the application of causal graphs, counterfactuals, and causal inference techniques.
Inference — quantifying our degree of certainty to determine whether what we find in our data will hold among different scenarios using resampling methods and related techniques.
This textbook does not cover a single data science task in detail but introduces you to these tasks with a focus on real-world data and applications, hands-on computation, and reproducible data analysis.
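As an illustration of the inference task, the following is a minimal sketch of one resampling method, the percentile bootstrap, written with NumPy. The function name `bootstrap_ci`, its parameters, and the sample values are hypothetical and made up for this example; they are not taken from the textbook's data or code.

```python
import numpy as np

def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw resamples of the same size as the data, with replacement,
    # and compute the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    resampled_means = values[idx].mean(axis=1)
    # The interval is read off the quantiles of the resampled means.
    alpha = (1 - level) / 2
    low, high = np.quantile(resampled_means, [alpha, 1 - alpha])
    return low, high

# Made-up sample, for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.0, 5.6]
low, high = bootstrap_ci(sample)
print(f"Sample mean: {np.mean(sample):.2f}")
print(f"95% bootstrap interval for the mean: ({low:.2f}, {high:.2f})")
```

The width of the interval gives a rough sense of how much the estimate could vary if the data had been sampled again; fixing `seed` keeps the result reproducible across runs.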
In a typical data science lifecycle, we begin with a research question and then select our data set(s), preprocess the data, perform descriptive analysis to explore basic features of our data, and model our data to predict an outcome or estimate a causal effect. Throughout the data science lifecycle, transparency of the research process and computational reproducibility are essential.
This understanding of data science is inspired by many, including UC Berkeley’s data science program, particularly the courses Data 8: Foundations of Data Science and Data 100: Principles and Techniques of Data Science, and the associated textbooks, open educational resources, and communities; Matthew Salganik’s book Bit by Bit: Social Research in the Digital Age; the Summer Institutes in Computational Social Science by Chris Bail and Matt Salganik and associated learning resources; Ramesh Johari’s course at Stanford University, Fundamentals of Data Science; and The Turing Way Community and their Turing Way: A Handbook for Reproducible Data Science. Finally, UC Berkeley’s 2020 Workshop on Data Science Education was instrumental in building both my vision and toolkit for democratising data science.
Learning objectives¶
By the end of the module, you will be able to:
Freely use computational tools — Python, Jupyter, Markdown — in the cloud to perform and report basic data analysis.
Wrangle, explore, visualize, and model tabular and network data using Python libraries.
Build a transparent and reproducible research workflow ranging from data loading to research report.
Perform, critically interpret, and openly communicate research process and results from analysis using basic models for machine learning and causal inference.
Identify and deal with issues of overfitting, selection bias, and confounding.
Articulate and address issues of data ethics and fairness of data science models in the social domain.
Write clean, reusable, and reproducible code in Python.
Share your work and collaborate on research projects with others.
Learning resources¶
To accommodate students’ different styles of learning, I have assembled a range of resources, from books and articles to short video lectures and tutorials on coding and data analysis. You are welcome to focus on the learning materials you personally find the most helpful. Listed below are some of the key readings and learning resources that can help you get started, and throughout the textbook I point to particular learning resources that are directly relevant to the lab’s topic.
Note
We distinguish four categories of learning resources, each marked with its own icon: articles, books, videos, and tutorials. The categories are not mutually exclusive.
Foster, I., Ghani, R., Jarmin, R.S., Kreuter, F., Lane, J. 2020. Big Data and Social Science: Data Science Methods and Tools for Research and Practice (2nd edition). Chapman and Hall/CRC. [Online version freely available]
Book’s Jupyter notebooks with data, code, and practical programming exercises are freely available through Binder and GitHub.
Jake VanderPlas. 2016. Python Data Science Handbook. O’Reilly. [Online version freely available]
Book’s content is freely available in the form of Jupyter notebooks.
McKinney, W. 2017. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (2nd edition). O’Reilly.
Book’s materials and Jupyter notebooks are freely available.
Daniel Chen. 2018. Pandas for Everyone: Python Data Analysis. Addison-Wesley Professional.
Freely available Jupyter notebooks on Pandas.
Introduction to Data Processing in Python with Pandas, SciPy 2019 tutorial by Daniel Chen.
Aurélien Géron. 2019. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. O’Reilly.
Freely available Jupyter notebooks.
Reproducible Data Analysis in Jupyter by Jake VanderPlas.
Machine Learning with Scikit Learn by Jake VanderPlas.
The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia Herterich, Rosie Higman, … Kirstie Whitaker. (2019, March 25). The Turing Way: A Handbook for Reproducible Data Science (Version v0.0.4). Zenodo. http://doi.org/10.5281/zenodo.3233986 [Freely available online guide]
Wasser, L. and Palomino, J. (Updated: September 03, 2020) Introduction to Earth Data Science. [Freely available online textbook]
Matthew Salganik. 2017. Bit by Bit: Social Research in the Digital Age. Princeton University Press. [Online version freely available]
Morgan, S. L. and Winship, C. 2014. Counterfactuals and Causal Inference (2nd edition). Cambridge University Press.
Cunningham, S. 2021. Causal Inference: The Mixtape. Yale University Press. [Online version, including Python code, freely available]
Kelleher, J. and Tierney, B. 2018. Data Science. MIT Press.
Pedro Saleiro, Kit T. Rodolfa, Rayid Ghani. Dealing with Bias and Fairness in Data Science Systems: A Practical Hands-on Tutorial.
Corresponding video tutorial, KDD 2020 tutorial.
Software stack¶
The textbook uses Python. Python is an open-source, freely available, and accessible general-purpose programming language. A great feature of Python (and other open-source programming languages) is the collaborative communities that have developed a diverse ecosystem of powerful libraries and tools for doing data science. Those open-source tools allow us to perform computational data analysis with ease while focusing on understanding our results and on their evaluation and implications. The open-source tools for data analysis we will use the most include pandas for data loading, wrangling, and exploratory data analysis; seaborn and Matplotlib for data visualisation; scikit-learn for prediction, pattern discovery, and other machine learning tasks; and statsmodels for statistical modelling. Many of these tools are built on top of NumPy and SciPy, two foundational libraries for scientific computing in Python.
We write Python code in Jupyter Notebooks. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
We use cloud services to run Jupyter Notebook. The cloud infrastructure provides access to pre-configured data science computational environments. Learners can open a notebook and interact with the code without installing or configuring any software. Specifically, learners can access the notebooks that form part of this learning resource via MyBinder or Colab. MyBinder is a free, public online service that runs Jupyter notebooks in an executable and reproducible environment, enabling interactive data analysis. Colab is a free environment that runs Jupyter notebooks on Google Cloud, enabling interactive data analysis.
Data sets¶
Below is a list of the main data sets we use in the textbook.
Note
We believe in open science and open data, and, therefore, the majority of the textbook uses open data sets. We also believe in responsible use of fine-resolution social data for data science education with the understanding that such fine-resolution data may be safeguarded due to privacy restrictions and risk of disclosure. The textbook, therefore, uses a mixture of (mostly) open data and safeguarded data.
1. Mobility Data from Digital Sources [Open]¶
COVID-19 Google Community Mobility Reports
Aggregated and anonymised mobility trends data that protect individual privacy.
Displays human mobility trends by country and region across categories of places, including retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.
Enables an exploration of changes in mobility trends as a response to non-pharmaceutical public health interventions (e.g., lockdowns, school closure) designed to reduce the spread of COVID-19.
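The kind of exploration described above can be sketched with pandas as follows. This is a hedged illustration, not the textbook's own code: the numbers are made up, the column names are assumed to follow the naming used in the Community Mobility Reports CSV files (check them against the file you download), and the cut-off date is a hypothetical intervention date chosen only to split the toy data.

```python
import pandas as pd

# Toy table with made-up values; column names are assumed to match the
# Community Mobility Reports CSV and should be verified against the real file.
mobility = pd.DataFrame({
    "country_region": ["United Kingdom"] * 6,
    "date": ["2020-03-20", "2020-03-21", "2020-03-22",
             "2020-03-23", "2020-03-24", "2020-03-25"],
    "workplaces_percent_change_from_baseline": [-20, -25, -30, -45, -55, -60],
    "residential_percent_change_from_baseline": [8, 9, 10, 18, 22, 24],
})

# Parse the date strings so that rows can be compared with a timestamp.
mobility["date"] = pd.to_datetime(mobility["date"])

# Hypothetical intervention date used to label each row.
intervention_start = pd.Timestamp("2020-03-23")
mobility["period"] = (mobility["date"] >= intervention_start).map(
    {False: "before", True: "after"}
)

# Average percentage change from baseline, before versus after the cut-off.
value_columns = [
    "workplaces_percent_change_from_baseline",
    "residential_percent_change_from_baseline",
]
summary = mobility.groupby("period")[value_columns].mean()
print(summary)
```

A before/after comparison like this is descriptive only; it does not by itself establish that the intervention caused the change in mobility.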
2. Survey Data [Safeguarded]¶
The Understanding Society: COVID-19 Study
UK survey asking participants about their experiences during the COVID-19 pandemic. We use Wave 6 of the survey.
The data are safeguarded and can be accessed via the UK Data Service.
3. Administrative Data on COVID-19 [Open]¶
Our World in Data (OWID) data on COVID-19 confirmed cases, deaths, hospitalizations, testing, and vaccinations reported by governments and international organizations.
In addition to the above three data sets, we explore in exercises various other data sets related to COVID-19, including the World Health Organisation (WHO) COVID-19 Global Data and Apple’s COVID-19 Mobility Trends Reports.
How to get the most out of the open learning resource?¶
You can read the learning resource in its entirety on this website. You can also view each individual notebook on GitHub by clicking on the respective button below.
To interactively work with the code, you can access the interactive versions of the Jupyter notebooks via the free cloud services MyBinder and Colab.
By clicking on a MyBinder launch button below, you will launch the Jupyter notebooks in a cloud instance. MyBinder will open the notebooks in a reproducible computational environment (i.e., an environment in which the Python packages used in the original notebooks are pre-installed) from where you can interactively run and modify the code in your browser. MyBinder is a free, public cloud service. No setup or login is required to execute the notebooks.
By clicking on an Open in Colab button, you will open a Jupyter notebook on Google Cloud. Once it is open in your browser, you can interactively run and modify the notebook. Colab is a free cloud service that requires no setup. You can view the notebooks without a login, but to execute and modify a notebook, a Google account and a login are required.
Textbook chapter | View on GitHub | Launch on myBinder.org | Open in Colab |
---|---|---|---|
Acknowledgment¶
I am grateful to the many people and communities who helped with discussions, advice, and open-source teaching materials (all errors, of course, are mine):
Matthew Brett
Matthew Salganik
Chris Bail
Rayid Ghani
Scott Cunningham
Mason A Porter
Bernie Hogan
Adam Dennett
Jake VanderPlas
Daniel Chen
James Allen-Robertson
Fernando Pérez
Chris Holdgraf
Sharad Goel
Dani Arribas-Bel
Data Science Education at Berkeley
The Turing Way Community
Earth Lab CU-Boulder
DataCamp for Classrooms
Berkeley Initiative for Transparency in the Social Sciences (BITSS)
Meta-Research Innovation Center at Stanford (METRICS)
The Carpentries
Tiffany Timbers
Ben Marwick
Jens Lechtenbörger
Tom Donoghue
I also thank the participants at the 2021 National Workshop on Data Science Education (organised by UC Berkeley’s Division of Computing, Data Science, and Society); third-year students at the Department of Sociology, University of Essex, who studied elements of the learning resource in the Spring term of the academic years 2020–21 and 2021–22; and students at the Research Transparency and Reproducibility Training (RT2) Virtual 2021, organised by the Berkeley Initiative for Transparency in the Social Sciences (BITSS), for their helpful feedback and kindness. This feedback from a student captures it well: “i was genuinely terrified when the term started and i saw coding and python but this has been great thank you!”. I also thank Kirils Makarovs and Hamid Nejadghorban for assistance with teaching earlier iterations of the learning resource.
Finally, thanks to the open data science Twitter community for helpful feedback, discussions, and pointers to resources.
Code availability¶
All materials, code, and data included in this textbook as well as this textbook website are available as a public GitHub repository at https://github.com/valdanchev/reproducible-data-science-python
License¶
Reproducible Data Science with Python by Valentin Danchev is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.