Human mobility during UK lockdowns

End-to-End Data Science Project

In this first lab, you will go through the key steps in a data science project, starting from a research question, through data processing and analysis, to results that can potentially inform public policy. Specifically, we will go through four steps in the data science lifecycle:

  • formulate a research question of real-world relevance

  • obtain real-world data, wrangle, and transform data

  • explore and visualise data

  • generate descriptive findings that can inform decision-making

Later on in the course, we will cover in detail the techniques, code, and workflow we use in this session.

Let’s formulate our research question

How has human mobility differed across the three lockdowns in the United Kingdom during the COVID-19 pandemic?

Why is that research question important?

  • concerns many of us

  • is of public health policy relevance, and

  • involves large-scale data analysis requiring modern computational techniques and tools

Data to address the question

We will use real-world, real-time (updated daily) data on human mobility: the Google Covid-19 Community Mobility Reports.

The reports are a large, aggregated, and anonymised data set showing movement trends over time by geography across six categories of places: retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.

Get started with Jupyter/Colab notebook

We will use the Jupyter notebook implemented on the Google Cloud, which is called Google Colaboratory (or Colab for short).

The Jupyter Notebook is a user-friendly, free, open-source web application that allows you to combine live software code, explanatory text, visualisations and model outputs in a single computational notebook.

Colab runs Jupyter notebook on the Google Cloud, allowing you to write and execute Python code in your browser and do scalable data analysis with no setup requirements.

You can learn more about Colab and how to open a new notebook here.

The Python programming language

We will code using the Python programming language.

  • Python is an open-source, free, and easy-to-learn programming language

  • Python is one of the world’s most popular programming languages, with a growing community

  • Python programming skills are in high demand on the job market

  • The Python data science community have developed an ecosystem of fast, powerful, and flexible open source tools for doing data science at scale.

Let’s get coding

The Colab notebook has two types of cells: code and text. You can add new cells by using the + Code and + Text buttons that are in the toolbar above the notebook and also appear when you are between a pair of cells.

Below is a code cell, in which we type in the arithmetic expression 21 + 21.

The code is prefixed by a comment. Commenting your code is good research practice and part of a reproducible workflow. Comments in Python code cells start with a hash symbol (#), conventionally followed by a space and some text. Everything that follows the hash symbol on the same line is treated as a comment and is therefore not evaluated by the Python interpreter. Only the code (in this instance, 21 + 21) is evaluated, and the output (in this instance, 42) is displayed below the code cell.

To execute the cell, press Shift + Enter or click the Play icon on the left.

# Performing a basic arithmetic operation of addition
21 + 21
42

Python reads the code entered in the cell, evaluates it, and prints the result (42).

Python tools for data analysis

The Python data science community have developed an open source ecosystem of libraries for data science.

We will use two main libraries:

  • pandas for data loading, wrangling, and analysis

  • seaborn for data visualisation

Think of these Python libraries as tools that allow you to do data science tasks with ease and minimal programming, while focusing on scalable and reproducible analysis of social data.

We first import the pandas library and, by convention, give it the alias pd.

# Import the pandas library
import pandas as pd

We can now access all the functions and capabilities the pandas library provides.

Load your data

The Google Covid-19 Community Mobility Reports data are provided as a comma-separated values (CSV) file. We load the CSV file into Python using the Pandas function read_csv().

What is a function?

A function is a block of code that:

  • takes input parameters

  • performs a specific task

  • returns an output.
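As a minimal sketch of these three elements, the example below defines a small function of our own. The name percent_change and its parameters are hypothetical and chosen for illustration; they are not part of pandas or of the mobility data.

```python
# Hypothetical helper for illustration only; not part of pandas or the lab data
def percent_change(value, baseline):
    """Return the percentage change of value relative to baseline."""
    # Task: compute the relative difference and express it in percent
    return (value - baseline) / baseline * 100


# Call the function by writing its name followed by parentheses
result = percent_change(75, 100)
print(result)
```

The two numbers inside the parentheses are the input parameters, the calculation is the task, and the number handed back by return is the output.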

The function read_csv() takes a comma-separated values (CSV) file as an input parameter, reads the file, and returns a pandas DataFrame.

We call a function by writing the function name followed by parentheses. The function read_csv() takes many input parameters, for example:

  • sep: the delimiter to use when reading the file; the default is a comma (,), but other possible delimiters include tab or space characters.

  • parse_dates: the columns to be parsed as dates and times.
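To see the two parameters in action without downloading anything, the sketch below reads a tiny tab-separated text held in memory. The data values and the names raw_text and toy are made up for illustration.

```python
import io
import pandas as pd

# Made-up, tab-separated data held in memory instead of a file or URL
raw_text = "date\tworkplaces\n2020-03-24\t-57\n2020-03-25\t-60\n"

# sep tells read_csv which delimiter to expect;
# parse_dates converts the listed column to a datetime type
toy = pd.read_csv(io.StringIO(raw_text), sep="\t", parse_dates=["date"])

print(toy.dtypes)
```

Without parse_dates, the date column would be read as plain text, which makes later date comparisons less reliable.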

Getting help when needed

To learn more about a function in Colab, type a question mark (?) after its name. For example, to access help information about the pandas function read_csv(), type:

pd.read_csv?
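The question mark syntax is a feature of IPython-based environments such as Colab and Jupyter. In a plain Python interpreter, the built-in help() function displays the same documentation. The sketch below shows where that text lives; the variable name doc is our own.

```python
import pandas as pd

# Every documented function carries its help text in the __doc__ attribute
doc = pd.read_csv.__doc__

# Print only the first few lines; help(pd.read_csv) would print the full text
print("\n".join(doc.strip().splitlines()[:3]))
```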

Reading the Google Covid-19 Community Mobility Reports data

To read the Google Covid-19 Community Mobility Reports data, there is no need to download the file to your local computer. We just call the read_csv() function and specify the URL. The code below loads the most recent online version of the data. We also assign the loaded data set to a variable called mobility_trends.

# Loading the Google Covid-19 Community Mobility Reports data from web address (URL)
mobility_trends = pd.read_csv('https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv', parse_dates = ['date']) 

View your data

# Display the top ten rows
mobility_trends.head(10)
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
0 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-15 0.0 4.0 5.0 0.0 2.0 1.0
1 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-16 1.0 4.0 4.0 1.0 2.0 1.0
2 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-17 -1.0 1.0 5.0 1.0 2.0 1.0
3 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-18 -2.0 1.0 5.0 0.0 2.0 1.0
4 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-19 -2.0 0.0 4.0 -1.0 2.0 1.0
5 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-20 -2.0 1.0 6.0 1.0 1.0 1.0
6 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-21 -3.0 2.0 6.0 0.0 -1.0 1.0
7 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-22 -2.0 2.0 4.0 -2.0 3.0 1.0
8 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-23 -1.0 3.0 3.0 -1.0 4.0 1.0
9 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-24 -3.0 0.0 5.0 -1.0 3.0 1.0

Pandas stores data as a DataFrame: a two-dimensional data structure in which variables are in columns and observations are in rows.
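A small hand-made DataFrame can make this layout concrete. The values below are invented for illustration and only borrow column names from the mobility data.

```python
import pandas as pd

# Made-up values; each dictionary key becomes a column (variable)
# and each position in the lists becomes a row (observation)
toy = pd.DataFrame({
    "country_region": ["United Kingdom", "United Kingdom", "France"],
    "date": pd.to_datetime(["2020-03-24", "2020-03-25", "2020-03-24"]),
    "workplaces_percent_change_from_baseline": [-57.0, -60.0, -55.0],
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the variable names
```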

Data Design: How are data generated?

  • The data shows percentage changes in visitors to (or time spent in) six categories of places compared to baseline days.

    • A baseline day represents a normal value for that day of the week.

    • The baseline day is the median value from the 5‑week period Jan 3 – Feb 6, 2020.
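The arithmetic behind a percent change from baseline can be sketched with invented visit counts. This is an illustration under stated assumptions, not Google's actual pipeline: the counts are made up, and the formula (value minus baseline, divided by baseline, times 100) is the usual definition of a percentage change, which we assume applies here.

```python
import pandas as pd

# Made-up daily visit counts for the baseline period (3 Jan to 6 Feb 2020):
# 200 visits on weekdays, 100 at weekends
baseline_days = pd.date_range("2020-01-03", "2020-02-06", freq="D")
baseline_visits = pd.Series(
    [200 if d.dayofweek < 5 else 100 for d in baseline_days],
    index=baseline_days,
)

# One baseline value per day of the week (0 = Monday):
# the median over the five weeks
baseline_by_weekday = baseline_visits.groupby(baseline_visits.index.dayofweek).median()

# Made-up visit count for a later day, compared with the same weekday's baseline
day = pd.Timestamp("2020-03-24")  # a Tuesday
visits_on_day = 90
baseline = baseline_by_weekday.loc[day.dayofweek]
change = (visits_on_day - baseline) / baseline * 100
print(round(change, 1))
```

Because each weekday has its own baseline, a quiet Sunday is compared with a typical Sunday rather than with a busy Monday.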

Data Ethics, Privacy, and Fairness Risks

  • Low privacy risks: individual privacy is safeguarded as data is aggregated and anonymised.

  • Low individual fairness risk but moderate group fairness risk: greater mobility in some areas during lockdowns may be misattributed to non-compliance, although it could also arise because more residents are essential workers or otherwise lack the privilege of working from home.

  • Sources of algorithmic confounding: the design of the Google Maps personalised recommendation system likely introduces mobility patterns into the data, but those would be very small at the geographic scale of the data (i.e., districts, counties).

Describe your data

# Number of rows and columns in your DataFrame  
mobility_trends.shape
(5648329, 15)
# Show a concise summary of your DataFrame
mobility_trends.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5648329 entries, 0 to 5648328
Data columns (total 15 columns):
 #   Column                                              Dtype         
---  ------                                              -----         
 0   country_region_code                                 object        
 1   country_region                                      object        
 2   sub_region_1                                        object        
 3   sub_region_2                                        object        
 4   metro_area                                          object        
 5   iso_3166_2_code                                     object        
 6   census_fips_code                                    float64       
 7   place_id                                            object        
 8   date                                                datetime64[ns]
 9   retail_and_recreation_percent_change_from_baseline  float64       
 10  grocery_and_pharmacy_percent_change_from_baseline   float64       
 11  parks_percent_change_from_baseline                  float64       
 12  transit_stations_percent_change_from_baseline       float64       
 13  workplaces_percent_change_from_baseline             float64       
 14  residential_percent_change_from_baseline            float64       
dtypes: datetime64[ns](1), float64(7), object(7)
memory usage: 646.4+ MB

Access specific columns and rows in your data

# Access a column  
mobility_trends['country_region']
0          United Arab Emirates
1          United Arab Emirates
2          United Arab Emirates
3          United Arab Emirates
4          United Arab Emirates
                   ...         
5648324                Zimbabwe
5648325                Zimbabwe
5648326                Zimbabwe
5648327                Zimbabwe
5648328                Zimbabwe
Name: country_region, Length: 5648329, dtype: object

We are interested in the data about the United Kingdom.

mobility_trends['country_region'].unique()
array(['United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda',
       'Angola', 'Argentina', 'Austria', 'Australia', 'Aruba',
       'Bosnia and Herzegovina', 'Barbados', 'Bangladesh', 'Belgium',
       'Burkina Faso', 'Bulgaria', 'Bahrain', 'Benin', 'Bolivia',
       'Brazil', 'The Bahamas', 'Botswana', 'Belarus', 'Belize', 'Canada',
       'Switzerland', "Côte d'Ivoire", 'Chile', 'Cameroon', 'Colombia',
       'Costa Rica', 'Cape Verde', 'Czechia', 'Germany', 'Denmark',
       'Dominican Republic', 'Ecuador', 'Estonia', 'Egypt', 'Spain',
       'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia',
       'Ghana', 'Greece', 'Guatemala', 'Guinea-Bissau', 'Hong Kong',
       'Honduras', 'Croatia', 'Haiti', 'Hungary', 'Indonesia', 'Ireland',
       'Israel', 'India', 'Iraq', 'Italy', 'Jamaica', 'Jordan', 'Japan',
       'Kenya', 'Kyrgyzstan', 'Cambodia', 'South Korea', 'Kuwait',
       'Kazakhstan', 'Laos', 'Lebanon', 'Liechtenstein', 'Sri Lanka',
       'Lithuania', 'Luxembourg', 'Latvia', 'Libya', 'Morocco', 'Moldova',
       'North Macedonia', 'Mali', 'Myanmar (Burma)', 'Mongolia', 'Malta',
       'Mauritius', 'Mexico', 'Malaysia', 'Mozambique', 'Namibia',
       'Niger', 'Nigeria', 'Nicaragua', 'Netherlands', 'Norway', 'Nepal',
       'New Zealand', 'Oman', 'Panama', 'Peru', 'Papua New Guinea',
       'Philippines', 'Pakistan', 'Poland', 'Puerto Rico', 'Portugal',
       'Paraguay', 'Qatar', 'Réunion', 'Romania', 'Serbia', 'Russia',
       'Rwanda', 'Saudi Arabia', 'Sweden', 'Singapore', 'Slovenia',
       'Slovakia', 'Senegal', 'El Salvador', 'Togo', 'Thailand',
       'Tajikistan', 'Turkey', 'Trinidad and Tobago', 'Taiwan',
       'Tanzania', 'Ukraine', 'Uganda', 'United States', 'Uruguay',
       'Venezuela', 'Vietnam', 'Yemen', 'South Africa', 'Zambia',
       'Zimbabwe'], dtype=object)
# Get the rows about United Kingdom and save it to its own variable
mobility_trends_UK = mobility_trends[mobility_trends['country_region'] == 'United Kingdom']
mobility_trends_UK
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
2068066 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-15 -12.0 -7.0 -35.0 -12.0 -4.0 2.0
2068067 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-16 -7.0 -6.0 -28.0 -7.0 -3.0 1.0
2068068 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-17 10.0 1.0 24.0 -2.0 -14.0 2.0
2068069 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-18 7.0 -1.0 20.0 -3.0 -14.0 2.0
2068070 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-19 6.0 -2.0 8.0 -4.0 -14.0 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2266232 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-31 20.0 14.0 222.0 -16.0 -74.0 13.0
2266233 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-06-01 21.0 31.0 177.0 -16.0 -50.0 9.0
2266234 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-06-02 15.0 23.0 157.0 -22.0 -49.0 10.0
2266235 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-06-03 14.0 24.0 133.0 -23.0 -47.0 10.0
2266236 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-06-04 -2.0 19.0 93.0 -20.0 -46.0 8.0

198171 rows × 15 columns
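The row selection above works in two steps, which the sketch below separates on a made-up DataFrame. The names toy, is_uk, and toy_uk are our own.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "country_region": ["France", "United Kingdom", "United Kingdom", "Zimbabwe"],
    "workplaces_percent_change_from_baseline": [-55.0, -57.0, -60.0, -20.0],
})

# Step 1: the comparison returns one True/False value per row
is_uk = toy["country_region"] == "United Kingdom"
print(is_uk.tolist())

# Step 2: indexing with that Boolean Series keeps only the rows marked True
toy_uk = toy[is_uk]
print(toy_uk)
```

The selected rows keep their original index labels, which is why the United Kingdom rows in the full data set start at 2068066 rather than 0.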

Exploratory data analysis

Let’s use the pandas method describe() to summarise the central tendency, dispersion and shape of our dataset’s distribution.

We summarise data from the start day of the first UK lockdown (2020-03-24) and from the start day of the third UK lockdown (2021-01-06). NaN (Not a Number) values are excluded.

# Compute descriptive statistics about the start day of first UK lockdown
mobility_trends_UK[mobility_trends_UK['date'] == '2020-03-24'].describe()
census_fips_code retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
count 0.0 417.000000 417.000000 350.000000 417.000000 419.000000 395.000000
mean NaN -69.597122 -23.323741 -9.688571 -57.889688 -56.830549 23.278481
std NaN 5.480805 6.070756 20.769099 10.808531 7.503583 3.721036
min NaN -94.000000 -79.000000 -88.000000 -87.000000 -82.000000 14.000000
25% NaN -73.000000 -26.000000 -20.000000 -66.000000 -61.000000 21.000000
50% NaN -69.000000 -23.000000 -11.000000 -58.000000 -56.000000 23.000000
75% NaN -67.000000 -20.000000 1.000000 -51.000000 -52.000000 25.000000
max NaN -50.000000 -4.000000 152.000000 -13.000000 -36.000000 37.000000
# Compute descriptive statistics about the start day of third UK lockdown
mobility_trends_UK[mobility_trends_UK['date'] == '2021-01-06'].describe()
census_fips_code retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
count 0.0 413.000000 414.000000 357.000000 416.000000 419.000000 413.000000
mean NaN -58.910412 -21.253623 -8.439776 -57.281250 -48.107399 20.726392
std NaN 6.648178 7.038870 20.869845 10.027203 8.146706 3.576515
min NaN -94.000000 -80.000000 -89.000000 -87.000000 -75.000000 13.000000
25% NaN -62.000000 -25.000000 -20.000000 -64.000000 -53.000000 18.000000
50% NaN -59.000000 -22.000000 -10.000000 -58.000000 -47.000000 20.000000
75% NaN -55.000000 -18.000000 2.000000 -51.000000 -42.000000 23.000000
max NaN -41.000000 0.000000 128.000000 -26.000000 -28.000000 33.000000
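The count row differs between columns because describe() skips missing values. The sketch below reproduces that behaviour on invented numbers for a single column.

```python
import numpy as np
import pandas as pd

# Made-up values; one observation on 2020-03-24 is missing
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-24"] * 4 + ["2020-03-25"]),
    "parks_percent_change_from_baseline": [-20.0, np.nan, -10.0, 0.0, 5.0],
})

# Comparing a datetime column with a date string selects the rows for that day
one_day = toy[toy["date"] == "2020-03-24"]

# describe() ignores NaN, so count is lower than the number of selected rows
summary = one_day["parks_percent_change_from_baseline"].describe()
print(len(one_day))
print(summary)
```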

Visualising a single time series variable

A time series is a sequence of data points arranged in time order. We import the seaborn library and use the relplot function to plot the relationship between time and mobility change.

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_theme(context='notebook', style="darkgrid", palette="pastel", font_scale=1.5)

sns.relplot(x = 'date',
            y = 'workplaces_percent_change_from_baseline',
            height = 5, aspect = 3,
            kind = 'line', lw = 3,
            data = mobility_trends_UK)
<seaborn.axisgrid.FacetGrid at 0x7fe0d2ca18e0>
[Figure: line plot of workplaces_percent_change_from_baseline (y-axis) against date (x-axis) for the United Kingdom]

Data wrangling

Transforming from wide to long data format

In the original data, each mobility category is a separate column, which is known as wide data format. Wide data format is easy to read but restricts us to plotting only one mobility category at a time (unless we employ a for loop). We can plot all mobility categories simultaneously in seaborn after we reshape our data from wide format to long format. Long data format will have one column for all six mobility categories and one column for the values of those categories.

Below is a pandas schematic of wide (left) and long (right) data formats:

We use the pandas melt function to reshape our mobility categories from wide to long format. The function transforms a DataFrame into a format where one or more columns are identifier variables (id_vars), while other columns (value_vars) are turned into long format, returning two new columns, 'variable' and 'value'. In our example, id_vars are country_region, sub_region_1, and date, and value_vars are the six mobility categories. The melt function takes the following parameters:

  • DataFrame — your pandas DataFrame

  • id_vars — a list of identifier variables

  • value_vars — a list of variables to turn into long format

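At the end, we use the pandas method dropna() to remove missing values from the DataFrame. Because dropna() removes every row that contains a missing value in any column, the country-level rows, whose sub_region_1 is NaN, are removed too.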
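The reshaping is easier to follow on a tiny example. The sketch below uses invented values and shortened, hypothetical column names (parks, workplaces) rather than the full names in the mobility data.

```python
import numpy as np
import pandas as pd

# Made-up wide data: one column per mobility category
wide = pd.DataFrame({
    "sub_region_1": ["York", "York"],
    "date": pd.to_datetime(["2021-06-03", "2021-06-04"]),
    "parks": [133.0, 93.0],
    "workplaces": [-47.0, np.nan],
})

# Identifier columns are repeated for every category;
# the category columns are stacked into 'variable' and 'value'
long = pd.melt(wide, id_vars=["sub_region_1", "date"], value_vars=["parks", "workplaces"])

# dropna() removes the one row whose value is missing
long = long.dropna()
print(long)
```

Two rows and two categories give four long-format rows, of which three remain after the missing value is dropped.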

# From wide to long format using the function melt
mobility_trends_UK_long = pd.melt(mobility_trends_UK,
                                  # Columns 1, 2, and 8: country_region, sub_region_1, date
                                  id_vars = mobility_trends_UK.columns[[1, 2, 8]],
                                  # Columns 9 to 14: the six mobility categories
                                  value_vars = mobility_trends_UK.columns[9:15]).dropna()
mobility_trends_UK_long
country_region sub_region_1 date variable value
476 United Kingdom Aberdeen City 2020-02-15 retail_and_recreation_percent_change_from_base... -3.0
477 United Kingdom Aberdeen City 2020-02-16 retail_and_recreation_percent_change_from_base... 6.0
478 United Kingdom Aberdeen City 2020-02-17 retail_and_recreation_percent_change_from_base... 11.0
479 United Kingdom Aberdeen City 2020-02-18 retail_and_recreation_percent_change_from_base... 5.0
480 United Kingdom Aberdeen City 2020-02-19 retail_and_recreation_percent_change_from_base... 2.0
... ... ... ... ... ...
1189021 United Kingdom York 2021-05-31 residential_percent_change_from_baseline 13.0
1189022 United Kingdom York 2021-06-01 residential_percent_change_from_baseline 9.0
1189023 United Kingdom York 2021-06-02 residential_percent_change_from_baseline 10.0
1189024 United Kingdom York 2021-06-03 residential_percent_change_from_baseline 10.0
1189025 United Kingdom York 2021-06-04 residential_percent_change_from_baseline 8.0

1102933 rows × 5 columns

Mobility changes across lockdowns

For each lockdown, we consider the first three weeks for comparability.

  • 2020-03-24 — 2020-04-13

  • 2020-11-05 — 2020-11-25

  • 2021-01-06 — 2021-01-26

# Subset the data for the first three weeks of each lockdown

lockdown1 = mobility_trends_UK_long[(mobility_trends_UK_long['date'] >= '2020-03-24') & (mobility_trends_UK_long['date'] <= '2020-04-13')]
lockdown2 = mobility_trends_UK_long[(mobility_trends_UK_long['date'] >= '2020-11-05') & (mobility_trends_UK_long['date'] <= '2020-11-25')]
lockdown3 = mobility_trends_UK_long[(mobility_trends_UK_long['date'] >= '2021-01-06') & (mobility_trends_UK_long['date'] <= '2021-01-26')]
  
# Link the three DataFrames into one DataFrame using the Pandas `concat()` function
lockdowns = pd.concat([lockdown1,lockdown2,lockdown3], 
                      keys=["lockdown1", "lockdown2", "lockdown3"]).reset_index()

lockdowns.head()                      
level_0 level_1 country_region sub_region_1 date variable value
0 lockdown1 514 United Kingdom Aberdeen City 2020-03-24 retail_and_recreation_percent_change_from_base... -75.0
1 lockdown1 515 United Kingdom Aberdeen City 2020-03-25 retail_and_recreation_percent_change_from_base... -78.0
2 lockdown1 516 United Kingdom Aberdeen City 2020-03-26 retail_and_recreation_percent_change_from_base... -80.0
3 lockdown1 517 United Kingdom Aberdeen City 2020-03-27 retail_and_recreation_percent_change_from_base... -80.0
4 lockdown1 518 United Kingdom Aberdeen City 2020-03-28 retail_and_recreation_percent_change_from_base... -87.0
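The columns level_0 and level_1 in the output come from the combination of keys and reset_index(). The sketch below shows the mechanism on two made-up subsets; the names first, second, stacked, and flat are our own.

```python
import pandas as pd

# Made-up subsets standing in for two lockdown periods
first = pd.DataFrame({"value": [-75.0, -78.0]}, index=[514, 515])
second = pd.DataFrame({"value": [-47.0]}, index=[900])

# keys labels each input; the labels form the outer level of a two-level index
stacked = pd.concat([first, second], keys=["lockdown1", "lockdown2"])

# reset_index() moves both unnamed index levels into columns
# called level_0 (the key) and level_1 (the original row label)
flat = stacked.reset_index()
print(flat)
```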

Split-Apply-Combine

Using the Pandas method groupby(), we split the data into groups (lockdown by mobility category), apply the function mean(), and combine the results.

# Compute the mean mobility change for each lockdown and mobility category
lockdowns.groupby(['level_0', 'variable'], sort=False)['value'].mean()
level_0    variable                                          
lockdown1  retail_and_recreation_percent_change_from_baseline   -76.643915
           grocery_and_pharmacy_percent_change_from_baseline    -33.936180
           parks_percent_change_from_baseline                   -21.391601
           transit_stations_percent_change_from_baseline        -65.397655
           workplaces_percent_change_from_baseline              -64.930514
           residential_percent_change_from_baseline              25.771162
lockdown2  retail_and_recreation_percent_change_from_baseline   -47.537369
           grocery_and_pharmacy_percent_change_from_baseline    -11.587230
           parks_percent_change_from_baseline                     8.789530
           transit_stations_percent_change_from_baseline        -46.186192
           workplaces_percent_change_from_baseline              -33.901246
           residential_percent_change_from_baseline              14.285984
lockdown3  retail_and_recreation_percent_change_from_baseline   -61.713526
           grocery_and_pharmacy_percent_change_from_baseline    -23.850479
           parks_percent_change_from_baseline                    -9.215003
           transit_stations_percent_change_from_baseline        -58.740466
           workplaces_percent_change_from_baseline              -44.636270
           residential_percent_change_from_baseline              18.332545
Name: value, dtype: float64
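To make the three stages visible, the sketch below carries them out by hand on invented values and then repeats them with a single groupby() call. The variable names are our own.

```python
import pandas as pd

# Made-up values for two lockdowns
toy = pd.DataFrame({
    "level_0": ["lockdown1", "lockdown1", "lockdown2", "lockdown2"],
    "value": [-20.0, -10.0, 5.0, 15.0],
})

# Split: collect the row labels that belong to each lockdown
groups = toy.groupby("level_0").groups

# Apply and combine by hand: one mean per group, gathered into a Series
by_hand = pd.Series({name: toy.loc[rows, "value"].mean() for name, rows in groups.items()})

# The same three steps in one groupby call
by_groupby = toy.groupby("level_0")["value"].mean()
print(by_groupby)
```

Both routes give the same means; groupby() is simply the shorter and faster way to write it.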

Visual comparison of the three UK lockdowns

# Display the three lockdowns as a catplot multi-plot
grid = sns.catplot(x = "country_region",
                   y = "value",
                   hue = "variable",
                   col = "level_0",
                   errorbar = ("ci", 99),  # seaborn 0.12 or later; older versions use ci=99
                   kind = "bar",
                   data = lockdowns)
grid.set_ylabels("Mean mobility change from baseline (%)")
<seaborn.axisgrid.FacetGrid at 0x7fe049440820>
[Figure: bar plots of mean mobility change from baseline (%) by mobility category, one panel per lockdown]

Recap

Using a Colab computational notebook and Python open source tools, we analysed large real-world COVID-19 mobility data through exploratory data analysis and visualisation to address a research question of public health policy relevance.

Accessible hands-on data analysis

We integrated interactive tools and practical hands-on coding that lower barriers to entry for students with little to no programming skills.

Open reproducible research workflow

We combined the research question, Python code, explanatory text (comments), exploratory outputs, and visualisations in a single document, so anyone can check, reproduce, re-use (when we license it as open source), and improve our analysis.