Data Design & Data Wrangling

‘Data wrangling’ is the mundane labor of collecting, loading, transforming, and preparing data for exploration and analysis. Although it is hard to put a number on it, estimates indicate that data scientists spend from 50 per cent to 80 per cent of their time on data wrangling. The importance of data wrangling has been recognised in social sciencies as well.

Although data wrangling may seem a-theoretical, some of the decisions you would make would in fact depend on how the data were generated or designed.

In this lab, we will first discuss issues related to data design and will then practice hands-on data wrangling with pandas and other Python libraries for data analysis. We will wrangle open data sets, including the Google Covid-19 Community Mobility Reports and Apple Mobility Trends Reports.

Key themes

  • What are data? How are data generated (i.e., data design)?

  • Data wrangling with pandas: loading, selecting, and transforming data.

Learning resources

Matthew Salganik. 2.3 Ten common characteristics of big data. In Bit by Bit: Social Research in the Digital Age. [Online version freely available]

Lisa Gitelman and Virginia Jackson. “Raw Data” Is an Oxymoron. Introduction. MIT Press.

Sabina Leonelli. Scientific Research and Big Data. The Stanford Encyclopedia of Philosophy.

Sam Lau, Joey Gonzalez, and Deb Nolan. The Data Science Lifecycle. In Principles and Techniques of Data Science.

Steve Lohr. For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. New York Times.

Alexander Kindel et al. Improving Metadata Infrastructure for Complex Surveys: Insights from the Fragile Families Challenge. Socius.

Introduction to Data Processing in Python with Pandas, SciPy 2019 Tutorial by Daniel Chen. See also freely available Jupyter notebooks: 01-intro.ipynb and 02-tidy.ipynb.


Activity: Discussing data biases

After reading the section Ten common characteristics of big data in Matthew Salganik’s book Bit by Bit, we will discuss how each characteristic of big data applies to the digital data sets we are analysing in this and next sections, including the Google COVID-19 Community Mobility Reports and Apple Mobility Trends Reports.

Get started with pandas

The pandas library:

  • is a fast, powerful, and flexible open source tool for doing real world data analysis in Python.

  • offers a diverse range of high-performance tools for data loading, cleaning, wrangling, merging, reshaping, and summarising.

  • is the go-to data science library in Python.

Importing pandas

To use pandas, we first import the library via the Python’s import command and give it the alias pd:

import pandas as pd

pd.__version__

# Suppress warnings to avoid potential confusion
import warnings
warnings.filterwarnings("ignore")

Loading your data

Pandas supports many data file formats, including csv, excel, sql, json. For details, see How do I read and write tabular data? For example, once you import pandas as pd, you can load a .csv file using the function pd.read_csv(), an Excel file using the function pd.read_excel(), and a JSON file using the function pd.read_json().

Loading Covid-19 data sets via URL

What are the Google Covid-19 Community Mobility Reports?

  • Aggregated and anonymized sets of freely available data that protect individual privacy.

  • Shows trends of human mobility over time by country and region, across six categories of places, including retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.

  • The data shows how visitors to (or time spent in) categorized places change compared to baseline days. A baseline day represents a normal value for that day of the week. The baseline day is the median value from the 5‑week period Jan 3 – Feb 6, 2020.

An overview of the data from the mobility data can be found here.

The code below loads the most recent version of the Covid-19 Google Community Mobility Reports as a pandas DataFrame object and assign the object to a variable called ‘mobility_trends’

mobility_trends = pd.read_csv(
    "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv",
    parse_dates=["date"],
)
mobility_trends
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
0 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-15 0.0 4.0 5.0 0.0 2.0 1.0
1 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-16 1.0 4.0 4.0 1.0 2.0 1.0
2 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-17 -1.0 1.0 5.0 1.0 2.0 1.0
3 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-18 -2.0 1.0 5.0 0.0 2.0 1.0
4 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-19 -2.0 0.0 4.0 -1.0 2.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9900501 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-10 NaN NaN NaN NaN 126.0 NaN
9900502 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-11 NaN NaN NaN NaN 129.0 NaN
9900503 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-12 NaN NaN NaN NaN 116.0 NaN
9900504 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-13 NaN NaN NaN NaN 118.0 NaN
9900505 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-16 NaN NaN NaN NaN 100.0 NaN

9900506 rows × 15 columns

Pandas load the data as a DataFrame object. We can confirm this by using the built-in function type():

type(mobility_trends)
pandas.core.frame.DataFrame

What is a DataFrame?

A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.

Viewing, Describing, and Accessing your Data

Once we have our data as a DataFrame data structure, we can view, describe, and access the data by applying basic functionalities associated with that structure. For example, we can perform various operations by using the so called methods. Examples of pandas methods are head() and tail(); head() displays by default the top five rows of the data and tail() displays by default the last five rows. You can display a custom number of row by passing that number in the brackets.

View your data

# Show the first five rows using the method DataFrame.head()
mobility_trends.head()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
0 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-15 0.0 4.0 5.0 0.0 2.0 1.0
1 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-16 1.0 4.0 4.0 1.0 2.0 1.0
2 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-17 -1.0 1.0 5.0 1.0 2.0 1.0
3 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-18 -2.0 1.0 5.0 0.0 2.0 1.0
4 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-19 -2.0 0.0 4.0 -1.0 2.0 1.0
# Show the last five rows using the method DataFrame.tail()
mobility_trends.tail()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
9900501 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-10 NaN NaN NaN NaN 126.0 NaN
9900502 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-11 NaN NaN NaN NaN 129.0 NaN
9900503 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-12 NaN NaN NaN NaN 116.0 NaN
9900504 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-13 NaN NaN NaN NaN 118.0 NaN
9900505 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-16 NaN NaN NaN NaN 100.0 NaN
# Specify the number of rows to return
mobility_trends.tail(10)
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
9900496 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-03 NaN NaN NaN NaN 119.0 NaN
9900497 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-04 NaN NaN NaN NaN 116.0 NaN
9900498 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-05 NaN NaN NaN NaN 110.0 NaN
9900499 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-06 NaN NaN NaN NaN 124.0 NaN
9900500 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-09 NaN NaN NaN NaN 98.0 NaN
9900501 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-10 NaN NaN NaN NaN 126.0 NaN
9900502 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-11 NaN NaN NaN NaN 129.0 NaN
9900503 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-12 NaN NaN NaN NaN 116.0 NaN
9900504 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-13 NaN NaN NaN NaN 118.0 NaN
9900505 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2022-05-16 NaN NaN NaN NaN 100.0 NaN

Describing your DataFrame

# Dimensionality — number of rows and columns — of a DataFrame
mobility_trends.shape
(9900506, 15)
# Use the print function to display the number of rows and columns
# in a DataFrame at the time of cell execution.
# Import the 'datetime' module (used below to display the current date)
import datetime

print(
    "\nAs of",
    datetime.datetime.now(),
    "The Google COVID-19 Community Mobility Reports contain",
    mobility_trends.shape[0],
    "rows and",
    mobility_trends.shape[1],
    "columns.",
)
As of 2022-05-19 17:59:52.163517 The Google COVID-19 Community Mobility Reports contain 9900506 rows and 15 columns.
# Accessing columns using the DataFrame.columns attribute
mobility_trends.columns
Index(['country_region_code', 'country_region', 'sub_region_1', 'sub_region_2',
       'metro_area', 'iso_3166_2_code', 'census_fips_code', 'place_id', 'date',
       'retail_and_recreation_percent_change_from_baseline',
       'grocery_and_pharmacy_percent_change_from_baseline',
       'parks_percent_change_from_baseline',
       'transit_stations_percent_change_from_baseline',
       'workplaces_percent_change_from_baseline',
       'residential_percent_change_from_baseline'],
      dtype='object')

Let’s display a concise summary of our DataFrame using the method info()

# Information about a DataFrame
mobility_trends.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9900506 entries, 0 to 9900505
Data columns (total 15 columns):
 #   Column                                              Dtype         
---  ------                                              -----         
 0   country_region_code                                 object        
 1   country_region                                      object        
 2   sub_region_1                                        object        
 3   sub_region_2                                        object        
 4   metro_area                                          object        
 5   iso_3166_2_code                                     object        
 6   census_fips_code                                    float64       
 7   place_id                                            object        
 8   date                                                datetime64[ns]
 9   retail_and_recreation_percent_change_from_baseline  float64       
 10  grocery_and_pharmacy_percent_change_from_baseline   float64       
 11  parks_percent_change_from_baseline                  float64       
 12  transit_stations_percent_change_from_baseline       float64       
 13  workplaces_percent_change_from_baseline             float64       
 14  residential_percent_change_from_baseline            float64       
dtypes: datetime64[ns](1), float64(7), object(7)
memory usage: 1.1+ GB

Among other information, displayed are labels of the columns and the data type of each column. For example ‘float64’ refers to a floating point number, i.e., a number with a decimal point. In the Community Mobility Reports, floating point numbers are used to display the percentage mobility change for each category of places.

Accessing columns and rows in your data

Accessing columns

You can can access columns via column name and column position.

  • Accessing columns via column name

Let’s access a single column.

# Get the country column and save it to its own variable
# The double square bracket option `[[]]` gives DataFrame

mobility_trends_countries = mobility_trends[["country_region"]]
mobility_trends_countries.head()
country_region
0 United Arab Emirates
1 United Arab Emirates
2 United Arab Emirates
3 United Arab Emirates
4 United Arab Emirates
# Display the type of data structure

type(mobility_trends_countries)
pandas.core.frame.DataFrame
# The single square braket `[]` option gives Series
# pandas Series is a one-dimensional data structure or simply put, a single column

mobility_trends_countries_series = mobility_trends["country_region"].head()
# Display the type of data structure
type(mobility_trends_countries_series)
pandas.core.series.Series

Let’s now access multiple columns. We can access multiple columns by using a Python list. We use square brakets to create a list, for example ['country_region', 'sub_region_1', 'date']. Python lists are useful as they store multiple items in the same variable.

mobility_trends[["country_region", "sub_region_1", "date"]].head()
country_region sub_region_1 date
0 United Arab Emirates NaN 2020-02-15
1 United Arab Emirates NaN 2020-02-16
2 United Arab Emirates NaN 2020-02-17
3 United Arab Emirates NaN 2020-02-18
4 United Arab Emirates NaN 2020-02-19
  • Accessing columns via column position

# Accessing columns 1, 2, adn 7 via column position and show the top five rows

mobility_trends.iloc[:, [1, 2, 7]].head()
country_region sub_region_1 place_id
0 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
1 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
2 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
3 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
4 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
# Accessing a subset of rows and columns. We access the top 3 rows and columns 1, 2, and 7

mobility_trends.iloc[0:3, [1, 2, 7]].head()
country_region sub_region_1 place_id
0 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
1 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
2 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM

Hands-on exercise

  1. Create a new code cell.

  2. Access the first 100 rows and the following columns from the DataFrame mobility_trends:

    • country_region_code

    • date

    • retail_recreation

  3. Assign the output to a variable called mobility_trends_subset.

Accessing rows

Rows can be accessed via row labels df.loc and row index df.iloc.

# Before accessing particular rows, let's see the names of all countries in the dataset.
# We can achieve that by listing all unique values in the 'country_region' column.
mobility_trends.country_region.unique()
array(['United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda',
       'Angola', 'Argentina', 'Austria', 'Australia', 'Aruba',
       'Bosnia and Herzegovina', 'Barbados', 'Bangladesh', 'Belgium',
       'Burkina Faso', 'Bulgaria', 'Bahrain', 'Benin', 'Bolivia',
       'Brazil', 'The Bahamas', 'Botswana', 'Belarus', 'Belize', 'Canada',
       'Switzerland', "Côte d'Ivoire", 'Chile', 'Cameroon', 'Colombia',
       'Costa Rica', 'Cape Verde', 'Czechia', 'Germany', 'Denmark',
       'Dominican Republic', 'Ecuador', 'Estonia', 'Egypt', 'Spain',
       'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia',
       'Ghana', 'Greece', 'Guatemala', 'Guinea-Bissau', 'Hong Kong',
       'Honduras', 'Croatia', 'Haiti', 'Hungary', 'Indonesia', 'Ireland',
       'Israel', 'India', 'Iraq', 'Italy', 'Jamaica', 'Jordan', 'Japan',
       'Kenya', 'Kyrgyzstan', 'Cambodia', 'South Korea', 'Kuwait',
       'Kazakhstan', 'Laos', 'Lebanon', 'Liechtenstein', 'Sri Lanka',
       'Lithuania', 'Luxembourg', 'Latvia', 'Libya', 'Morocco', 'Moldova',
       'North Macedonia', 'Mali', 'Myanmar (Burma)', 'Mongolia', 'Malta',
       'Mauritius', 'Mexico', 'Malaysia', 'Mozambique', 'Namibia',
       'Niger', 'Nigeria', 'Nicaragua', 'Netherlands', 'Norway', 'Nepal',
       'New Zealand', 'Oman', 'Panama', 'Peru', 'Papua New Guinea',
       'Philippines', 'Pakistan', 'Poland', 'Puerto Rico', 'Portugal',
       'Paraguay', 'Qatar', 'Réunion', 'Romania', 'Serbia', 'Russia',
       'Rwanda', 'Saudi Arabia', 'Sweden', 'Singapore', 'Slovenia',
       'Slovakia', 'Senegal', 'El Salvador', 'Togo', 'Thailand',
       'Tajikistan', 'Turkey', 'Trinidad and Tobago', 'Taiwan',
       'Tanzania', 'Ukraine', 'Uganda', 'United States', 'Uruguay',
       'Venezuela', 'Vietnam', 'Yemen', 'South Africa', 'Zambia',
       'Zimbabwe'], dtype=object)
# Accessing specific rows from a DataFrame.
# We are interested in the data about the United Kingdom.

mobility_trends_UK = mobility_trends[
    mobility_trends["country_region"] == "United Kingdom"
]

mobility_trends_UK.head()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
3647576 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-15 -12.0 -7.0 -35.0 -12.0 -4.0 2.0
3647577 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-16 -7.0 -6.0 -28.0 -7.0 -3.0 1.0
3647578 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-17 10.0 1.0 24.0 -2.0 -14.0 2.0
3647579 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-18 7.0 -1.0 20.0 -3.0 -14.0 2.0
3647580 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-19 6.0 -2.0 8.0 -4.0 -14.0 3.0
# Accessing data about multiple countries using the isin() method.
mobility_trends_countries = mobility_trends[
    mobility_trends["country_region"].isin(
        ["United Kingdom", "Germany", "Italy", "Sweden"]
    )
]

mobility_trends_countries.head()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
3174838 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-15 6.0 1.0 45.0 10.0 0.0 -1.0
3174839 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-16 7.0 10.0 9.0 6.0 -1.0 0.0
3174840 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-17 2.0 2.0 7.0 1.0 -2.0 0.0
3174841 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-18 2.0 2.0 10.0 1.0 -1.0 1.0
3174842 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-19 3.0 0.0 6.0 -1.0 -1.0 1.0
# Filter by two conditions — country and county — simultenously
# First let's see the list of counties in the dataset

mobility_trends[
    mobility_trends["country_region"] == "United Kingdom"
].sub_region_1.unique()
array([nan, 'Aberdeen City', 'Aberdeenshire', 'Angus Council',
       'Antrim and Newtownabbey', 'Ards and North Down',
       'Argyll and Bute Council', 'Armagh City, Banbridge and Craigavon',
       'Bath and North East Somerset', 'Bedford', 'Belfast',
       'Blackburn with Darwen', 'Blackpool', 'Blaenau Gwent',
       'Borough of Halton', 'Bracknell Forest', 'Bridgend County Borough',
       'Brighton and Hove', 'Bristol City', 'Buckinghamshire',
       'Caerphilly County Borough', 'Cambridgeshire', 'Cardiff',
       'Carmarthenshire', 'Causeway Coast and Glens',
       'Central Bedfordshire', 'Ceredigion', 'Cheshire East',
       'Cheshire West and Chester', 'Clackmannanshire',
       'Conwy Principal Area', 'Cornwall', 'County Durham', 'Cumbria',
       'Darlington', 'Denbighshire', 'Derby', 'Derbyshire',
       'Derry and Strabane', 'Devon', 'Dorset', 'Dumfries and Galloway',
       'Dundee City Council', 'East Ayrshire Council',
       'East Dunbartonshire Council', 'East Lothian Council',
       'East Renfrewshire Council', 'East Riding of Yorkshire',
       'East Sussex', 'Edinburgh', 'Essex', 'Falkirk',
       'Fermanagh and Omagh', 'Fife', 'Flintshire', 'Glasgow City',
       'Gloucestershire', 'Greater London', 'Greater Manchester',
       'Gwynedd', 'Hampshire', 'Hartlepool', 'Herefordshire',
       'Hertfordshire', 'Highland Council', 'Inverclyde',
       'Isle of Anglesey', 'Isle of Wight', 'Kent', 'Kingston upon Hull',
       'Lancashire', 'Leicester', 'Leicestershire', 'Lincolnshire',
       'Lisburn and Castlereagh', 'Luton', 'Medway', 'Merseyside',
       'Merthyr Tydfil County Borough', 'Mid and East Antrim',
       'Mid Ulster', 'Middlesbrough', 'Midlothian', 'Milton Keynes',
       'Monmouthshire', 'Moray', 'Na h-Eileanan an Iar',
       'Neath Port Talbot Principle Area', 'Newport',
       'Newry, Mourne and Down', 'Norfolk', 'North Ayrshire Council',
       'North East Lincolnshire', 'North Lanarkshire',
       'North Lincolnshire', 'North Somerset', 'North Yorkshire',
       'Northamptonshire', 'Northumberland', 'Nottingham',
       'Nottinghamshire', 'Orkney', 'Oxfordshire', 'Pembrokeshire',
       'Perth and Kinross', 'Peterborough', 'Plymouth', 'Portsmouth',
       'Powys', 'Reading', 'Redcar and Cleveland', 'Renfrewshire',
       'Rhondda Cynon Taff', 'Rutland', 'Scottish Borders',
       'Shetland Islands', 'Shropshire', 'Slough', 'Somerset',
       'South Ayrshire Council', 'South Gloucestershire',
       'South Lanarkshire', 'South Yorkshire', 'Southampton',
       'Southend-on-Sea', 'Staffordshire', 'Stirling', 'Stockton-on-Tees',
       'Stoke-on-Trent', 'Suffolk', 'Surrey', 'Swansea', 'Swindon',
       'Thurrock', 'Torbay', 'Torfaen Principal Area', 'Tyne and Wear',
       'Vale of Glamorgan', 'Warrington', 'Warwickshire',
       'West Berkshire', 'West Dunbartonshire Council', 'West Lothian',
       'West Midlands', 'West Sussex', 'West Yorkshire', 'Wiltshire',
       'Windsor and Maidenhead', 'Wokingham', 'Worcestershire',
       'Wrexham Principal Area', 'York'], dtype=object)
# Access data about UK and Essex
mobility_trends_UK_Essex = mobility_trends[
    (mobility_trends["country_region"] == "United Kingdom")
    & (mobility_trends["sub_region_1"] == "Essex")
]

mobility_trends_UK_Essex.head()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
3720375 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-15 -9.0 -7.0 -27.0 -11.0 -3.0 2.0
3720376 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-16 -16.0 -12.0 -45.0 -9.0 -5.0 2.0
3720377 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-17 9.0 -1.0 23.0 -1.0 -16.0 3.0
3720378 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-18 10.0 -4.0 15.0 0.0 -16.0 2.0
3720379 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-19 9.0 -3.0 10.0 1.0 -16.0 2.0

In the above example, you could simply type in

mobility_trends[mobility_trends['sub_region_1']=='Essex']

The command will return the same result as there is no region called ‘Essex’ outside the UK. However, there are instances in the data where regions in different countries have the same name. When this is the case, not specifying the country would return confusing outputs.

Accessing multiple rows and columns and conditioning

mobility_trends_UK
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
3647576 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-15 -12.0 -7.0 -35.0 -12.0 -4.0 2.0
3647577 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-16 -7.0 -6.0 -28.0 -7.0 -3.0 1.0
3647578 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-17 10.0 1.0 24.0 -2.0 -14.0 2.0
3647579 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-18 7.0 -1.0 20.0 -3.0 -14.0 2.0
3647580 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-19 6.0 -2.0 8.0 -4.0 -14.0 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3990325 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-12 -3.0 10.0 91.0 -20.0 -25.0 5.0
3990326 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-13 -9.0 8.0 71.0 -12.0 -26.0 5.0
3990327 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-14 -15.0 6.0 23.0 -7.0 -5.0 -1.0
3990328 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-15 -8.0 9.0 7.0 -10.0 -5.0 0.0
3990329 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-16 -5.0 9.0 12.0 -24.0 -30.0 6.0

342754 rows × 15 columns

# Let's see which UK counties had the lower retail and recreation mobility
# the day after Italy went in lockdown on 9 March 2020
mobility_trends_UK_10MARCH2020 = mobility_trends_UK.loc[
    (mobility_trends_UK["date"] == "2020-03-10")
    & (mobility_trends_UK["retail_and_recreation_percent_change_from_baseline"] < 0),
    ["sub_region_1", "retail_and_recreation_percent_change_from_baseline"],
]

# Sort in decreasing order
mobility_trends_UK_10MARCH2020.sort_values(
    by="retail_and_recreation_percent_change_from_baseline", ascending=True
)
sub_region_1 retail_and_recreation_percent_change_from_baseline
3739259 Gloucestershire -16.0
3741716 Greater London -14.0
3763910 Greater London -13.0
3742538 Greater London -12.0
3758978 Greater London -11.0
... ... ...
3948491 Tyne and Wear -1.0
3885110 Nottinghamshire -1.0
3772952 Greater Manchester -1.0
3930454 Suffolk -1.0
3803289 Kent -1.0

167 rows × 2 columns

# UK counties with the lower retail and recreation
# the day after UK went in lockdown on 23 March 2020
mobility_trends_UK_24MARCH2020 = mobility_trends_UK.loc[
    (mobility_trends_UK["date"] == "2020-03-24")
    & (mobility_trends_UK["retail_and_recreation_percent_change_from_baseline"] < 0),
    ["sub_region_1", "retail_and_recreation_percent_change_from_baseline"],
]

# Sort in decreasing order
mobility_trends_UK_24MARCH2020.sort_values(
    by="retail_and_recreation_percent_change_from_baseline", ascending=True
)
sub_region_1 retail_and_recreation_percent_change_from_baseline
3741730 Greater London -94.0
3742552 Greater London -92.0
3747484 Greater London -87.0
3767212 Greater London -84.0
3756526 Greater London -83.0
... ... ...
3969853 West Sussex -56.0
3686799 Cumbria -56.0
3808235 Kent -54.0
3809879 Kent -54.0
3842701 Merseyside -50.0

417 rows × 2 columns

Let’s identify the places in the UK with the highest reduction of workplaces mobility.

# We need to first find the index of the row
# with the highest reduction of workplaces mobility
mobility_trends_UK_workplaces = mobility_trends_UK[
    "workplaces_percent_change_from_baseline"
].idxmin()

mobility_trends_UK_workplaces
3756546
# Use the above index to access the row
# with the highest reduction of workplaces mobility
mobility_trends.iloc[[mobility_trends_UK_workplaces]]
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
3756546 GB United Kingdom Greater London London Borough of Islington NaN GB-ISL NaN ChIJ5Tvr7V0bdkgRyWXsGMuF8FQ 2020-04-13 -88.0 -51.0 -28.0 -90.0 -91.0 40.0

In our basic exploratory data analysis, we found that as of 21 April 2021 the largest percentage mobility change in workplaces was in the London Borough of Islington, Greater London, on 13 April 2020 when the workplaces mobility was reduced with 91 per cent compared to a baseline period. This result is only for exploratory purposes. Note that the same mobility in two different days would result in different percentage changes due to the way in which actual mobility is compared to a baseline in the Google Covid-19 Community Mobility Reports.

Handling Missing Values

Many data sets have missing values and the mobility data is not an exception. Handling missing values is a very common problem in data analysis. We will cover two simple ways of handling missing values in data:

  • Deletion — missing observations are simply dropped from the data set.

    • Likewise - removes all data for an observation that has one or more missing values.

    • Pairwise — removes observations with missing values only for the variables used in the current procedure (e.g., model or technique).

  • Imputation — missing observations are replaced with inferred values. We will cover simple imputation methods in which missing values are replaced with column mean or row mean.

For more advanced methods and functionality, including interpolation methods, see the section ‘Working with missing data’ in Pandas documentation.

Detecting missing values

In pandas, missing values are represented as NaN, which stands for Not a Number. To detect missing values, we can use the function isna() (other functions, for example isnull() can be used to the same effect). The function returns boolean values (False, True) indicating whether each corresponding value in a DataFrame is missing.

mobility_trends.isna()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
0 False False True True True True True False False False False False False False False
1 False False True True True True True False False False False False False False False
2 False False True True True True True False False False False False False False False
3 False False True True True True True False False False False False False False False
4 False False True True True True True False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9900501 False False False False True True True False False True True True True False True
9900502 False False False False True True True False False True True True True False True
9900503 False False False False True True True False False True True True True False True
9900504 False False False False True True True False False True True True True False True
9900505 False False False False True True True False False True True True True False True

9900506 rows × 15 columns

Let’s determine the number of missing values for each column in the DataFrame.

mobility_trends.isna().sum()
country_region_code                                      6183
country_region                                              0
sub_region_1                                           163754
sub_region_2                                          1610543
metro_area                                            9847293
iso_3166_2_code                                       8173140
census_fips_code                                      7793263
place_id                                                18182
date                                                        0
retail_and_recreation_percent_change_from_baseline    3769374
grocery_and_pharmacy_percent_change_from_baseline     4048994
parks_percent_change_from_baseline                    5217709
transit_stations_percent_change_from_baseline         4988015
workplaces_percent_change_from_baseline                354374
residential_percent_change_from_baseline              3846476
dtype: int64

It would be useful to quantify the percentage of values that are missing per column. To achieve this, we simply divide the number of missing values in a column by the number of rows in the DataFrame (and multiply the result by 100).

mobility_trends.isna().sum() / mobility_trends.shape[0] * 100
country_region_code                                    0.062451
country_region                                         0.000000
sub_region_1                                           1.653996
sub_region_2                                          16.267280
metro_area                                            99.462522
iso_3166_2_code                                       82.552750
census_fips_code                                      78.715805
place_id                                               0.183647
date                                                   0.000000
retail_and_recreation_percent_change_from_baseline    38.072539
grocery_and_pharmacy_percent_change_from_baseline     40.896839
parks_percent_change_from_baseline                    52.701438
transit_stations_percent_change_from_baseline         50.381415
workplaces_percent_change_from_baseline                3.579352
residential_percent_change_from_baseline              38.851307
dtype: float64

Let’s now repeat the procedure for the UK mobility data only:

mobility_trends_UK.isna().sum() / mobility_trends_UK.shape[0] * 100
country_region_code                                     0.000000
country_region                                          0.000000
sub_region_1                                            0.239822
sub_region_2                                           36.107237
metro_area                                            100.000000
iso_3166_2_code                                        49.442166
census_fips_code                                      100.000000
place_id                                                0.000000
date                                                    0.000000
retail_and_recreation_percent_change_from_baseline      2.760289
grocery_and_pharmacy_percent_change_from_baseline       2.684724
parks_percent_change_from_baseline                     21.118353
transit_stations_percent_change_from_baseline           3.234390
workplaces_percent_change_from_baseline                 0.518739
residential_percent_change_from_baseline                5.692713
dtype: float64

The output shows that the missing values in most mobility categories are below 5 per cent except the missing values for residential mobility (8%) and parks mobility (21.7%).

Removing missing values

We can remove missing values using the Pandas function dropna(). The default functionality removes rows that contain at least one missing value. The dropna() function provides other capabilities as well. For example, it is possible to remove columns (instead of rows) that contain missing values by specifying axis = 1 within the brackets. Also, we can specify under what conditions rows or columns are removed: use how = any if you would like to drop rows or columns that contain any missing value, which is the default functionality; use how = all if you would like to drop rows or columns in which all values are missing.

mobility_trends_UK.dropna()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline

We did not specify particular columns on which to perform the dropna() function. Because we have a column (‘metro_area’) for which all values are missing, the dropna() function removed all observations from the DataFrame. This is analogous to a likewise deletion of missing values.

A more informative approach would be to specify the columns we are going to use in our analysis and drop missing values only for this subset of variables. We can drop rows that contain any missing values in the six mobility categories by passing the labels of those categories to the parameter subset as shown below:

mobility_trends_UK.dropna(subset=mobility_trends_UK.iloc[:, 9:15].columns)
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
3647576 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-15 -12.0 -7.0 -35.0 -12.0 -4.0 2.0
3647577 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-16 -7.0 -6.0 -28.0 -7.0 -3.0 1.0
3647578 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-17 10.0 1.0 24.0 -2.0 -14.0 2.0
3647579 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-18 7.0 -1.0 20.0 -3.0 -14.0 2.0
3647580 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-19 6.0 -2.0 8.0 -4.0 -14.0 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3990325 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-12 -3.0 10.0 91.0 -20.0 -25.0 5.0
3990326 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-13 -9.0 8.0 71.0 -12.0 -26.0 5.0
3990327 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-14 -15.0 6.0 23.0 -7.0 -5.0 -1.0
3990328 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-15 -8.0 9.0 7.0 -10.0 -5.0 0.0
3990329 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2022-05-16 -5.0 9.0 12.0 -24.0 -30.0 6.0

255595 rows × 15 columns

mobility_trends_UK.shape
(342754, 15)

After removing the rows that contain missing values in the six mobility categories/columns, we are left with 137992 rows out of 191064 rows, a reduction of approximately 28%.

Replacing missing values

Let’s replace NaN values with the mean of non-missing values. For example, we can replace missing values in the columns ‘workplaces_percent_change_from_baseline’ and ‘residential_percent_change_from_baseline’ by the mean of values in the two columns as determined by the Pandas mean() method.

# Replace the NaNs in the columns 'workplaces_percent_change_from_baseline'
# and 'residential_percent_change_from_baseline'
# by the mean of the non-missing values.
mobility_trends_UK[
    [
        "workplaces_percent_change_from_baseline",
        "residential_percent_change_from_baseline",
    ]
].fillna(
    value=mobility_trends_UK[
        [
            "workplaces_percent_change_from_baseline",
            "residential_percent_change_from_baseline",
        ]
    ].mean()
)
workplaces_percent_change_from_baseline residential_percent_change_from_baseline
3647576 -4.0 2.0
3647577 -3.0 1.0
3647578 -14.0 2.0
3647579 -14.0 2.0
3647580 -14.0 3.0
... ... ...
3990325 -25.0 5.0
3990326 -26.0 5.0
3990327 -5.0 -1.0
3990328 -5.0 0.0
3990329 -30.0 6.0

342754 rows × 2 columns

We now have the same number of rows (191064) as the original mobility_trends_UK DataFrame, with each missing value replaced by the mean value for the respective column. More advanced (and practical) approaches of replacing missing values can take into account groupings in the data and replace missing values with the mean value of a group of counties or the mean value of a time period; those operations would require more advanced Pandas functionality such as groupby(), which we will use in the next lab.

Compare mobility data formats

A related aggregate, privacy-preserving, and freely available mobility data set is the Apple’s Mobility Trends Reports. The data is updated daily and reflect in aggregate requests for directions in Apple Maps for select countries/regions, sub-regions and cities.

The Mobility Trends Reports data can be downloaded from the website https://covid19.apple.com/mobility. However, because the data file is not at a stable location (URL), we cannot load the data from the website but instead access the data from a COVID-19 mobility data aggregator on GitHub.

mobility_trends_covid_apple = pd.read_csv(
    "https://raw.githubusercontent.com/ActiveConclusion/COVID19_mobility/master/apple_reports/applemobilitytrends.csv"
)
mobility_trends_covid_apple.head()
geo_type region transportation_type alternative_name sub-region country 2020-01-13 2020-01-14 2020-01-15 2020-01-16 ... 2022-04-03 2022-04-04 2022-04-05 2022-04-06 2022-04-07 2022-04-08 2022-04-09 2022-04-10 2022-04-11 2022-04-12
0 country/region Albania driving NaN NaN NaN 100.0 95.30 101.43 97.20 ... 149.58 156.76 156.76 151.02 157.86 185.12 196.62 173.11 173.21 171.09
1 country/region Albania walking NaN NaN NaN 100.0 100.68 98.93 98.46 ... 161.71 215.31 221.62 211.34 189.25 244.22 231.90 154.93 223.97 223.45
2 country/region Argentina driving NaN NaN NaN 100.0 97.07 102.45 111.21 ... 93.32 97.74 102.63 106.26 111.81 145.45 158.56 99.20 100.82 114.56
3 country/region Argentina walking NaN NaN NaN 100.0 95.11 101.37 112.67 ... 88.58 126.43 133.06 135.64 142.40 178.26 172.83 98.96 126.57 139.29
4 country/region Australia driving AU NaN NaN 100.0 102.98 104.21 108.63 ... 123.95 116.32 132.29 134.97 145.09 146.34 125.79 136.60 139.01 146.75

5 rows × 827 columns

Activity: Discussing data formats

Compare the format of the data in the Apple Mobility Trends Reports to the format of the data in the Google COVID-19 Community Mobility Reports and in Our World in Data COVID-19 data. Focus in particular on similarities and differences in how time series data is represented across data sets.

Recording dependencies

# Install the watermark extension
!pip install -q watermark

# Load the watermark extension
%load_ext watermark

# Show packages that were imported
%watermark --iversions
pandas: 1.1.3

Save a list of the packages (and their versions) used in the current notebook to a file named requirements.txt.

# Run pipreqsnb after specifying the path to the notebook
!pipreqsnb ../notebooks/04_data_design_and_data_wrangling.ipynb
pipreqs  --savepath ../notebooks/requirements.txt .//__temp_pipreqsnb_folder/
INFO: Successfully saved requirements file in ../notebooks/requirements.txt