Data Design & Data Wrangling¶

‘Data wrangling’ is the mundane labor of collecting, loading, transforming, and preparing data for exploration and analysis. Although it is hard to put a number on it, estimates indicate that data scientists spend from 50 per cent to 80 per cent of their time on data wrangling. The importance of data wrangling has been recognised in social sciencies as well.

Although data wrangling may seem a-theoretical, some of the decisions you would make would in fact depend on how the data were generated or designed.

In this lab, we will first discuss issues related to data design and will then practice hands-on data wrangling with pandas and other Python libraries for data analysis. We will wrangle open data sets, including the Google Covid-19 Community Mobility Reports and Apple Mobility Trends Reports.

Key themes¶

What are data? How are data generated (i.e., data design)?
Data wrangling with pandas: loading, selecting, and transforming data.

Learning resources¶

Matthew Salganik. 2.3 Ten common characteristics of big data. In Bit by Bit: Social Research in the Digital Age. [Online version freely available]

Lisa Gitelman and Virginia Jackson. “Raw Data” Is an Oxymoron. Introduction. MIT Press.

Sabina Leonelli. Scientific Research and Big Data. The Stanford Encyclopedia of Philosophy.

Sam Lau, Joey Gonzalez, and Deb Nolan. The Data Science Lifecycle. In Principles and Techniques of Data Science.

Steve Lohr. For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. New York Times.

Alexander Kindel et al. Improving Metadata Infrastructure for Complex Surveys: Insights from the Fragile Families Challenge. Socius.

Introduction to Data Processing in Python with Pandas, SciPy 2019 Tutorial by Daniel Chen. See also freely available Jupyter notebooks: 01-intro.ipynb and 02-tidy.ipynb.

Activity: Discussing data biases

After reading the section Ten common characteristics of big data in Matthew Salganik’s book Bit by Bit, we will discuss how each characteristic of big data applies to the digital data sets we are analysing in this and next sections, including the Google COVID-19 Community Mobility Reports and Apple Mobility Trends Reports.

Get started with `pandas`¶

The pandas library:

is a fast, powerful, and flexible open source tool for doing real world data analysis in Python.
offers a diverse range of high-performance tools for data loading, cleaning, wrangling, merging, reshaping, and summarising.
is the go-to data science library in Python.

Importing `pandas`¶

To use pandas, we first import the library via the Python’s import command and give it the alias pd:

import pandas as pd

pd.__version__

# Suppress warnings to avoid potential confusion
import warnings
warnings.filterwarnings("ignore")

Loading your data¶

Pandas supports many data file formats, including csv, excel, sql, json. For details, see How do I read and write tabular data? For example, once you import pandas as pd, you can load a .csv file using the function pd.read_csv(), an Excel file using the function pd.read_excel(), and a JSON file using the function pd.read_json().

Loading Covid-19 data sets via URL¶

What are the Google Covid-19 Community Mobility Reports?

Aggregated and anonymized sets of freely available data that protect individual privacy.
Shows trends of human mobility over time by country and region, across six categories of places, including retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.
The data shows how visitors to (or time spent in) categorized places change compared to baseline days. A baseline day represents a normal value for that day of the week. The baseline day is the median value from the 5‑week period Jan 3 – Feb 6, 2020.

An overview of the data from the mobility data can be found here.

The code below loads the most recent version of the Covid-19 Google Community Mobility Reports as a pandas DataFrame object and assign the object to a variable called ‘mobility_trends’

mobility_trends = pd.read_csv(
    "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv",
    parse_dates=["date"],
)

mobility_trends

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
0	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-15	0.0	4.0	5.0	0.0	2.0	1.0
1	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-16	1.0	4.0	4.0	1.0	2.0	1.0
2	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-17	-1.0	1.0	5.0	1.0	2.0	1.0
3	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-18	-2.0	1.0	5.0	0.0	2.0	1.0
4	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-19	-2.0	0.0	4.0	-1.0	2.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9900501	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-10	NaN	NaN	NaN	NaN	126.0	NaN
9900502	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-11	NaN	NaN	NaN	NaN	129.0	NaN
9900503	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-12	NaN	NaN	NaN	NaN	116.0	NaN
9900504	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-13	NaN	NaN	NaN	NaN	118.0	NaN
9900505	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-16	NaN	NaN	NaN	NaN	100.0	NaN

9900506 rows × 15 columns

Pandas load the data as a DataFrame object. We can confirm this by using the built-in function type():

type(mobility_trends)

pandas.core.frame.DataFrame

What is a DataFrame?¶

A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.

Viewing, Describing, and Accessing your Data¶

Once we have our data as a DataFrame data structure, we can view, describe, and access the data by applying basic functionalities associated with that structure. For example, we can perform various operations by using the so called methods. Examples of pandas methods are head() and tail(); head() displays by default the top five rows of the data and tail() displays by default the last five rows. You can display a custom number of row by passing that number in the brackets.

View your data¶

# Show the first five rows using the method DataFrame.head()
mobility_trends.head()

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
0	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-15	0.0	4.0	5.0	0.0	2.0	1.0
1	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-16	1.0	4.0	4.0	1.0	2.0	1.0
2	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-17	-1.0	1.0	5.0	1.0	2.0	1.0
3	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-18	-2.0	1.0	5.0	0.0	2.0	1.0
4	AE	United Arab Emirates	NaN	NaN	NaN	NaN	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM	2020-02-19	-2.0	0.0	4.0	-1.0	2.0	1.0

# Show the last five rows using the method DataFrame.tail()
mobility_trends.tail()

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
9900501	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-10	NaN	NaN	NaN	NaN	126.0	NaN
9900502	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-11	NaN	NaN	NaN	NaN	129.0	NaN
9900503	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-12	NaN	NaN	NaN	NaN	116.0	NaN
9900504	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-13	NaN	NaN	NaN	NaN	118.0	NaN
9900505	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-16	NaN	NaN	NaN	NaN	100.0	NaN

# Specify the number of rows to return
mobility_trends.tail(10)

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
9900496	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-03	NaN	NaN	NaN	NaN	119.0	NaN
9900497	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-04	NaN	NaN	NaN	NaN	116.0	NaN
9900498	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-05	NaN	NaN	NaN	NaN	110.0	NaN
9900499	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-06	NaN	NaN	NaN	NaN	124.0	NaN
9900500	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-09	NaN	NaN	NaN	NaN	98.0	NaN
9900501	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-10	NaN	NaN	NaN	NaN	126.0	NaN
9900502	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-11	NaN	NaN	NaN	NaN	129.0	NaN
9900503	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-12	NaN	NaN	NaN	NaN	116.0	NaN
9900504	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-13	NaN	NaN	NaN	NaN	118.0	NaN
9900505	ZW	Zimbabwe	Midlands Province	Kwekwe	NaN	NaN	NaN	ChIJRcIZ3-FJNBkRRsj55IcLpfU	2022-05-16	NaN	NaN	NaN	NaN	100.0	NaN

Describing your DataFrame¶

# Dimensionality — number of rows and columns — of a DataFrame
mobility_trends.shape

(9900506, 15)

# Use the print function to display the number of rows and columns
# in a DataFrame at the time of cell execution.
# Import the 'datetime' module (used below to display the current date)
import datetime

print(
    "\nAs of",
    datetime.datetime.now(),
    "The Google COVID-19 Community Mobility Reports contain",
    mobility_trends.shape[0],
    "rows and",
    mobility_trends.shape[1],
    "columns.",
)

As of 2022-05-19 17:59:52.163517 The Google COVID-19 Community Mobility Reports contain 9900506 rows and 15 columns.

# Accessing columns using the DataFrame.columns attribute
mobility_trends.columns

Index(['country_region_code', 'country_region', 'sub_region_1', 'sub_region_2',
       'metro_area', 'iso_3166_2_code', 'census_fips_code', 'place_id', 'date',
       'retail_and_recreation_percent_change_from_baseline',
       'grocery_and_pharmacy_percent_change_from_baseline',
       'parks_percent_change_from_baseline',
       'transit_stations_percent_change_from_baseline',
       'workplaces_percent_change_from_baseline',
       'residential_percent_change_from_baseline'],
      dtype='object')

Let’s display a concise summary of our DataFrame using the method info()

# Information about a DataFrame
mobility_trends.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9900506 entries, 0 to 9900505
Data columns (total 15 columns):
 #   Column                                              Dtype         
---  ------                                              -----         
 0   country_region_code                                 object        
 1   country_region                                      object        
 2   sub_region_1                                        object        
 3   sub_region_2                                        object        
 4   metro_area                                          object        
 5   iso_3166_2_code                                     object        
 6   census_fips_code                                    float64       
 7   place_id                                            object        
 8   date                                                datetime64[ns]
 9   retail_and_recreation_percent_change_from_baseline  float64       
 10  grocery_and_pharmacy_percent_change_from_baseline   float64       
 11  parks_percent_change_from_baseline                  float64       
 12  transit_stations_percent_change_from_baseline       float64       
 13  workplaces_percent_change_from_baseline             float64       
 14  residential_percent_change_from_baseline            float64       
dtypes: datetime64[ns](1), float64(7), object(7)
memory usage: 1.1+ GB

Among other information, displayed are labels of the columns and the data type of each column. For example ‘float64’ refers to a floating point number, i.e., a number with a decimal point. In the Community Mobility Reports, floating point numbers are used to display the percentage mobility change for each category of places.

Accessing columns and rows in your data¶

Accessing columns¶

You can can access columns via column name and column position.

Accessing columns via column name

Let’s access a single column.

# Get the country column and save it to its own variable
# The double square bracket option `[[]]` gives DataFrame

mobility_trends_countries = mobility_trends[["country_region"]]
mobility_trends_countries.head()

	country_region
0	United Arab Emirates
1	United Arab Emirates
2	United Arab Emirates
3	United Arab Emirates
4	United Arab Emirates

# Display the type of data structure

type(mobility_trends_countries)

pandas.core.frame.DataFrame

# The single square braket `[]` option gives Series
# pandas Series is a one-dimensional data structure or simply put, a single column

mobility_trends_countries_series = mobility_trends["country_region"].head()

# Display the type of data structure
type(mobility_trends_countries_series)

pandas.core.series.Series

Let’s now access multiple columns. We can access multiple columns by using a Python list. We use square brakets to create a list, for example ['country_region', 'sub_region_1', 'date']. Python lists are useful as they store multiple items in the same variable.

mobility_trends[["country_region", "sub_region_1", "date"]].head()

	country_region	sub_region_1	date
0	United Arab Emirates	NaN	2020-02-15
1	United Arab Emirates	NaN	2020-02-16
2	United Arab Emirates	NaN	2020-02-17
3	United Arab Emirates	NaN	2020-02-18
4	United Arab Emirates	NaN	2020-02-19

Accessing columns via column position

# Accessing columns 1, 2, adn 7 via column position and show the top five rows

mobility_trends.iloc[:, [1, 2, 7]].head()

	country_region	sub_region_1	place_id
0	United Arab Emirates	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM
1	United Arab Emirates	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM
2	United Arab Emirates	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM
3	United Arab Emirates	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM
4	United Arab Emirates	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM

# Accessing a subset of rows and columns. We access the top 3 rows and columns 1, 2, and 7

mobility_trends.iloc[0:3, [1, 2, 7]].head()

	country_region	sub_region_1	place_id
0	United Arab Emirates	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM
1	United Arab Emirates	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM
2	United Arab Emirates	NaN	ChIJvRKrsd9IXj4RpwoIwFYv0zM

Hands-on exercise¶

Create a new code cell.
Access the first 100 rows and the following columns from the DataFrame mobility_trends:
- country_region_code
- date
- retail_recreation
Assign the output to a variable called mobility_trends_subset.

Accessing rows¶

Rows can be accessed via row labels df.loc and row index df.iloc.

# Before accessing particular rows, let's see the names of all countries in the dataset.
# We can achieve that by listing all unique values in the 'country_region' column.
mobility_trends.country_region.unique()

array(['United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda',
       'Angola', 'Argentina', 'Austria', 'Australia', 'Aruba',
       'Bosnia and Herzegovina', 'Barbados', 'Bangladesh', 'Belgium',
       'Burkina Faso', 'Bulgaria', 'Bahrain', 'Benin', 'Bolivia',
       'Brazil', 'The Bahamas', 'Botswana', 'Belarus', 'Belize', 'Canada',
       'Switzerland', "Côte d'Ivoire", 'Chile', 'Cameroon', 'Colombia',
       'Costa Rica', 'Cape Verde', 'Czechia', 'Germany', 'Denmark',
       'Dominican Republic', 'Ecuador', 'Estonia', 'Egypt', 'Spain',
       'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia',
       'Ghana', 'Greece', 'Guatemala', 'Guinea-Bissau', 'Hong Kong',
       'Honduras', 'Croatia', 'Haiti', 'Hungary', 'Indonesia', 'Ireland',
       'Israel', 'India', 'Iraq', 'Italy', 'Jamaica', 'Jordan', 'Japan',
       'Kenya', 'Kyrgyzstan', 'Cambodia', 'South Korea', 'Kuwait',
       'Kazakhstan', 'Laos', 'Lebanon', 'Liechtenstein', 'Sri Lanka',
       'Lithuania', 'Luxembourg', 'Latvia', 'Libya', 'Morocco', 'Moldova',
       'North Macedonia', 'Mali', 'Myanmar (Burma)', 'Mongolia', 'Malta',
       'Mauritius', 'Mexico', 'Malaysia', 'Mozambique', 'Namibia',
       'Niger', 'Nigeria', 'Nicaragua', 'Netherlands', 'Norway', 'Nepal',
       'New Zealand', 'Oman', 'Panama', 'Peru', 'Papua New Guinea',
       'Philippines', 'Pakistan', 'Poland', 'Puerto Rico', 'Portugal',
       'Paraguay', 'Qatar', 'Réunion', 'Romania', 'Serbia', 'Russia',
       'Rwanda', 'Saudi Arabia', 'Sweden', 'Singapore', 'Slovenia',
       'Slovakia', 'Senegal', 'El Salvador', 'Togo', 'Thailand',
       'Tajikistan', 'Turkey', 'Trinidad and Tobago', 'Taiwan',
       'Tanzania', 'Ukraine', 'Uganda', 'United States', 'Uruguay',
       'Venezuela', 'Vietnam', 'Yemen', 'South Africa', 'Zambia',
       'Zimbabwe'], dtype=object)

# Accessing specific rows from a DataFrame.
# We are interested in the data about the United Kingdom.

mobility_trends_UK = mobility_trends[
    mobility_trends["country_region"] == "United Kingdom"
]

mobility_trends_UK.head()

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
3647576	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-15	-12.0	-7.0	-35.0	-12.0	-4.0	2.0
3647577	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-16	-7.0	-6.0	-28.0	-7.0	-3.0	1.0
3647578	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-17	10.0	1.0	24.0	-2.0	-14.0	2.0
3647579	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-18	7.0	-1.0	20.0	-3.0	-14.0	2.0
3647580	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-19	6.0	-2.0	8.0	-4.0	-14.0	3.0

# Accessing data about multiple countries using the isin() method.
mobility_trends_countries = mobility_trends[
    mobility_trends["country_region"].isin(
        ["United Kingdom", "Germany", "Italy", "Sweden"]
    )
]

mobility_trends_countries.head()

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
3174838	DE	Germany	NaN	NaN	NaN	NaN	NaN	ChIJa76xwh5ymkcRW-WRjmtd6HU	2020-02-15	6.0	1.0	45.0	10.0	0.0	-1.0
3174839	DE	Germany	NaN	NaN	NaN	NaN	NaN	ChIJa76xwh5ymkcRW-WRjmtd6HU	2020-02-16	7.0	10.0	9.0	6.0	-1.0	0.0
3174840	DE	Germany	NaN	NaN	NaN	NaN	NaN	ChIJa76xwh5ymkcRW-WRjmtd6HU	2020-02-17	2.0	2.0	7.0	1.0	-2.0	0.0
3174841	DE	Germany	NaN	NaN	NaN	NaN	NaN	ChIJa76xwh5ymkcRW-WRjmtd6HU	2020-02-18	2.0	2.0	10.0	1.0	-1.0	1.0
3174842	DE	Germany	NaN	NaN	NaN	NaN	NaN	ChIJa76xwh5ymkcRW-WRjmtd6HU	2020-02-19	3.0	0.0	6.0	-1.0	-1.0	1.0

# Filter by two conditions — country and county — simultenously
# First let's see the list of counties in the dataset

mobility_trends[
    mobility_trends["country_region"] == "United Kingdom"
].sub_region_1.unique()

array([nan, 'Aberdeen City', 'Aberdeenshire', 'Angus Council',
       'Antrim and Newtownabbey', 'Ards and North Down',
       'Argyll and Bute Council', 'Armagh City, Banbridge and Craigavon',
       'Bath and North East Somerset', 'Bedford', 'Belfast',
       'Blackburn with Darwen', 'Blackpool', 'Blaenau Gwent',
       'Borough of Halton', 'Bracknell Forest', 'Bridgend County Borough',
       'Brighton and Hove', 'Bristol City', 'Buckinghamshire',
       'Caerphilly County Borough', 'Cambridgeshire', 'Cardiff',
       'Carmarthenshire', 'Causeway Coast and Glens',
       'Central Bedfordshire', 'Ceredigion', 'Cheshire East',
       'Cheshire West and Chester', 'Clackmannanshire',
       'Conwy Principal Area', 'Cornwall', 'County Durham', 'Cumbria',
       'Darlington', 'Denbighshire', 'Derby', 'Derbyshire',
       'Derry and Strabane', 'Devon', 'Dorset', 'Dumfries and Galloway',
       'Dundee City Council', 'East Ayrshire Council',
       'East Dunbartonshire Council', 'East Lothian Council',
       'East Renfrewshire Council', 'East Riding of Yorkshire',
       'East Sussex', 'Edinburgh', 'Essex', 'Falkirk',
       'Fermanagh and Omagh', 'Fife', 'Flintshire', 'Glasgow City',
       'Gloucestershire', 'Greater London', 'Greater Manchester',
       'Gwynedd', 'Hampshire', 'Hartlepool', 'Herefordshire',
       'Hertfordshire', 'Highland Council', 'Inverclyde',
       'Isle of Anglesey', 'Isle of Wight', 'Kent', 'Kingston upon Hull',
       'Lancashire', 'Leicester', 'Leicestershire', 'Lincolnshire',
       'Lisburn and Castlereagh', 'Luton', 'Medway', 'Merseyside',
       'Merthyr Tydfil County Borough', 'Mid and East Antrim',
       'Mid Ulster', 'Middlesbrough', 'Midlothian', 'Milton Keynes',
       'Monmouthshire', 'Moray', 'Na h-Eileanan an Iar',
       'Neath Port Talbot Principle Area', 'Newport',
       'Newry, Mourne and Down', 'Norfolk', 'North Ayrshire Council',
       'North East Lincolnshire', 'North Lanarkshire',
       'North Lincolnshire', 'North Somerset', 'North Yorkshire',
       'Northamptonshire', 'Northumberland', 'Nottingham',
       'Nottinghamshire', 'Orkney', 'Oxfordshire', 'Pembrokeshire',
       'Perth and Kinross', 'Peterborough', 'Plymouth', 'Portsmouth',
       'Powys', 'Reading', 'Redcar and Cleveland', 'Renfrewshire',
       'Rhondda Cynon Taff', 'Rutland', 'Scottish Borders',
       'Shetland Islands', 'Shropshire', 'Slough', 'Somerset',
       'South Ayrshire Council', 'South Gloucestershire',
       'South Lanarkshire', 'South Yorkshire', 'Southampton',
       'Southend-on-Sea', 'Staffordshire', 'Stirling', 'Stockton-on-Tees',
       'Stoke-on-Trent', 'Suffolk', 'Surrey', 'Swansea', 'Swindon',
       'Thurrock', 'Torbay', 'Torfaen Principal Area', 'Tyne and Wear',
       'Vale of Glamorgan', 'Warrington', 'Warwickshire',
       'West Berkshire', 'West Dunbartonshire Council', 'West Lothian',
       'West Midlands', 'West Sussex', 'West Yorkshire', 'Wiltshire',
       'Windsor and Maidenhead', 'Wokingham', 'Worcestershire',
       'Wrexham Principal Area', 'York'], dtype=object)

# Access data about UK and Essex
mobility_trends_UK_Essex = mobility_trends[
    (mobility_trends["country_region"] == "United Kingdom")
    & (mobility_trends["sub_region_1"] == "Essex")
]

mobility_trends_UK_Essex.head()

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
3720375	GB	United Kingdom	Essex	NaN	NaN	GB-ESS	NaN	ChIJ0w2H_idW2EcReVDuRzjLV0I	2020-02-15	-9.0	-7.0	-27.0	-11.0	-3.0	2.0
3720376	GB	United Kingdom	Essex	NaN	NaN	GB-ESS	NaN	ChIJ0w2H_idW2EcReVDuRzjLV0I	2020-02-16	-16.0	-12.0	-45.0	-9.0	-5.0	2.0
3720377	GB	United Kingdom	Essex	NaN	NaN	GB-ESS	NaN	ChIJ0w2H_idW2EcReVDuRzjLV0I	2020-02-17	9.0	-1.0	23.0	-1.0	-16.0	3.0
3720378	GB	United Kingdom	Essex	NaN	NaN	GB-ESS	NaN	ChIJ0w2H_idW2EcReVDuRzjLV0I	2020-02-18	10.0	-4.0	15.0	0.0	-16.0	2.0
3720379	GB	United Kingdom	Essex	NaN	NaN	GB-ESS	NaN	ChIJ0w2H_idW2EcReVDuRzjLV0I	2020-02-19	9.0	-3.0	10.0	1.0	-16.0	2.0

In the above example, you could simply type in

mobility_trends[mobility_trends['sub_region_1']=='Essex']

The command will return the same result as there is no region called ‘Essex’ outside the UK. However, there are instances in the data where regions in different countries have the same name. When this is the case, not specifying the country would return confusing outputs.

Accessing multiple rows and columns and conditioning¶

mobility_trends_UK

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
3647576	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-15	-12.0	-7.0	-35.0	-12.0	-4.0	2.0
3647577	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-16	-7.0	-6.0	-28.0	-7.0	-3.0	1.0
3647578	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-17	10.0	1.0	24.0	-2.0	-14.0	2.0
3647579	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-18	7.0	-1.0	20.0	-3.0	-14.0	2.0
3647580	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-19	6.0	-2.0	8.0	-4.0	-14.0	3.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3990325	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-12	-3.0	10.0	91.0	-20.0	-25.0	5.0
3990326	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-13	-9.0	8.0	71.0	-12.0	-26.0	5.0
3990327	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-14	-15.0	6.0	23.0	-7.0	-5.0	-1.0
3990328	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-15	-8.0	9.0	7.0	-10.0	-5.0	0.0
3990329	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-16	-5.0	9.0	12.0	-24.0	-30.0	6.0

342754 rows × 15 columns

# Let's see which UK counties had the lower retail and recreation mobility
# the day after Italy went in lockdown on 9 March 2020
mobility_trends_UK_10MARCH2020 = mobility_trends_UK.loc[
    (mobility_trends_UK["date"] == "2020-03-10")
    & (mobility_trends_UK["retail_and_recreation_percent_change_from_baseline"] < 0),
    ["sub_region_1", "retail_and_recreation_percent_change_from_baseline"],
]

# Sort in decreasing order
mobility_trends_UK_10MARCH2020.sort_values(
    by="retail_and_recreation_percent_change_from_baseline", ascending=True
)

	sub_region_1	retail_and_recreation_percent_change_from_baseline
3739259	Gloucestershire	-16.0
3741716	Greater London	-14.0
3763910	Greater London	-13.0
3742538	Greater London	-12.0
3758978	Greater London	-11.0
...	...	...
3948491	Tyne and Wear	-1.0
3885110	Nottinghamshire	-1.0
3772952	Greater Manchester	-1.0
3930454	Suffolk	-1.0
3803289	Kent	-1.0

167 rows × 2 columns

# UK counties with the lower retail and recreation
# the day after UK went in lockdown on 23 March 2020
mobility_trends_UK_24MARCH2020 = mobility_trends_UK.loc[
    (mobility_trends_UK["date"] == "2020-03-24")
    & (mobility_trends_UK["retail_and_recreation_percent_change_from_baseline"] < 0),
    ["sub_region_1", "retail_and_recreation_percent_change_from_baseline"],
]

# Sort in decreasing order
mobility_trends_UK_24MARCH2020.sort_values(
    by="retail_and_recreation_percent_change_from_baseline", ascending=True
)

	sub_region_1	retail_and_recreation_percent_change_from_baseline
3741730	Greater London	-94.0
3742552	Greater London	-92.0
3747484	Greater London	-87.0
3767212	Greater London	-84.0
3756526	Greater London	-83.0
...	...	...
3969853	West Sussex	-56.0
3686799	Cumbria	-56.0
3808235	Kent	-54.0
3809879	Kent	-54.0
3842701	Merseyside	-50.0

417 rows × 2 columns

Let’s identify the places in the UK with the highest reduction of workplaces mobility.

# We need to first find the index of the row
# with the highest reduction of workplaces mobility
mobility_trends_UK_workplaces = mobility_trends_UK[
    "workplaces_percent_change_from_baseline"
].idxmin()

mobility_trends_UK_workplaces

# Use the above index to access the row
# with the highest reduction of workplaces mobility
mobility_trends.iloc[[mobility_trends_UK_workplaces]]

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
3756546	GB	United Kingdom	Greater London	London Borough of Islington	NaN	GB-ISL	NaN	ChIJ5Tvr7V0bdkgRyWXsGMuF8FQ	2020-04-13	-88.0	-51.0	-28.0	-90.0	-91.0	40.0

In our basic exploratory data analysis, we found that as of 21 April 2021 the largest percentage mobility change in workplaces was in the London Borough of Islington, Greater London, on 13 April 2020 when the workplaces mobility was reduced with 91 per cent compared to a baseline period. This result is only for exploratory purposes. Note that the same mobility in two different days would result in different percentage changes due to the way in which actual mobility is compared to a baseline in the Google Covid-19 Community Mobility Reports.

Handling Missing Values¶

Many data sets have missing values and the mobility data is not an exception. Handling missing values is a very common problem in data analysis. We will cover two simple ways of handling missing values in data:

Deletion — missing observations are simply dropped from the data set.
- Likewise - removes all data for an observation that has one or more missing values.
- Pairwise — removes observations with missing values only for the variables used in the current procedure (e.g., model or technique).
Imputation — missing observations are replaced with inferred values. We will cover simple imputation methods in which missing values are replaced with column mean or row mean.

For more advanced methods and functionality, including interpolation methods, see the section ‘Working with missing data’ in Pandas documentation.

Detecting missing values¶

In pandas, missing values are represented as NaN, which stands for Not a Number. To detect missing values, we can use the function isna() (other functions, for example isnull() can be used to the same effect). The function returns boolean values (False, True) indicating whether each corresponding value in a DataFrame is missing.

mobility_trends.isna()

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
0	False	False	True	True	True	True	True	False	False	False	False	False	False	False	False
1	False	False	True	True	True	True	True	False	False	False	False	False	False	False	False
2	False	False	True	True	True	True	True	False	False	False	False	False	False	False	False
3	False	False	True	True	True	True	True	False	False	False	False	False	False	False	False
4	False	False	True	True	True	True	True	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9900501	False	False	False	False	True	True	True	False	False	True	True	True	True	False	True
9900502	False	False	False	False	True	True	True	False	False	True	True	True	True	False	True
9900503	False	False	False	False	True	True	True	False	False	True	True	True	True	False	True
9900504	False	False	False	False	True	True	True	False	False	True	True	True	True	False	True
9900505	False	False	False	False	True	True	True	False	False	True	True	True	True	False	True

9900506 rows × 15 columns

Let’s determine the number of missing values for each column in the DataFrame.

mobility_trends.isna().sum()

country_region_code                                      6183
country_region                                              0
sub_region_1                                           163754
sub_region_2                                          1610543
metro_area                                            9847293
iso_3166_2_code                                       8173140
census_fips_code                                      7793263
place_id                                                18182
date                                                        0
retail_and_recreation_percent_change_from_baseline    3769374
grocery_and_pharmacy_percent_change_from_baseline     4048994
parks_percent_change_from_baseline                    5217709
transit_stations_percent_change_from_baseline         4988015
workplaces_percent_change_from_baseline                354374
residential_percent_change_from_baseline              3846476
dtype: int64

It would be useful to quantify the percentage of values that are missing per column. To achieve this, we simply divide the number of missing values in a column by the number of rows in the DataFrame (and multiply the result by 100).

mobility_trends.isna().sum() / mobility_trends.shape[0] * 100

country_region_code                                    0.062451
country_region                                         0.000000
sub_region_1                                           1.653996
sub_region_2                                          16.267280
metro_area                                            99.462522
iso_3166_2_code                                       82.552750
census_fips_code                                      78.715805
place_id                                               0.183647
date                                                   0.000000
retail_and_recreation_percent_change_from_baseline    38.072539
grocery_and_pharmacy_percent_change_from_baseline     40.896839
parks_percent_change_from_baseline                    52.701438
transit_stations_percent_change_from_baseline         50.381415
workplaces_percent_change_from_baseline                3.579352
residential_percent_change_from_baseline              38.851307
dtype: float64

Let’s now repeat the procedure for the UK mobility data only:

mobility_trends_UK.isna().sum() / mobility_trends_UK.shape[0] * 100

country_region_code                                     0.000000
country_region                                          0.000000
sub_region_1                                            0.239822
sub_region_2                                           36.107237
metro_area                                            100.000000
iso_3166_2_code                                        49.442166
census_fips_code                                      100.000000
place_id                                                0.000000
date                                                    0.000000
retail_and_recreation_percent_change_from_baseline      2.760289
grocery_and_pharmacy_percent_change_from_baseline       2.684724
parks_percent_change_from_baseline                     21.118353
transit_stations_percent_change_from_baseline           3.234390
workplaces_percent_change_from_baseline                 0.518739
residential_percent_change_from_baseline                5.692713
dtype: float64

The output shows that the missing values in most mobility categories are below 5 per cent except the missing values for residential mobility (8%) and parks mobility (21.7%).

Removing missing values¶

We can remove missing values using the Pandas function dropna(). The default functionality removes rows that contain at least one missing value. The dropna() function provides other capabilities as well. For example, it is possible to remove columns (instead of rows) that contain missing values by specifying axis = 1 within the brackets. Also, we can specify under what conditions rows or columns are removed: use how = any if you would like to drop rows or columns that contain any missing value, which is the default functionality; use how = all if you would like to drop rows or columns in which all values are missing.

mobility_trends_UK.dropna()

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline

We did not specify particular columns on which to perform the dropna() function. Because we have a column (‘metro_area’) for which all values are missing, the dropna() function removed all observations from the DataFrame. This is analogous to a likewise deletion of missing values.

A more informative approach would be to specify the columns we are going to use in our analysis and drop missing values only for this subset of variables. We can drop rows that contain any missing values in the six mobility categories by passing the labels of those categories to the parameter subset as shown below:

mobility_trends_UK.dropna(subset=mobility_trends_UK.iloc[:, 9:15].columns)

	country_region_code	country_region	sub_region_1	sub_region_2	metro_area	iso_3166_2_code	census_fips_code	place_id	date	retail_and_recreation_percent_change_from_baseline	grocery_and_pharmacy_percent_change_from_baseline	parks_percent_change_from_baseline	transit_stations_percent_change_from_baseline	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
3647576	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-15	-12.0	-7.0	-35.0	-12.0	-4.0	2.0
3647577	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-16	-7.0	-6.0	-28.0	-7.0	-3.0	1.0
3647578	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-17	10.0	1.0	24.0	-2.0	-14.0	2.0
3647579	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-18	7.0	-1.0	20.0	-3.0	-14.0	2.0
3647580	GB	United Kingdom	NaN	NaN	NaN	NaN	NaN	ChIJqZHHQhE7WgIReiWIMkOg-MQ	2020-02-19	6.0	-2.0	8.0	-4.0	-14.0	3.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3990325	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-12	-3.0	10.0	91.0	-20.0	-25.0	5.0
3990326	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-13	-9.0	8.0	71.0	-12.0	-26.0	5.0
3990327	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-14	-15.0	6.0	23.0	-7.0	-5.0	-1.0
3990328	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-15	-8.0	9.0	7.0	-10.0	-5.0	0.0
3990329	GB	United Kingdom	York	NaN	NaN	GB-YOR	NaN	ChIJh-IigLwxeUgRAKFv7Z75DAM	2022-05-16	-5.0	9.0	12.0	-24.0	-30.0	6.0

255595 rows × 15 columns

mobility_trends_UK.shape

(342754, 15)

After removing the rows that contain missing values in the six mobility categories/columns, we are left with 137992 rows out of 191064 rows, a reduction of approximately 28%.

Replacing missing values¶

Let’s replace NaN values with the mean of non-missing values. For example, we can replace missing values in the columns ‘workplaces_percent_change_from_baseline’ and ‘residential_percent_change_from_baseline’ by the mean of values in the two columns as determined by the Pandas mean() method.

# Replace the NaNs in the columns 'workplaces_percent_change_from_baseline'
# and 'residential_percent_change_from_baseline'
# by the mean of the non-missing values.
mobility_trends_UK[
    [
        "workplaces_percent_change_from_baseline",
        "residential_percent_change_from_baseline",
    ]
].fillna(
    value=mobility_trends_UK[
        [
            "workplaces_percent_change_from_baseline",
            "residential_percent_change_from_baseline",
        ]
    ].mean()
)

	workplaces_percent_change_from_baseline	residential_percent_change_from_baseline
3647576	-4.0	2.0
3647577	-3.0	1.0
3647578	-14.0	2.0
3647579	-14.0	2.0
3647580	-14.0	3.0
...	...	...
3990325	-25.0	5.0
3990326	-26.0	5.0
3990327	-5.0	-1.0
3990328	-5.0	0.0
3990329	-30.0	6.0

342754 rows × 2 columns

We now have the same number of rows (191064) as the original mobility_trends_UK DataFrame, with each missing value replaced by the mean value for the respective column. More advanced (and practical) approaches of replacing missing values can take into account groupings in the data and replace missing values with the mean value of a group of counties or the mean value of a time period; those operations would require more advanced Pandas functionality such as groupby(), which we will use in the next lab.

Compare mobility data formats¶

A related aggregate, privacy-preserving, and freely available mobility data set is the Apple’s Mobility Trends Reports. The data is updated daily and reflect in aggregate requests for directions in Apple Maps for select countries/regions, sub-regions and cities.

The Mobility Trends Reports data can be downloaded from the website https://covid19.apple.com/mobility. However, because the data file is not at a stable location (URL), we cannot load the data from the website but instead access the data from a COVID-19 mobility data aggregator on GitHub.

mobility_trends_covid_apple = pd.read_csv(
    "https://raw.githubusercontent.com/ActiveConclusion/COVID19_mobility/master/apple_reports/applemobilitytrends.csv"
)
mobility_trends_covid_apple.head()

	geo_type	region	transportation_type	alternative_name	sub-region	country	2020-01-13	2020-01-14	2020-01-15	2020-01-16	...	2022-04-03	2022-04-04	2022-04-05	2022-04-06	2022-04-07	2022-04-08	2022-04-09	2022-04-10	2022-04-11	2022-04-12
0	country/region	Albania	driving	NaN	NaN	NaN	100.0	95.30	101.43	97.20	...	149.58	156.76	156.76	151.02	157.86	185.12	196.62	173.11	173.21	171.09
1	country/region	Albania	walking	NaN	NaN	NaN	100.0	100.68	98.93	98.46	...	161.71	215.31	221.62	211.34	189.25	244.22	231.90	154.93	223.97	223.45
2	country/region	Argentina	driving	NaN	NaN	NaN	100.0	97.07	102.45	111.21	...	93.32	97.74	102.63	106.26	111.81	145.45	158.56	99.20	100.82	114.56
3	country/region	Argentina	walking	NaN	NaN	NaN	100.0	95.11	101.37	112.67	...	88.58	126.43	133.06	135.64	142.40	178.26	172.83	98.96	126.57	139.29
4	country/region	Australia	driving	AU	NaN	NaN	100.0	102.98	104.21	108.63	...	123.95	116.32	132.29	134.97	145.09	146.34	125.79	136.60	139.01	146.75

5 rows × 827 columns

Activity: Discussing data formats

Compare the format of the data in the Apple Mobility Trends Reports to the format of the data in the Google COVID-19 Community Mobility Reports and in Our World in Data COVID-19 data. Focus in particular on similarities and differences in how time series data is represented across data sets.

Recording dependencies¶

# Install the watermark extension
!pip install -q watermark

# Load the watermark extension
%load_ext watermark

# Show packages that were imported
%watermark --iversions

pandas: 1.1.3

Save a list of the packages (and their versions) used in the current notebook to a file named requirements.txt.

# Run pipreqsnb after specifying the path to the notebook
!pipreqsnb ../notebooks/04_data_design_and_data_wrangling.ipynb

pipreqs  --savepath ../notebooks/requirements.txt .//__temp_pipreqsnb_folder/
INFO: Successfully saved requirements file in ../notebooks/requirements.txt