Data Design & Data Wrangling¶
‘Data wrangling’ is the mundane labor of collecting, loading, transforming, and preparing data for exploration and analysis. Although it is hard to put a number on it, estimates suggest that data scientists spend between 50 and 80 per cent of their time on data wrangling. The importance of data wrangling has been recognised in the social sciences as well.
Although data wrangling may seem atheoretical, some of the decisions you make will in fact depend on how the data were generated or designed.
In this lab, we will first discuss issues related to data design and will then practice hands-on data wrangling with pandas and other Python libraries for data analysis. We will wrangle open data sets, including the Google COVID-19 Community Mobility Reports and the Apple Mobility Trends Reports.
Key themes¶
What are data? How are data generated (i.e., data design)?
Data wrangling with pandas: loading, selecting, and transforming data.
Learning resources¶
Matthew Salganik. 2.3 Ten common characteristics of big data. In Bit by Bit: Social Research in the Digital Age. [Online version freely available]
Lisa Gitelman and Virginia Jackson. “Raw Data” Is an Oxymoron. Introduction. MIT Press.
Sabina Leonelli. Scientific Research and Big Data. The Stanford Encyclopedia of Philosophy.
Sam Lau, Joey Gonzalez, and Deb Nolan. The Data Science Lifecycle. In Principles and Techniques of Data Science.
Steve Lohr. For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. New York Times.
Alexander Kindel et al. Improving Metadata Infrastructure for Complex Surveys: Insights from the Fragile Families Challenge. Socius.
Introduction to Data Processing in Python with Pandas, SciPy 2019 Tutorial by Daniel Chen. See also freely available Jupyter notebooks: 01-intro.ipynb and 02-tidy.ipynb.
Activity: Discussing data biases
After reading the section Ten common characteristics of big data in Matthew Salganik’s book Bit by Bit, we will discuss how each characteristic of big data applies to the digital data sets we are analysing in this and the following sections, including the Google COVID-19 Community Mobility Reports and the Apple Mobility Trends Reports.
Get started with pandas¶
The pandas library:
is a fast, powerful, and flexible open source tool for doing real world data analysis in Python.
offers a diverse range of high-performance tools for data loading, cleaning, wrangling, merging, reshaping, and summarising.
is the go-to data science library in Python.
Importing pandas¶
To use pandas, we first import the library via Python’s import command and give it the alias pd:
import pandas as pd
pd.__version__
# Suppress warnings to avoid potential confusion
import warnings
warnings.filterwarnings("ignore")
Loading your data¶
Pandas supports many data file formats, including CSV, Excel, SQL, and JSON.
For details, see How do I read and write tabular data? For example, once you import pandas as pd, you can load a .csv file using the function pd.read_csv(), an Excel file using the function pd.read_excel(), and a JSON file using the function pd.read_json().
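As a quick illustration, the readers can be called as follows; the file names here are placeholders rather than files provided with this lab:
# Placeholder file names for illustration only
df_from_csv = pd.read_csv("data.csv")       # comma-separated values
df_from_excel = pd.read_excel("data.xlsx")  # Excel workbook (needs an engine such as openpyxl)
df_from_json = pd.read_json("data.json")    # JSON file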
Loading Covid-19 data sets via URL¶
What are the Google Covid-19 Community Mobility Reports?
Aggregated and anonymized sets of freely available data that protect individual privacy.
Shows trends of human mobility over time by country and region, across six categories of places, including retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.
The data show how visitors to (or time spent in) categorized places change compared to baseline days. A baseline day represents a normal value for that day of the week: it is the median value for that weekday over the 5‑week period Jan 3 – Feb 6, 2020 (a small worked example follows after this list).
An overview of the mobility data can be found here.
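To make the baseline comparison concrete, here is a small worked example with made-up numbers (not taken from the reports):
# Made-up numbers: suppose the baseline (median) number of visits for a
# Tuesday is 200, and a particular Tuesday during the pandemic saw 150 visits
baseline_visits = 200
observed_visits = 150
percent_change = (observed_visits - baseline_visits) / baseline_visits * 100
print(percent_change)  # -25.0, i.e., a 25 per cent reduction relative to baseline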
The code below loads the most recent version of the Google COVID-19 Community Mobility Reports as a pandas DataFrame object and assigns the object to a variable called ‘mobility_trends’:
mobility_trends = pd.read_csv(
"https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv",
parse_dates=["date"],
)
mobility_trends
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-15 | 0.0 | 4.0 | 5.0 | 0.0 | 2.0 | 1.0 |
1 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-16 | 1.0 | 4.0 | 4.0 | 1.0 | 2.0 | 1.0 |
2 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-17 | -1.0 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 |
3 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-18 | -2.0 | 1.0 | 5.0 | 0.0 | 2.0 | 1.0 |
4 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-19 | -2.0 | 0.0 | 4.0 | -1.0 | 2.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9900501 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-10 | NaN | NaN | NaN | NaN | 126.0 | NaN |
9900502 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-11 | NaN | NaN | NaN | NaN | 129.0 | NaN |
9900503 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-12 | NaN | NaN | NaN | NaN | 116.0 | NaN |
9900504 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-13 | NaN | NaN | NaN | NaN | 118.0 | NaN |
9900505 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-16 | NaN | NaN | NaN | NaN | 100.0 | NaN |
9900506 rows × 15 columns
Pandas loads the data as a DataFrame object. We can confirm this by using the built-in function type():
type(mobility_trends)
pandas.core.frame.DataFrame
Viewing, Describing, and Accessing your Data¶
Once we have our data as a DataFrame data structure, we can view, describe, and access the data by applying basic functionalities associated with that structure. For example, we can perform various operations by using the so-called methods. Examples of pandas methods are head() and tail(): head() displays by default the top five rows of the data and tail() displays by default the last five rows. You can display a custom number of rows by passing that number in the brackets.
View your data¶
# Show the first five rows using the method DataFrame.head()
mobility_trends.head()
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-15 | 0.0 | 4.0 | 5.0 | 0.0 | 2.0 | 1.0 |
1 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-16 | 1.0 | 4.0 | 4.0 | 1.0 | 2.0 | 1.0 |
2 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-17 | -1.0 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 |
3 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-18 | -2.0 | 1.0 | 5.0 | 0.0 | 2.0 | 1.0 |
4 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-19 | -2.0 | 0.0 | 4.0 | -1.0 | 2.0 | 1.0 |
# Show the last five rows using the method DataFrame.tail()
mobility_trends.tail()
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9900501 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-10 | NaN | NaN | NaN | NaN | 126.0 | NaN |
9900502 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-11 | NaN | NaN | NaN | NaN | 129.0 | NaN |
9900503 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-12 | NaN | NaN | NaN | NaN | 116.0 | NaN |
9900504 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-13 | NaN | NaN | NaN | NaN | 118.0 | NaN |
9900505 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-16 | NaN | NaN | NaN | NaN | 100.0 | NaN |
# Specify the number of rows to return
mobility_trends.tail(10)
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9900496 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-03 | NaN | NaN | NaN | NaN | 119.0 | NaN |
9900497 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-04 | NaN | NaN | NaN | NaN | 116.0 | NaN |
9900498 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-05 | NaN | NaN | NaN | NaN | 110.0 | NaN |
9900499 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-06 | NaN | NaN | NaN | NaN | 124.0 | NaN |
9900500 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-09 | NaN | NaN | NaN | NaN | 98.0 | NaN |
9900501 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-10 | NaN | NaN | NaN | NaN | 126.0 | NaN |
9900502 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-11 | NaN | NaN | NaN | NaN | 129.0 | NaN |
9900503 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-12 | NaN | NaN | NaN | NaN | 116.0 | NaN |
9900504 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-13 | NaN | NaN | NaN | NaN | 118.0 | NaN |
9900505 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-16 | NaN | NaN | NaN | NaN | 100.0 | NaN |
Describing your DataFrame¶
# Dimensionality — number of rows and columns — of a DataFrame
mobility_trends.shape
(9900506, 15)
# Use the print function to display the number of rows and columns
# in a DataFrame at the time of cell execution.
# Import the 'datetime' module (used below to display the current date)
import datetime
print(
"\nAs of",
datetime.datetime.now(),
"The Google COVID-19 Community Mobility Reports contain",
mobility_trends.shape[0],
"rows and",
mobility_trends.shape[1],
"columns.",
)
As of 2022-05-19 17:59:52.163517 The Google COVID-19 Community Mobility Reports contain 9900506 rows and 15 columns.
# Accessing columns using the DataFrame.columns attribute
mobility_trends.columns
Index(['country_region_code', 'country_region', 'sub_region_1', 'sub_region_2',
'metro_area', 'iso_3166_2_code', 'census_fips_code', 'place_id', 'date',
'retail_and_recreation_percent_change_from_baseline',
'grocery_and_pharmacy_percent_change_from_baseline',
'parks_percent_change_from_baseline',
'transit_stations_percent_change_from_baseline',
'workplaces_percent_change_from_baseline',
'residential_percent_change_from_baseline'],
dtype='object')
Let’s display a concise summary of our DataFrame using the method info():
# Information about a DataFrame
mobility_trends.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9900506 entries, 0 to 9900505
Data columns (total 15 columns):
# Column Dtype
--- ------ -----
0 country_region_code object
1 country_region object
2 sub_region_1 object
3 sub_region_2 object
4 metro_area object
5 iso_3166_2_code object
6 census_fips_code float64
7 place_id object
8 date datetime64[ns]
9 retail_and_recreation_percent_change_from_baseline float64
10 grocery_and_pharmacy_percent_change_from_baseline float64
11 parks_percent_change_from_baseline float64
12 transit_stations_percent_change_from_baseline float64
13 workplaces_percent_change_from_baseline float64
14 residential_percent_change_from_baseline float64
dtypes: datetime64[ns](1), float64(7), object(7)
memory usage: 1.1+ GB
Among other information, the output displays the labels of the columns and the data type of each column. For example, ‘float64’ refers to a floating point number, i.e., a number with a decimal point. In the Community Mobility Reports, floating point numbers are used to record the percentage mobility change for each category of places.
Accessing columns and rows in your data¶
Accessing columns¶
You can access columns via column name and via column position.
Accessing columns via column name
Let’s access a single column.
# Get the country column and save it to its own variable
# The double square bracket option `[[]]` gives DataFrame
mobility_trends_countries = mobility_trends[["country_region"]]
mobility_trends_countries.head()
country_region | |
---|---|
0 | United Arab Emirates |
1 | United Arab Emirates |
2 | United Arab Emirates |
3 | United Arab Emirates |
4 | United Arab Emirates |
# Display the type of data structure
type(mobility_trends_countries)
pandas.core.frame.DataFrame
# The single square bracket `[]` option gives a Series
# pandas Series is a one-dimensional data structure or simply put, a single column
mobility_trends_countries_series = mobility_trends["country_region"].head()
# Display the type of data structure
type(mobility_trends_countries_series)
pandas.core.series.Series
Let’s now access multiple columns. We can access multiple columns by using a Python list. We use square brackets to create a list, for example ['country_region', 'sub_region_1', 'date']. Python lists are useful as they store multiple items in the same variable.
mobility_trends[["country_region", "sub_region_1", "date"]].head()
country_region | sub_region_1 | date | |
---|---|---|---|
0 | United Arab Emirates | NaN | 2020-02-15 |
1 | United Arab Emirates | NaN | 2020-02-16 |
2 | United Arab Emirates | NaN | 2020-02-17 |
3 | United Arab Emirates | NaN | 2020-02-18 |
4 | United Arab Emirates | NaN | 2020-02-19 |
Accessing columns via column position
# Accessing columns 1, 2, and 7 via column position and showing the top five rows
mobility_trends.iloc[:, [1, 2, 7]].head()
country_region | sub_region_1 | place_id | |
---|---|---|---|
0 | United Arab Emirates | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM |
1 | United Arab Emirates | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM |
2 | United Arab Emirates | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM |
3 | United Arab Emirates | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM |
4 | United Arab Emirates | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM |
# Accessing a subset of rows and columns. We access the top 3 rows and columns 1, 2, and 7
mobility_trends.iloc[0:3, [1, 2, 7]].head()
country_region | sub_region_1 | place_id | |
---|---|---|---|
0 | United Arab Emirates | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM |
1 | United Arab Emirates | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM |
2 | United Arab Emirates | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM |
Hands-on exercise¶
Create a new code cell.
Access the first 100 rows and the following columns from the DataFrame mobility_trends:
country_region_code
date
retail_and_recreation_percent_change_from_baseline
Assign the output to a variable called mobility_trends_subset.
Accessing rows¶
Rows can be accessed via row labels (df.loc) and via row index positions (df.iloc).
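As a minimal illustration of the difference between the two, using the mobility_trends DataFrame loaded above:
# .loc selects by labels: here the default integer index labels 0 to 4
# (inclusive) and the column names 'country_region' and 'date'
mobility_trends.loc[0:4, ["country_region", "date"]]

# .iloc selects by integer position: the first five rows and the columns
# at positions 1 and 8, which are 'country_region' and 'date'
mobility_trends.iloc[0:5, [1, 8]]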
# Before accessing particular rows, let's see the names of all countries in the dataset.
# We can achieve that by listing all unique values in the 'country_region' column.
mobility_trends.country_region.unique()
array(['United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda',
'Angola', 'Argentina', 'Austria', 'Australia', 'Aruba',
'Bosnia and Herzegovina', 'Barbados', 'Bangladesh', 'Belgium',
'Burkina Faso', 'Bulgaria', 'Bahrain', 'Benin', 'Bolivia',
'Brazil', 'The Bahamas', 'Botswana', 'Belarus', 'Belize', 'Canada',
'Switzerland', "Côte d'Ivoire", 'Chile', 'Cameroon', 'Colombia',
'Costa Rica', 'Cape Verde', 'Czechia', 'Germany', 'Denmark',
'Dominican Republic', 'Ecuador', 'Estonia', 'Egypt', 'Spain',
'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia',
'Ghana', 'Greece', 'Guatemala', 'Guinea-Bissau', 'Hong Kong',
'Honduras', 'Croatia', 'Haiti', 'Hungary', 'Indonesia', 'Ireland',
'Israel', 'India', 'Iraq', 'Italy', 'Jamaica', 'Jordan', 'Japan',
'Kenya', 'Kyrgyzstan', 'Cambodia', 'South Korea', 'Kuwait',
'Kazakhstan', 'Laos', 'Lebanon', 'Liechtenstein', 'Sri Lanka',
'Lithuania', 'Luxembourg', 'Latvia', 'Libya', 'Morocco', 'Moldova',
'North Macedonia', 'Mali', 'Myanmar (Burma)', 'Mongolia', 'Malta',
'Mauritius', 'Mexico', 'Malaysia', 'Mozambique', 'Namibia',
'Niger', 'Nigeria', 'Nicaragua', 'Netherlands', 'Norway', 'Nepal',
'New Zealand', 'Oman', 'Panama', 'Peru', 'Papua New Guinea',
'Philippines', 'Pakistan', 'Poland', 'Puerto Rico', 'Portugal',
'Paraguay', 'Qatar', 'Réunion', 'Romania', 'Serbia', 'Russia',
'Rwanda', 'Saudi Arabia', 'Sweden', 'Singapore', 'Slovenia',
'Slovakia', 'Senegal', 'El Salvador', 'Togo', 'Thailand',
'Tajikistan', 'Turkey', 'Trinidad and Tobago', 'Taiwan',
'Tanzania', 'Ukraine', 'Uganda', 'United States', 'Uruguay',
'Venezuela', 'Vietnam', 'Yemen', 'South Africa', 'Zambia',
'Zimbabwe'], dtype=object)
# Accessing specific rows from a DataFrame.
# We are interested in the data about the United Kingdom.
mobility_trends_UK = mobility_trends[
mobility_trends["country_region"] == "United Kingdom"
]
mobility_trends_UK.head()
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3647576 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-15 | -12.0 | -7.0 | -35.0 | -12.0 | -4.0 | 2.0 |
3647577 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-16 | -7.0 | -6.0 | -28.0 | -7.0 | -3.0 | 1.0 |
3647578 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-17 | 10.0 | 1.0 | 24.0 | -2.0 | -14.0 | 2.0 |
3647579 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-18 | 7.0 | -1.0 | 20.0 | -3.0 | -14.0 | 2.0 |
3647580 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-19 | 6.0 | -2.0 | 8.0 | -4.0 | -14.0 | 3.0 |
# Accessing data about multiple countries using the isin() method.
mobility_trends_countries = mobility_trends[
mobility_trends["country_region"].isin(
["United Kingdom", "Germany", "Italy", "Sweden"]
)
]
mobility_trends_countries.head()
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3174838 | DE | Germany | NaN | NaN | NaN | NaN | NaN | ChIJa76xwh5ymkcRW-WRjmtd6HU | 2020-02-15 | 6.0 | 1.0 | 45.0 | 10.0 | 0.0 | -1.0 |
3174839 | DE | Germany | NaN | NaN | NaN | NaN | NaN | ChIJa76xwh5ymkcRW-WRjmtd6HU | 2020-02-16 | 7.0 | 10.0 | 9.0 | 6.0 | -1.0 | 0.0 |
3174840 | DE | Germany | NaN | NaN | NaN | NaN | NaN | ChIJa76xwh5ymkcRW-WRjmtd6HU | 2020-02-17 | 2.0 | 2.0 | 7.0 | 1.0 | -2.0 | 0.0 |
3174841 | DE | Germany | NaN | NaN | NaN | NaN | NaN | ChIJa76xwh5ymkcRW-WRjmtd6HU | 2020-02-18 | 2.0 | 2.0 | 10.0 | 1.0 | -1.0 | 1.0 |
3174842 | DE | Germany | NaN | NaN | NaN | NaN | NaN | ChIJa76xwh5ymkcRW-WRjmtd6HU | 2020-02-19 | 3.0 | 0.0 | 6.0 | -1.0 | -1.0 | 1.0 |
# Filter by two conditions — country and county — simultaneously
# First let's see the list of counties in the dataset
mobility_trends[
mobility_trends["country_region"] == "United Kingdom"
].sub_region_1.unique()
array([nan, 'Aberdeen City', 'Aberdeenshire', 'Angus Council',
'Antrim and Newtownabbey', 'Ards and North Down',
'Argyll and Bute Council', 'Armagh City, Banbridge and Craigavon',
'Bath and North East Somerset', 'Bedford', 'Belfast',
'Blackburn with Darwen', 'Blackpool', 'Blaenau Gwent',
'Borough of Halton', 'Bracknell Forest', 'Bridgend County Borough',
'Brighton and Hove', 'Bristol City', 'Buckinghamshire',
'Caerphilly County Borough', 'Cambridgeshire', 'Cardiff',
'Carmarthenshire', 'Causeway Coast and Glens',
'Central Bedfordshire', 'Ceredigion', 'Cheshire East',
'Cheshire West and Chester', 'Clackmannanshire',
'Conwy Principal Area', 'Cornwall', 'County Durham', 'Cumbria',
'Darlington', 'Denbighshire', 'Derby', 'Derbyshire',
'Derry and Strabane', 'Devon', 'Dorset', 'Dumfries and Galloway',
'Dundee City Council', 'East Ayrshire Council',
'East Dunbartonshire Council', 'East Lothian Council',
'East Renfrewshire Council', 'East Riding of Yorkshire',
'East Sussex', 'Edinburgh', 'Essex', 'Falkirk',
'Fermanagh and Omagh', 'Fife', 'Flintshire', 'Glasgow City',
'Gloucestershire', 'Greater London', 'Greater Manchester',
'Gwynedd', 'Hampshire', 'Hartlepool', 'Herefordshire',
'Hertfordshire', 'Highland Council', 'Inverclyde',
'Isle of Anglesey', 'Isle of Wight', 'Kent', 'Kingston upon Hull',
'Lancashire', 'Leicester', 'Leicestershire', 'Lincolnshire',
'Lisburn and Castlereagh', 'Luton', 'Medway', 'Merseyside',
'Merthyr Tydfil County Borough', 'Mid and East Antrim',
'Mid Ulster', 'Middlesbrough', 'Midlothian', 'Milton Keynes',
'Monmouthshire', 'Moray', 'Na h-Eileanan an Iar',
'Neath Port Talbot Principle Area', 'Newport',
'Newry, Mourne and Down', 'Norfolk', 'North Ayrshire Council',
'North East Lincolnshire', 'North Lanarkshire',
'North Lincolnshire', 'North Somerset', 'North Yorkshire',
'Northamptonshire', 'Northumberland', 'Nottingham',
'Nottinghamshire', 'Orkney', 'Oxfordshire', 'Pembrokeshire',
'Perth and Kinross', 'Peterborough', 'Plymouth', 'Portsmouth',
'Powys', 'Reading', 'Redcar and Cleveland', 'Renfrewshire',
'Rhondda Cynon Taff', 'Rutland', 'Scottish Borders',
'Shetland Islands', 'Shropshire', 'Slough', 'Somerset',
'South Ayrshire Council', 'South Gloucestershire',
'South Lanarkshire', 'South Yorkshire', 'Southampton',
'Southend-on-Sea', 'Staffordshire', 'Stirling', 'Stockton-on-Tees',
'Stoke-on-Trent', 'Suffolk', 'Surrey', 'Swansea', 'Swindon',
'Thurrock', 'Torbay', 'Torfaen Principal Area', 'Tyne and Wear',
'Vale of Glamorgan', 'Warrington', 'Warwickshire',
'West Berkshire', 'West Dunbartonshire Council', 'West Lothian',
'West Midlands', 'West Sussex', 'West Yorkshire', 'Wiltshire',
'Windsor and Maidenhead', 'Wokingham', 'Worcestershire',
'Wrexham Principal Area', 'York'], dtype=object)
# Access data about UK and Essex
mobility_trends_UK_Essex = mobility_trends[
(mobility_trends["country_region"] == "United Kingdom")
& (mobility_trends["sub_region_1"] == "Essex")
]
mobility_trends_UK_Essex.head()
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3720375 | GB | United Kingdom | Essex | NaN | NaN | GB-ESS | NaN | ChIJ0w2H_idW2EcReVDuRzjLV0I | 2020-02-15 | -9.0 | -7.0 | -27.0 | -11.0 | -3.0 | 2.0 |
3720376 | GB | United Kingdom | Essex | NaN | NaN | GB-ESS | NaN | ChIJ0w2H_idW2EcReVDuRzjLV0I | 2020-02-16 | -16.0 | -12.0 | -45.0 | -9.0 | -5.0 | 2.0 |
3720377 | GB | United Kingdom | Essex | NaN | NaN | GB-ESS | NaN | ChIJ0w2H_idW2EcReVDuRzjLV0I | 2020-02-17 | 9.0 | -1.0 | 23.0 | -1.0 | -16.0 | 3.0 |
3720378 | GB | United Kingdom | Essex | NaN | NaN | GB-ESS | NaN | ChIJ0w2H_idW2EcReVDuRzjLV0I | 2020-02-18 | 10.0 | -4.0 | 15.0 | 0.0 | -16.0 | 2.0 |
3720379 | GB | United Kingdom | Essex | NaN | NaN | GB-ESS | NaN | ChIJ0w2H_idW2EcReVDuRzjLV0I | 2020-02-19 | 9.0 | -3.0 | 10.0 | 1.0 | -16.0 | 2.0 |
In the above example, you could simply type in
mobility_trends[mobility_trends['sub_region_1']=='Essex']
The command would return the same result, because there is no region called ‘Essex’ outside the UK. However, there are instances in the data where regions in different countries share the same name. In such cases, not specifying the country would return a confusing mix of rows from different countries (see the short check below).
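One possible way to check this is to count, for each value of sub_region_1, the number of distinct countries in which it appears; names with a count above one are shared across countries:
# Number of distinct countries per sub_region_1 value
countries_per_region = mobility_trends.groupby("sub_region_1")["country_region"].nunique()

# Region names that occur in more than one country
countries_per_region[countries_per_region > 1]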
Accessing multiple rows and columns and conditioning¶
mobility_trends_UK
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3647576 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-15 | -12.0 | -7.0 | -35.0 | -12.0 | -4.0 | 2.0 |
3647577 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-16 | -7.0 | -6.0 | -28.0 | -7.0 | -3.0 | 1.0 |
3647578 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-17 | 10.0 | 1.0 | 24.0 | -2.0 | -14.0 | 2.0 |
3647579 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-18 | 7.0 | -1.0 | 20.0 | -3.0 | -14.0 | 2.0 |
3647580 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-19 | 6.0 | -2.0 | 8.0 | -4.0 | -14.0 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3990325 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-12 | -3.0 | 10.0 | 91.0 | -20.0 | -25.0 | 5.0 |
3990326 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-13 | -9.0 | 8.0 | 71.0 | -12.0 | -26.0 | 5.0 |
3990327 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-14 | -15.0 | 6.0 | 23.0 | -7.0 | -5.0 | -1.0 |
3990328 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-15 | -8.0 | 9.0 | 7.0 | -10.0 | -5.0 | 0.0 |
3990329 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-16 | -5.0 | 9.0 | 12.0 | -24.0 | -30.0 | 6.0 |
342754 rows × 15 columns
# Let's see which UK counties had reduced retail and recreation mobility
# the day after Italy went into lockdown on 9 March 2020
mobility_trends_UK_10MARCH2020 = mobility_trends_UK.loc[
(mobility_trends_UK["date"] == "2020-03-10")
& (mobility_trends_UK["retail_and_recreation_percent_change_from_baseline"] < 0),
["sub_region_1", "retail_and_recreation_percent_change_from_baseline"],
]
# Sort in ascending order so the largest reductions appear first
mobility_trends_UK_10MARCH2020.sort_values(
by="retail_and_recreation_percent_change_from_baseline", ascending=True
)
sub_region_1 | retail_and_recreation_percent_change_from_baseline | |
---|---|---|
3739259 | Gloucestershire | -16.0 |
3741716 | Greater London | -14.0 |
3763910 | Greater London | -13.0 |
3742538 | Greater London | -12.0 |
3758978 | Greater London | -11.0 |
... | ... | ... |
3948491 | Tyne and Wear | -1.0 |
3885110 | Nottinghamshire | -1.0 |
3772952 | Greater Manchester | -1.0 |
3930454 | Suffolk | -1.0 |
3803289 | Kent | -1.0 |
167 rows × 2 columns
# UK counties with reduced retail and recreation mobility
# the day after the UK went into lockdown on 23 March 2020
mobility_trends_UK_24MARCH2020 = mobility_trends_UK.loc[
(mobility_trends_UK["date"] == "2020-03-24")
& (mobility_trends_UK["retail_and_recreation_percent_change_from_baseline"] < 0),
["sub_region_1", "retail_and_recreation_percent_change_from_baseline"],
]
# Sort in ascending order so the largest reductions appear first
mobility_trends_UK_24MARCH2020.sort_values(
by="retail_and_recreation_percent_change_from_baseline", ascending=True
)
sub_region_1 | retail_and_recreation_percent_change_from_baseline | |
---|---|---|
3741730 | Greater London | -94.0 |
3742552 | Greater London | -92.0 |
3747484 | Greater London | -87.0 |
3767212 | Greater London | -84.0 |
3756526 | Greater London | -83.0 |
... | ... | ... |
3969853 | West Sussex | -56.0 |
3686799 | Cumbria | -56.0 |
3808235 | Kent | -54.0 |
3809879 | Kent | -54.0 |
3842701 | Merseyside | -50.0 |
417 rows × 2 columns
Let’s identify the places in the UK with the highest reduction of workplaces mobility.
# We need to first find the index of the row
# with the highest reduction of workplaces mobility
mobility_trends_UK_workplaces = mobility_trends_UK[
"workplaces_percent_change_from_baseline"
].idxmin()
mobility_trends_UK_workplaces
3756546
# Use the above index to access the row
# with the highest reduction of workplaces mobility
mobility_trends.iloc[[mobility_trends_UK_workplaces]]
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3756546 | GB | United Kingdom | Greater London | London Borough of Islington | NaN | GB-ISL | NaN | ChIJ5Tvr7V0bdkgRyWXsGMuF8FQ | 2020-04-13 | -88.0 | -51.0 | -28.0 | -90.0 | -91.0 | 40.0 |
In our basic exploratory data analysis, we found that, for the data retrieved on 19 May 2022, the largest percentage mobility change in workplaces was in the London Borough of Islington, Greater London, on 13 April 2020, when workplaces mobility was reduced by 91 per cent compared to the baseline period. This result is only for exploratory purposes. Note that the same mobility on two different days could result in different percentage changes because of the way in which actual mobility is compared to a baseline in the Google COVID-19 Community Mobility Reports.
Handling Missing Values¶
Many data sets have missing values and the mobility data are no exception. Handling missing values is a very common problem in data analysis. We will cover two simple ways of handling missing values in data:
Deletion — missing observations are simply dropped from the data set.
Listwise — removes all data for an observation that has one or more missing values.
Pairwise — removes observations with missing values only for the variables used in the current procedure (e.g., model or technique); see the short sketch after this list.
Imputation — missing observations are replaced with inferred values. We will cover simple imputation methods in which missing values are replaced with the column mean or the row mean.
For more advanced methods and functionality, including interpolation methods, see the section ‘Working with missing data’ in the pandas documentation.
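Pandas already applies something close to pairwise handling in many summary operations, which the following minimal sketch illustrates on the six mobility columns of mobility_trends (selected by position, as above):
# Column-wise means skip NaN values by default (skipna=True), so each
# column's mean is computed from its own non-missing observations only
mobility_trends.iloc[:, 9:15].mean()

# Pairwise correlations exclude missing values pair by pair: each pair of
# columns uses only the rows in which both values are present
mobility_trends.iloc[:, 9:15].corr()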
Detecting missing values¶
In pandas, missing values are represented as NaN, which stands for Not a Number. To detect missing values, we can use the function isna() (other functions, for example isnull(), can be used to the same effect). The function returns boolean values (False, True) indicating whether each corresponding value in a DataFrame is missing.
mobility_trends.isna()
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | True | True | True | True | True | False | False | False | False | False | False | False | False |
1 | False | False | True | True | True | True | True | False | False | False | False | False | False | False | False |
2 | False | False | True | True | True | True | True | False | False | False | False | False | False | False | False |
3 | False | False | True | True | True | True | True | False | False | False | False | False | False | False | False |
4 | False | False | True | True | True | True | True | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9900501 | False | False | False | False | True | True | True | False | False | True | True | True | True | False | True |
9900502 | False | False | False | False | True | True | True | False | False | True | True | True | True | False | True |
9900503 | False | False | False | False | True | True | True | False | False | True | True | True | True | False | True |
9900504 | False | False | False | False | True | True | True | False | False | True | True | True | True | False | True |
9900505 | False | False | False | False | True | True | True | False | False | True | True | True | True | False | True |
9900506 rows × 15 columns
Let’s determine the number of missing values for each column in the DataFrame.
mobility_trends.isna().sum()
country_region_code 6183
country_region 0
sub_region_1 163754
sub_region_2 1610543
metro_area 9847293
iso_3166_2_code 8173140
census_fips_code 7793263
place_id 18182
date 0
retail_and_recreation_percent_change_from_baseline 3769374
grocery_and_pharmacy_percent_change_from_baseline 4048994
parks_percent_change_from_baseline 5217709
transit_stations_percent_change_from_baseline 4988015
workplaces_percent_change_from_baseline 354374
residential_percent_change_from_baseline 3846476
dtype: int64
It would be useful to quantify the percentage of values that are missing per column. To achieve this, we simply divide the number of missing values in a column by the number of rows in the DataFrame (and multiply the result by 100).
mobility_trends.isna().sum() / mobility_trends.shape[0] * 100
country_region_code 0.062451
country_region 0.000000
sub_region_1 1.653996
sub_region_2 16.267280
metro_area 99.462522
iso_3166_2_code 82.552750
census_fips_code 78.715805
place_id 0.183647
date 0.000000
retail_and_recreation_percent_change_from_baseline 38.072539
grocery_and_pharmacy_percent_change_from_baseline 40.896839
parks_percent_change_from_baseline 52.701438
transit_stations_percent_change_from_baseline 50.381415
workplaces_percent_change_from_baseline 3.579352
residential_percent_change_from_baseline 38.851307
dtype: float64
Let’s now repeat the procedure for the UK mobility data only:
mobility_trends_UK.isna().sum() / mobility_trends_UK.shape[0] * 100
country_region_code 0.000000
country_region 0.000000
sub_region_1 0.239822
sub_region_2 36.107237
metro_area 100.000000
iso_3166_2_code 49.442166
census_fips_code 100.000000
place_id 0.000000
date 0.000000
retail_and_recreation_percent_change_from_baseline 2.760289
grocery_and_pharmacy_percent_change_from_baseline 2.684724
parks_percent_change_from_baseline 21.118353
transit_stations_percent_change_from_baseline 3.234390
workplaces_percent_change_from_baseline 0.518739
residential_percent_change_from_baseline 5.692713
dtype: float64
The output shows that the proportion of missing values in most mobility categories is below 6 per cent, with the exception of parks mobility, where roughly 21 per cent of the values are missing.
Removing missing values¶
We can remove missing values using the pandas method dropna(). The default behaviour removes rows that contain at least one missing value. The dropna() method provides other capabilities as well. For example, it is possible to remove columns (instead of rows) that contain missing values by specifying axis=1 within the brackets. Also, we can specify under what conditions rows or columns are removed: use how='any' if you would like to drop rows or columns that contain any missing value, which is the default behaviour; use how='all' if you would like to drop rows or columns in which all values are missing.
mobility_trends_UK.dropna()
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline |
---|
We did not specify particular columns on which to perform the dropna() function. Because the UK subset contains columns (‘metro_area’ and ‘census_fips_code’) in which all values are missing, dropna() removed every row from the DataFrame. This is analogous to listwise deletion of missing values.
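One option in such a case is to first drop the columns that are entirely empty, using the axis and how parameters described above. A minimal sketch (dropna() returns a new DataFrame and leaves mobility_trends_UK unchanged):
# Drop columns (rather than rows) in which every value is missing,
# such as 'metro_area' and 'census_fips_code' in the UK subset
mobility_trends_UK.dropna(axis=1, how="all")

# Drop rows in which every value is missing; no row qualifies here
# because, for example, 'date' is never missing, so the shape is unchanged
mobility_trends_UK.dropna(how="all")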
A more informative approach would be to specify the columns we are going to use in our analysis and drop missing values only for this subset of variables. We can drop rows that contain any missing values in the six mobility categories by passing the labels of those categories to the parameter subset, as shown below:
mobility_trends_UK.dropna(subset=mobility_trends_UK.iloc[:, 9:15].columns)
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3647576 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-15 | -12.0 | -7.0 | -35.0 | -12.0 | -4.0 | 2.0 |
3647577 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-16 | -7.0 | -6.0 | -28.0 | -7.0 | -3.0 | 1.0 |
3647578 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-17 | 10.0 | 1.0 | 24.0 | -2.0 | -14.0 | 2.0 |
3647579 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-18 | 7.0 | -1.0 | 20.0 | -3.0 | -14.0 | 2.0 |
3647580 | GB | United Kingdom | NaN | NaN | NaN | NaN | NaN | ChIJqZHHQhE7WgIReiWIMkOg-MQ | 2020-02-19 | 6.0 | -2.0 | 8.0 | -4.0 | -14.0 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3990325 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-12 | -3.0 | 10.0 | 91.0 | -20.0 | -25.0 | 5.0 |
3990326 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-13 | -9.0 | 8.0 | 71.0 | -12.0 | -26.0 | 5.0 |
3990327 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-14 | -15.0 | 6.0 | 23.0 | -7.0 | -5.0 | -1.0 |
3990328 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-15 | -8.0 | 9.0 | 7.0 | -10.0 | -5.0 | 0.0 |
3990329 | GB | United Kingdom | York | NaN | NaN | GB-YOR | NaN | ChIJh-IigLwxeUgRAKFv7Z75DAM | 2022-05-16 | -5.0 | 9.0 | 12.0 | -24.0 | -30.0 | 6.0 |
255595 rows × 15 columns
mobility_trends_UK.shape
(342754, 15)
After removing the rows that contain missing values in any of the six mobility categories/columns, we are left with 255595 rows out of 342754, a reduction of approximately 25 per cent.
Replacing missing values¶
Let’s replace NaN values with the mean of the non-missing values. For example, we can replace missing values in the columns ‘workplaces_percent_change_from_baseline’ and ‘residential_percent_change_from_baseline’ with the mean of the non-missing values in each of the two columns, as computed by the pandas mean() method.
# Replace the NaNs in the columns 'workplaces_percent_change_from_baseline'
# and 'residential_percent_change_from_baseline'
# by the mean of the non-missing values.
mobility_trends_UK[
[
"workplaces_percent_change_from_baseline",
"residential_percent_change_from_baseline",
]
].fillna(
value=mobility_trends_UK[
[
"workplaces_percent_change_from_baseline",
"residential_percent_change_from_baseline",
]
].mean()
)
workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|
3647576 | -4.0 | 2.0 |
3647577 | -3.0 | 1.0 |
3647578 | -14.0 | 2.0 |
3647579 | -14.0 | 2.0 |
3647580 | -14.0 | 3.0 |
... | ... | ... |
3990325 | -25.0 | 5.0 |
3990326 | -26.0 | 5.0 |
3990327 | -5.0 | -1.0 |
3990328 | -5.0 | 0.0 |
3990329 | -30.0 | 6.0 |
342754 rows × 2 columns
We now have the same number of rows (342754) as the original mobility_trends_UK DataFrame, with each missing value replaced by the mean value of the respective column. More advanced (and more practical) approaches to replacing missing values can take into account groupings in the data and replace missing values with the mean value of a group of counties or the mean value of a time period; those operations require more advanced pandas functionality such as groupby(), which we will use in the next lab.
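As a brief, illustrative preview (not part of the analysis above), group-based imputation could fill missing workplaces values with the mean of the corresponding county (sub_region_1):
# Mean workplaces mobility per county, broadcast back to the original rows
workplaces_county_mean = mobility_trends_UK.groupby("sub_region_1")[
    "workplaces_percent_change_from_baseline"
].transform("mean")

# Fill each missing value with the mean of its county; rows with a missing
# sub_region_1 have no group mean and therefore remain NaN
mobility_trends_UK["workplaces_percent_change_from_baseline"].fillna(
    workplaces_county_mean
)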
Compare mobility data formats¶
A related aggregate, privacy-preserving, and freely available mobility data set is Apple’s Mobility Trends Reports. The data are updated daily and reflect, in aggregate, requests for directions in Apple Maps for selected countries/regions, sub-regions, and cities.
The Mobility Trends Reports data can be downloaded from the website https://covid19.apple.com/mobility. However, because the data file is not at a stable location (URL), we do not load the data from that website; instead, we access the data from a COVID-19 mobility data aggregator on GitHub.
mobility_trends_covid_apple = pd.read_csv(
"https://raw.githubusercontent.com/ActiveConclusion/COVID19_mobility/master/apple_reports/applemobilitytrends.csv"
)
mobility_trends_covid_apple.head()
geo_type | region | transportation_type | alternative_name | sub-region | country | 2020-01-13 | 2020-01-14 | 2020-01-15 | 2020-01-16 | ... | 2022-04-03 | 2022-04-04 | 2022-04-05 | 2022-04-06 | 2022-04-07 | 2022-04-08 | 2022-04-09 | 2022-04-10 | 2022-04-11 | 2022-04-12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | country/region | Albania | driving | NaN | NaN | NaN | 100.0 | 95.30 | 101.43 | 97.20 | ... | 149.58 | 156.76 | 156.76 | 151.02 | 157.86 | 185.12 | 196.62 | 173.11 | 173.21 | 171.09 |
1 | country/region | Albania | walking | NaN | NaN | NaN | 100.0 | 100.68 | 98.93 | 98.46 | ... | 161.71 | 215.31 | 221.62 | 211.34 | 189.25 | 244.22 | 231.90 | 154.93 | 223.97 | 223.45 |
2 | country/region | Argentina | driving | NaN | NaN | NaN | 100.0 | 97.07 | 102.45 | 111.21 | ... | 93.32 | 97.74 | 102.63 | 106.26 | 111.81 | 145.45 | 158.56 | 99.20 | 100.82 | 114.56 |
3 | country/region | Argentina | walking | NaN | NaN | NaN | 100.0 | 95.11 | 101.37 | 112.67 | ... | 88.58 | 126.43 | 133.06 | 135.64 | 142.40 | 178.26 | 172.83 | 98.96 | 126.57 | 139.29 |
4 | country/region | Australia | driving | AU | NaN | NaN | 100.0 | 102.98 | 104.21 | 108.63 | ... | 123.95 | 116.32 | 132.29 | 134.97 | 145.09 | 146.34 | 125.79 | 136.60 | 139.01 | 146.75 |
5 rows × 827 columns
Activity: Discussing data formats
Compare the format of the data in the Apple Mobility Trends Reports to the format of the data in the Google COVID-19 Community Mobility Reports and in Our World in Data COVID-19 data. Focus in particular on similarities and differences in how time series data is represented across data sets.
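One way to make the comparison concrete is to reshape the Apple data from wide format (one column per date) into the long format used by the Google reports (one row per date). A minimal sketch using pd.melt, assuming the column names shown in the preview above:
# Metadata columns in the Apple data; all remaining columns are dates
id_columns = [
    "geo_type",
    "region",
    "transportation_type",
    "alternative_name",
    "sub-region",
    "country",
]

# Reshape from wide to long: one row per region, transportation type, and date
mobility_trends_apple_long = mobility_trends_covid_apple.melt(
    id_vars=id_columns, var_name="date", value_name="mobility_change"
)
mobility_trends_apple_long.head()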
Recording dependencies¶
# Install the watermark extension
!pip install -q watermark
# Load the watermark extension
%load_ext watermark
# Show packages that were imported
%watermark --iversions
pandas: 1.1.3
Save a list of the packages (and their versions) used in the current notebook to a file named requirements.txt.
# Run pipreqsnb after specifying the path to the notebook
!pipreqsnb ../notebooks/04_data_design_and_data_wrangling.ipynb
pipreqs --savepath ../notebooks/requirements.txt .//__temp_pipreqsnb_folder/
INFO: Successfully saved requirements file in ../notebooks/requirements.txt