Data Design & Data Wrangling

‘Data wrangling’ is the mundane labor of collecting, loading, transforming, and preparing data for exploration and analysis. Although it is hard to put a precise number on it, estimates suggest that data scientists spend between 50 and 80 per cent of their time on data wrangling. The importance of data wrangling has been recognised in the social sciences as well.

Although data wrangling may seem atheoretical, many of the decisions you make will in fact depend on how the data were generated or designed.

In this lab, we will first discuss issues related to data design and will then practice hands-on data wrangling with pandas and other Python libraries for data analysis. We will wrangle open data sets, including the Google Covid-19 Community Mobility Reports and Apple Mobility Trends Reports.

Key themes

  • What are data? How are data generated (i.e., data design)?

  • Data wrangling with pandas: loading, selecting, and transforming data.

Learning resources

Matthew Salganik. 2.3 Ten common characteristics of big data. In Bit by Bit: Social Research in the Digital Age. [Online version freely available]

Lisa Gitelman and Virginia Jackson. “Raw Data” Is an Oxymoron. Introduction. MIT Press.

Sabina Leonelli. Scientific Research and Big Data. The Stanford Encyclopedia of Philosophy.

Sam Lau, Joey Gonzalez, and Deb Nolan. The Data Science Lifecycle. In Principles and Techniques of Data Science.

Steve Lohr. For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. New York Times.

Alexander Kindel et al. Improving Metadata Infrastructure for Complex Surveys: Insights from the Fragile Families Challenge. Socius.

Introduction to Data Processing in Python with Pandas, SciPy 2019 Tutorial by Daniel Chen. See also freely available Jupyter notebooks: 01-intro.ipynb and 02-tidy.ipynb.


Activity: Discussing data biases

After reading the section Ten common characteristics of big data in Matthew Salganik’s book Bit by Bit, we will discuss how each characteristic of big data applies to the digital data sets we are analysing in this and the following sections, including the Google COVID-19 Community Mobility Reports and the Apple Mobility Trends Reports.

Get started with pandas

The pandas library:

  • is a fast, powerful, and flexible open source tool for doing real world data analysis in Python.

  • offers a diverse range of high-performance tools for data loading, cleaning, wrangling, merging, reshaping, and summarising.

  • is the go-to data science library in Python.

Importing pandas

To use pandas, we first import the library via Python’s import command and give it the alias pd:

import pandas as pd
pd.__version__
'1.1.3'

Loading your data

Pandas supports many data file formats, including csv, excel, sql, json. For details, see How do I read and write tabular data? For example, once you import pandas as pd, you can load a .csv file using the function pd.read_csv(), an Excel file using the function pd.read_excel(), and a JSON file using the function pd.read_json().
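For illustration, a minimal sketch of what loading different formats looks like (the file names below are hypothetical placeholders, not files provided with this lab):

# Hypothetical file names used for illustration only
df_from_csv = pd.read_csv('my_data.csv')       # comma-separated values
df_from_excel = pd.read_excel('my_data.xlsx')  # Excel file (may require an Excel engine such as openpyxl)
df_from_json = pd.read_json('my_data.json')    # JSON file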

Loading Covid-19 data sets via URL

What are the Google Covid-19 Community Mobility Reports?

  • Aggregated and anonymized sets of freely available data that protect individual privacy.

  • They show trends of human mobility over time by country and region, across six categories of places, including retail and recreation, groceries and pharmacies, parks, transit stations, workplaces, and residential.

  • The data show how visits to (or time spent in) categorized places change compared to baseline days. A baseline day represents a normal value for that day of the week: the baseline is the median value for the corresponding day of the week over the 5-week period 3 January – 6 February 2020.
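For example (with made-up numbers): if the median number of visits to parks in a region on Mondays during the baseline period was 1,000, and a particular Monday during the pandemic saw 850 visits, the reported value for parks on that day would be (850 - 1000) / 1000 * 100 = -15, i.e., a 15 per cent decrease from baseline.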

An overview of the mobility data can be found here.

The code below loads the most recent version of the Covid-19 Google Community Mobility Reports as a pandas DataFrame object and assigns the object to a variable called ‘mobility_trends’.

mobility_trends = pd.read_csv('https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv')

Pandas loads the data as a DataFrame object. We can confirm this by using the built-in function type():

type(mobility_trends)
pandas.core.frame.DataFrame

What is a DataFrame?

A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns.
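To make this concrete, here is a small sketch (with made-up values; pandas is already imported as pd above) of a DataFrame whose columns hold different data types:

# A small, made-up DataFrame mixing strings, integers, and floating point numbers
example_df = pd.DataFrame({
    'country': ['United Kingdom', 'Germany', 'Italy'],  # strings (object dtype)
    'year': [2020, 2020, 2020],                         # integers (int64)
    'mobility_change': [-12.0, 6.0, -3.5]               # floating point numbers (float64)
})
example_df.dtypes  # show the data type of each column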

Viewing, Describing, and Accessing your Data

Once we have our data as a DataFrame, we can view, describe, and access the data using the basic functionality associated with that structure. For example, we can perform various operations using so-called methods. Examples of pandas methods are head() and tail(): head() displays by default the first five rows of the data and tail() displays by default the last five rows. You can display a custom number of rows by passing that number within the brackets.

View your data

# Show the first five rows using the method DataFrame.head()
mobility_trends.head()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
0 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-15 0.0 4.0 5.0 0.0 2.0 1.0
1 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-16 1.0 4.0 4.0 1.0 2.0 1.0
2 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-17 -1.0 1.0 5.0 1.0 2.0 1.0
3 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-18 -2.0 1.0 5.0 0.0 2.0 1.0
4 AE United Arab Emirates NaN NaN NaN NaN NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM 2020-02-19 -2.0 0.0 4.0 -1.0 2.0 1.0
# Show the last five rows using the method DataFrame.tail()   
mobility_trends.tail()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
5435339 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-12 NaN NaN NaN NaN 26.0 NaN
5435340 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-13 NaN NaN NaN NaN 19.0 NaN
5435341 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-14 NaN NaN NaN NaN 24.0 NaN
5435342 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-17 NaN NaN NaN NaN 26.0 NaN
5435343 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-18 NaN NaN NaN NaN 35.0 NaN
# Specify the number of rows to return
mobility_trends.tail(10)
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
5435334 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-05 NaN NaN NaN NaN 25.0 NaN
5435335 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-06 NaN NaN NaN NaN 19.0 NaN
5435336 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-07 NaN NaN NaN NaN 24.0 NaN
5435337 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-10 NaN NaN NaN NaN 14.0 NaN
5435338 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-11 NaN NaN NaN NaN 24.0 NaN
5435339 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-12 NaN NaN NaN NaN 26.0 NaN
5435340 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-13 NaN NaN NaN NaN 19.0 NaN
5435341 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-14 NaN NaN NaN NaN 24.0 NaN
5435342 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-17 NaN NaN NaN NaN 26.0 NaN
5435343 ZW Zimbabwe Midlands Province Kwekwe NaN NaN NaN ChIJRcIZ3-FJNBkRRsj55IcLpfU 2021-05-18 NaN NaN NaN NaN 35.0 NaN

Describing your DataFrame

# Dimensionality — number of rows and columns — of a DataFrame  
mobility_trends.shape
(5435344, 15)
# Use the print function to display the number of rows and columns in a DataFrame at the time of cell execution
# Import the 'datetime' module to display the current date

import datetime
print("\nAs of", datetime.datetime.now(), "The Google COVID-19 Community Mobility Reports contain", 
      mobility_trends.shape[0], "rows and", mobility_trends.shape[1],"columns.")
As of 2021-05-22 11:47:01.148461 The Google COVID-19 Community Mobility Reports contain 5435344 rows and 15 columns.
# Accessing columns using the DataFrame.columns attribute
mobility_trends.columns
Index(['country_region_code', 'country_region', 'sub_region_1', 'sub_region_2',
       'metro_area', 'iso_3166_2_code', 'census_fips_code', 'place_id', 'date',
       'retail_and_recreation_percent_change_from_baseline',
       'grocery_and_pharmacy_percent_change_from_baseline',
       'parks_percent_change_from_baseline',
       'transit_stations_percent_change_from_baseline',
       'workplaces_percent_change_from_baseline',
       'residential_percent_change_from_baseline'],
      dtype='object')

Let’s display a concise summary of our DataFrame using the method info()

# Information about a DataFrame
mobility_trends.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5435344 entries, 0 to 5435343
Data columns (total 15 columns):
 #   Column                                              Dtype  
---  ------                                              -----  
 0   country_region_code                                 object 
 1   country_region                                      object 
 2   sub_region_1                                        object 
 3   sub_region_2                                        object 
 4   metro_area                                          object 
 5   iso_3166_2_code                                     object 
 6   census_fips_code                                    float64
 7   place_id                                            object 
 8   date                                                object 
 9   retail_and_recreation_percent_change_from_baseline  float64
 10  grocery_and_pharmacy_percent_change_from_baseline   float64
 11  parks_percent_change_from_baseline                  float64
 12  transit_stations_percent_change_from_baseline       float64
 13  workplaces_percent_change_from_baseline             float64
 14  residential_percent_change_from_baseline            float64
dtypes: float64(7), object(8)
memory usage: 622.0+ MB

Among other information, the output displays the column labels and the data type of each column. For example, ‘float64’ refers to a floating point number, i.e., a number with a decimal point. In the Community Mobility Reports, floating point numbers are used to represent the percentage mobility change for each category of places.
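Note also that the ‘date’ column is stored as a generic object (string) rather than as dates. As a sketch of a common follow-up wrangling step (not required for the rest of this lab), the column can be converted to a datetime type:

# For illustration: convert the date strings to pandas datetime values.
# We assign to a separate variable so the DataFrame used below is unchanged.
dates_as_datetime = pd.to_datetime(mobility_trends['date'])
dates_as_datetime.dtype  # datetime64[ns]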

Accessing columns and rows in your data

Accessing columns

You can access columns by column name or by column position.

  • Accessing columns via column name

Let’s access a single column.

# Get the country column and save it to its own variable
# The double square bracket option `[[]]` returns a DataFrame

mobility_trends_countries = mobility_trends[['country_region']]
mobility_trends_countries.head()
country_region
0 United Arab Emirates
1 United Arab Emirates
2 United Arab Emirates
3 United Arab Emirates
4 United Arab Emirates
# Display the type of data structure

type(mobility_trends_countries)
pandas.core.frame.DataFrame
# The single square bracket option `[]` returns a Series
# A pandas Series is a one-dimensional data structure or, simply put, a single column

mobility_trends['country_region'].head()
0    United Arab Emirates
1    United Arab Emirates
2    United Arab Emirates
3    United Arab Emirates
4    United Arab Emirates
Name: country_region, dtype: object
# Display the type of data structure
type(mobility_trends['country_region'])
pandas.core.series.Series

Let’s now access multiple columns.

# Accessing multiple columns using a Python list
# You can create lists using square brackets, for example: ['country_region', 'sub_region_1', 'date']
# Python lists are useful as they store multiple items in the same variable 

mobility_trends[['country_region', 'sub_region_1', 'date']].head()
country_region sub_region_1 date
0 United Arab Emirates NaN 2020-02-15
1 United Arab Emirates NaN 2020-02-16
2 United Arab Emirates NaN 2020-02-17
3 United Arab Emirates NaN 2020-02-18
4 United Arab Emirates NaN 2020-02-19
  • Accessing columns via column position

# Access columns 1, 2, and 7 by column position and show the top five rows

mobility_trends.iloc[:, [1, 2, 7]].head()
country_region sub_region_1 place_id
0 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
1 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
2 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
3 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
4 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
# Accessing a subset of rows and columns. We access the top 3 rows and columns 1, 2, and 7

mobility_trends.iloc[0:3, [1, 2, 7]].head()
country_region sub_region_1 place_id
0 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
1 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM
2 United Arab Emirates NaN ChIJvRKrsd9IXj4RpwoIwFYv0zM

Hands-on exercise

  1. Create a new code cell.

  2. Access the first 100 rows and the following columns from the DataFrame mobility_trends:

    • country_region_code

    • date

    • retail_and_recreation_percent_change_from_baseline

  3. Assign the output to a variable called mobility_trends_subset.
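If you get stuck, one possible solution is sketched below (try the exercise on your own first):

# One possible solution to the exercise
# .loc slices by label and is inclusive, so 0:99 gives the first 100 rows here
mobility_trends_subset = mobility_trends.loc[
    0:99, ['country_region_code', 'date', 'retail_and_recreation_percent_change_from_baseline']]
mobility_trends_subset.shape  # (100, 3)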

Accessing rows

Rows can be accessed by index label using df.loc and by integer position using df.iloc.
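A minimal sketch of the difference between the two (in this DataFrame the default RangeIndex makes labels and integer positions coincide):

# .loc selects rows and columns by label; label slices include the end point
mobility_trends.loc[0:4, ['country_region', 'date']]

# .iloc selects rows and columns by integer position; position slices exclude the end point
mobility_trends.iloc[0:5, [1, 8]]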

# Before accessing particular rows, let's see the names of all countries in the dataset. We can achieve that by listing all unique values in the 'country_region' column.

mobility_trends.country_region.unique()
array(['United Arab Emirates', 'Afghanistan', 'Antigua and Barbuda',
       'Angola', 'Argentina', 'Austria', 'Australia', 'Aruba',
       'Bosnia and Herzegovina', 'Barbados', 'Bangladesh', 'Belgium',
       'Burkina Faso', 'Bulgaria', 'Bahrain', 'Benin', 'Bolivia',
       'Brazil', 'The Bahamas', 'Botswana', 'Belarus', 'Belize', 'Canada',
       'Switzerland', "Côte d'Ivoire", 'Chile', 'Cameroon', 'Colombia',
       'Costa Rica', 'Cape Verde', 'Czechia', 'Germany', 'Denmark',
       'Dominican Republic', 'Ecuador', 'Estonia', 'Egypt', 'Spain',
       'Finland', 'Fiji', 'France', 'Gabon', 'United Kingdom', 'Georgia',
       'Ghana', 'Greece', 'Guatemala', 'Guinea-Bissau', 'Hong Kong',
       'Honduras', 'Croatia', 'Haiti', 'Hungary', 'Indonesia', 'Ireland',
       'Israel', 'India', 'Iraq', 'Italy', 'Jamaica', 'Jordan', 'Japan',
       'Kenya', 'Kyrgyzstan', 'Cambodia', 'South Korea', 'Kuwait',
       'Kazakhstan', 'Laos', 'Lebanon', 'Liechtenstein', 'Sri Lanka',
       'Lithuania', 'Luxembourg', 'Latvia', 'Libya', 'Morocco', 'Moldova',
       'North Macedonia', 'Mali', 'Myanmar (Burma)', 'Mongolia', 'Malta',
       'Mauritius', 'Mexico', 'Malaysia', 'Mozambique', 'Namibia',
       'Niger', 'Nigeria', 'Nicaragua', 'Netherlands', 'Norway', 'Nepal',
       'New Zealand', 'Oman', 'Panama', 'Peru', 'Papua New Guinea',
       'Philippines', 'Pakistan', 'Poland', 'Puerto Rico', 'Portugal',
       'Paraguay', 'Qatar', 'Réunion', 'Romania', 'Serbia', 'Russia',
       'Rwanda', 'Saudi Arabia', 'Sweden', 'Singapore', 'Slovenia',
       'Slovakia', 'Senegal', 'El Salvador', 'Togo', 'Thailand',
       'Tajikistan', 'Turkey', 'Trinidad and Tobago', 'Taiwan',
       'Tanzania', 'Ukraine', 'Uganda', 'United States', 'Uruguay',
       'Venezuela', 'Vietnam', 'Yemen', 'South Africa', 'Zambia',
       'Zimbabwe'], dtype=object)
# Accessing specific rows from a DataFrame
# We are interested in the data about the United Kingdom

mobility_trends_UK = mobility_trends[mobility_trends['country_region']=='United Kingdom']
mobility_trends_UK.head()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
1988242 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-15 -12.0 -7.0 -35.0 -12.0 -4.0 2.0
1988243 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-16 -7.0 -6.0 -28.0 -7.0 -3.0 1.0
1988244 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-17 10.0 1.0 24.0 -2.0 -14.0 2.0
1988245 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-18 7.0 -1.0 20.0 -3.0 -14.0 2.0
1988246 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-19 6.0 -2.0 8.0 -4.0 -14.0 3.0
# Accessing data about multiple countries using the isin() method
# The backslash symbol \ can be used to split a long command over multiple lines 
mobility_trends_countries = mobility_trends[mobility_trends['country_region'].\
                                                  isin(['United Kingdom', 'Germany', 'Italy', 'Sweden'])]
mobility_trends_countries.head()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
1726697 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-15 6.0 1.0 45.0 10.0 0.0 -1.0
1726698 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-16 7.0 10.0 9.0 6.0 -1.0 0.0
1726699 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-17 2.0 2.0 7.0 1.0 -2.0 0.0
1726700 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-18 2.0 2.0 10.0 1.0 -1.0 1.0
1726701 DE Germany NaN NaN NaN NaN NaN ChIJa76xwh5ymkcRW-WRjmtd6HU 2020-02-19 3.0 0.0 6.0 -1.0 -1.0 1.0
# Filter by two conditions — country and county — simultaneously
# First let's see the list of counties in the dataset

mobility_trends[mobility_trends['country_region']=='United Kingdom'].sub_region_1.unique()
array([nan, 'Aberdeen City', 'Aberdeenshire', 'Angus Council',
       'Antrim and Newtownabbey', 'Ards and North Down',
       'Argyll and Bute Council', 'Armagh City, Banbridge and Craigavon',
       'Bath and North East Somerset', 'Bedford', 'Belfast',
       'Blackburn with Darwen', 'Blackpool', 'Blaenau Gwent',
       'Borough of Halton', 'Bracknell Forest', 'Bridgend County Borough',
       'Brighton and Hove', 'Bristol City', 'Buckinghamshire',
       'Caerphilly County Borough', 'Cambridgeshire', 'Cardiff',
       'Carmarthenshire', 'Causeway Coast and Glens',
       'Central Bedfordshire', 'Ceredigion', 'Cheshire East',
       'Cheshire West and Chester', 'Clackmannanshire',
       'Conwy Principal Area', 'Cornwall', 'County Durham', 'Cumbria',
       'Darlington', 'Denbighshire', 'Derby', 'Derbyshire',
       'Derry and Strabane', 'Devon', 'Dorset', 'Dumfries and Galloway',
       'Dundee City Council', 'East Ayrshire Council',
       'East Dunbartonshire Council', 'East Lothian Council',
       'East Renfrewshire Council', 'East Riding of Yorkshire',
       'East Sussex', 'Edinburgh', 'Essex', 'Falkirk',
       'Fermanagh and Omagh', 'Fife', 'Flintshire', 'Glasgow City',
       'Gloucestershire', 'Greater London', 'Greater Manchester',
       'Gwynedd', 'Hampshire', 'Hartlepool', 'Herefordshire',
       'Hertfordshire', 'Highland Council', 'Inverclyde',
       'Isle of Anglesey', 'Isle of Wight', 'Kent', 'Kingston upon Hull',
       'Lancashire', 'Leicester', 'Leicestershire', 'Lincolnshire',
       'Lisburn and Castlereagh', 'Luton', 'Medway', 'Merseyside',
       'Merthyr Tydfil County Borough', 'Mid and East Antrim',
       'Mid Ulster', 'Middlesbrough', 'Midlothian', 'Milton Keynes',
       'Monmouthshire', 'Moray', 'Na h-Eileanan an Iar',
       'Neath Port Talbot Principle Area', 'Newport',
       'Newry, Mourne and Down', 'Norfolk', 'North Ayrshire Council',
       'North East Lincolnshire', 'North Lanarkshire',
       'North Lincolnshire', 'North Somerset', 'North Yorkshire',
       'Northamptonshire', 'Northumberland', 'Nottingham',
       'Nottinghamshire', 'Orkney', 'Oxfordshire', 'Pembrokeshire',
       'Perth and Kinross', 'Peterborough', 'Plymouth', 'Portsmouth',
       'Powys', 'Reading', 'Redcar and Cleveland', 'Renfrewshire',
       'Rhondda Cynon Taff', 'Rutland', 'Scottish Borders',
       'Shetland Islands', 'Shropshire', 'Slough', 'Somerset',
       'South Ayrshire Council', 'South Gloucestershire',
       'South Lanarkshire', 'South Yorkshire', 'Southampton',
       'Southend-on-Sea', 'Staffordshire', 'Stirling', 'Stockton-on-Tees',
       'Stoke-on-Trent', 'Suffolk', 'Surrey', 'Swansea', 'Swindon',
       'Thurrock', 'Torbay', 'Torfaen Principal Area', 'Tyne and Wear',
       'Vale of Glamorgan', 'Warrington', 'Warwickshire',
       'West Berkshire', 'West Dunbartonshire Council', 'West Lothian',
       'West Midlands', 'West Sussex', 'West Yorkshire', 'Wiltshire',
       'Windsor and Maidenhead', 'Wokingham', 'Worcestershire',
       'Wrexham Principal Area', 'York'], dtype=object)
# Access data about UK and Essex
mobility_trends_UK_Essex = mobility_trends[(mobility_trends['country_region'] == 'United Kingdom') & 
                                 (mobility_trends['sub_region_1']=='Essex')]

mobility_trends_UK_Essex.head()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
2028787 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-15 -9.0 -7.0 -27.0 -11.0 -3.0 2.0
2028788 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-16 -16.0 -12.0 -45.0 -9.0 -5.0 2.0
2028789 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-17 9.0 -1.0 23.0 -1.0 -16.0 3.0
2028790 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-18 10.0 -4.0 15.0 0.0 -16.0 2.0
2028791 GB United Kingdom Essex NaN NaN GB-ESS NaN ChIJ0w2H_idW2EcReVDuRzjLV0I 2020-02-19 9.0 -3.0 10.0 1.0 -16.0 2.0

In the above example, you could simply type:

mobility_trends[mobility_trends['sub_region_1']=='Essex']

The command returns the same result because there is no region called ‘Essex’ outside the UK. However, there are instances in the data where regions in different countries share the same name. In such cases, not specifying the country would return misleading output.
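As a quick check (a sketch, for illustration), we can see whether any sub-region name appears in more than one country:

# Count in how many countries each sub-region name appears
regions_per_country = mobility_trends[['country_region', 'sub_region_1']].drop_duplicates()
name_counts = regions_per_country['sub_region_1'].value_counts()
name_counts[name_counts > 1].head()  # sub-region names shared by more than one country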

Accessing multiple rows and columns and conditioning

mobility_trends_UK
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
1988242 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-15 -12.0 -7.0 -35.0 -12.0 -4.0 2.0
1988243 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-16 -7.0 -6.0 -28.0 -7.0 -3.0 1.0
1988244 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-17 10.0 1.0 24.0 -2.0 -14.0 2.0
1988245 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-18 7.0 -1.0 20.0 -3.0 -14.0 2.0
1988246 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-19 6.0 -2.0 8.0 -4.0 -14.0 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2179301 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-14 -33.0 10.0 10.0 -31.0 -38.0 11.0
2179302 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-15 -40.0 4.0 -20.0 -21.0 -22.0 7.0
2179303 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-16 -36.0 4.0 -4.0 -23.0 -21.0 6.0
2179304 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-17 -14.0 7.0 19.0 -25.0 -38.0 10.0
2179305 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-18 -11.0 9.0 47.0 -27.0 -38.0 10.0

191064 rows × 15 columns

# Let's see which UK counties had reduced retail and recreation mobility on the day after Italy went into lockdown on 9 March 2020
mobility_trends_UK_10MARCH2020 = mobility_trends_UK.loc[(mobility_trends_UK['date']=='2020-03-10') &
                                       (mobility_trends_UK['retail_and_recreation_percent_change_from_baseline']<0),
                                       ['sub_region_1','retail_and_recreation_percent_change_from_baseline']]

# Sort in ascending order, i.e., largest declines first
mobility_trends_UK_10MARCH2020.sort_values(by='retail_and_recreation_percent_change_from_baseline', ascending=True)
sub_region_1 retail_and_recreation_percent_change_from_baseline
2039322 Gloucestershire -16.0
2040690 Greater London -14.0
2053083 Greater London -13.0
2041149 Greater London -12.0
2050329 Greater London -11.0
... ... ...
2155980 Tyne and Wear -1.0
2120655 Nottinghamshire -1.0
2058132 Greater Manchester -1.0
2145929 Suffolk -1.0
2075038 Kent -1.0

167 rows × 2 columns

# UK counties with reduced retail and recreation mobility on the day after the UK went into lockdown on 23 March 2020
mobility_trends_UK_24MARCH2020=mobility_trends_UK.loc[(mobility_trends_UK['date']=='2020-03-24') & 
                                       (mobility_trends_UK['retail_and_recreation_percent_change_from_baseline']<0), 
                                       ['sub_region_1','retail_and_recreation_percent_change_from_baseline']]

# Sort in ascending order, i.e., largest declines first
mobility_trends_UK_24MARCH2020.sort_values(by='retail_and_recreation_percent_change_from_baseline', ascending=True)
sub_region_1 retail_and_recreation_percent_change_from_baseline
2040704 Greater London -94.0
2041163 Greater London -92.0
2043917 Greater London -87.0
2054933 Greater London -84.0
2048966 Greater London -83.0
... ... ...
2167904 West Sussex -56.0
2010094 Cumbria -56.0
2077806 Kent -54.0
2078724 Kent -54.0
2097026 Merseyside -50.0

417 rows × 2 columns

Let’s identify the place in the UK with the largest reduction in workplace mobility.

# First find the index of the row with the largest reduction in workplace mobility
mobility_trends_UK_workplaces = mobility_trends_UK['workplaces_percent_change_from_baseline'].idxmin()
mobility_trends_UK_workplaces
2048986
# Use the above index to access the row with the largest reduction in workplace mobility
# (row labels coincide with integer positions here because of the default RangeIndex, so .iloc works)
mobility_trends.iloc[[mobility_trends_UK_workplaces]]
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
2048986 GB United Kingdom Greater London London Borough of Islington NaN GB-ISL NaN ChIJ5Tvr7V0bdkgRyWXsGMuF8FQ 2020-04-13 -88.0 -51.0 -28.0 -90.0 -91.0 40.0

In our basic exploratory data analysis, we found that the largest reduction in workplace mobility in the UK data was in the London Borough of Islington, Greater London, on 13 April 2020, when workplace mobility was reduced by 91 per cent compared to the baseline. This result is for exploratory purposes only. Note that the same level of mobility on two different days would result in different percentage changes because of the way actual mobility is compared to a baseline in the Google Covid-19 Community Mobility Reports.

Handling Missing Values

Many data sets have missing values and the mobility data is not an exception. Handling missing values is a very common problem in data analysis. We will cover two simple ways of handling missing values in data:

  • Deletion — missing observations are simply dropped from the data set (see the toy example after this list).

    • Listwise — removes all data for an observation that has one or more missing values.

    • Pairwise — removes observations with missing values only for the variables used in the current procedure (e.g., model or technique).

  • Imputation — missing observations are replaced with inferred values. We will cover simple imputation methods in which missing values are replaced with column mean or row mean.

For more advanced methods and functionality, including interpolation methods, see the section ‘Working with missing data’ in Pandas documentation.
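To see the difference between the two deletion approaches, here is a toy sketch with made-up values:

import numpy as np

# A toy DataFrame with made-up values and a few missing entries
toy = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                    'B': [4.0, 5.0, np.nan],
                    'C': [7.0, 8.0, 9.0]})

# Listwise deletion: drop every row with at least one missing value (only the first row survives)
toy.dropna()

# Pairwise deletion: a procedure using only A and C keeps every row that is complete on A and C
toy[['A', 'C']].dropna()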

Detecting missing values

In pandas, missing values are represented as NaN, which stands for Not a Number. To detect missing values, we can use the method isna() (other methods, for example isnull(), can be used to the same effect). The method returns boolean values (False, True) indicating whether each corresponding value in a DataFrame is missing.

mobility_trends.isna()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
0 False False True True True True True False False False False False False False False
1 False False True True True True True False False False False False False False False
2 False False True True True True True False False False False False False False False
3 False False True True True True True False False False False False False False False
4 False False True True True True True False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5461522 False False False False True True True False False True True True True False True
5461523 False False False False True True True False False True True True True False True
5461524 False False False False True True True False False True True True True False True
5461525 False False False False True True True False False True True True True False True
5461526 False False False False True True True False False True True True True False True

5461527 rows × 15 columns

Let’s determine the number of missing values for each column in the DataFrame.

mobility_trends.isna().sum()
country_region_code                                      3479
country_region                                              0
sub_region_1                                            91843
sub_region_2                                           900937
metro_area                                            5431697
iso_3166_2_code                                       4492972
census_fips_code                                      4307968
place_id                                                 9879
date                                                        0
retail_and_recreation_percent_change_from_baseline    2052299
grocery_and_pharmacy_percent_change_from_baseline     2183223
parks_percent_change_from_baseline                    2839762
transit_stations_percent_change_from_baseline         2715596
workplaces_percent_change_from_baseline                227198
residential_percent_change_from_baseline              2234460
dtype: int64

It would be useful to quantify the percentage of values that are missing per column. To achieve this, we simply divide the number of missing values in a column by the number of rows in the DataFrame (and multiply the result by 100).

mobility_trends.isna().sum()/mobility_trends.shape[0]*100
country_region_code                                    0.063700
country_region                                         0.000000
sub_region_1                                           1.681636
sub_region_2                                          16.496064
metro_area                                            99.453816
iso_3166_2_code                                       82.265857
census_fips_code                                      78.878453
place_id                                               0.180883
date                                                   0.000000
retail_and_recreation_percent_change_from_baseline    37.577384
grocery_and_pharmacy_percent_change_from_baseline     39.974590
parks_percent_change_from_baseline                    51.995751
transit_stations_percent_change_from_baseline         49.722285
workplaces_percent_change_from_baseline                4.159972
residential_percent_change_from_baseline              40.912734
dtype: float64

Let’s now repeat the procedure for the UK mobility data only:

mobility_trends_UK.isna().sum()/mobility_trends_UK.shape[0]*100
country_region_code                                     0.000000
country_region                                          0.000000
sub_region_1                                            0.240234
sub_region_2                                           36.080580
metro_area                                            100.000000
iso_3166_2_code                                        49.205502
census_fips_code                                      100.000000
place_id                                                0.000000
date                                                    0.000000
retail_and_recreation_percent_change_from_baseline      3.949462
grocery_and_pharmacy_percent_change_from_baseline       3.523950
parks_percent_change_from_baseline                     21.737742
transit_stations_percent_change_from_baseline           4.204874
workplaces_percent_change_from_baseline                 0.846837
residential_percent_change_from_baseline                8.076875
dtype: float64

The output shows that the missing values in most mobility categories are below 5 per cent, with the exception of residential mobility (8.1 per cent) and parks mobility (21.7 per cent).

Removing missing values

We can remove missing values using the pandas method dropna(). By default, it removes rows that contain at least one missing value. The dropna() method provides other capabilities as well. For example, it is possible to remove columns (instead of rows) that contain missing values by specifying axis=1 within the brackets. We can also specify under what conditions rows or columns are removed: use how='any' to drop rows or columns that contain any missing value (the default behaviour), or how='all' to drop rows or columns in which all values are missing.

mobility_trends_UK.dropna()
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline

We did not specify particular columns on which to apply dropna(). Because there are columns (such as ‘metro_area’) in which all values are missing, every row contains at least one missing value, so dropna() removed all rows from the DataFrame. This is analogous to listwise deletion of missing values.
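An alternative, sketched below, is to drop the columns in which all values are missing (such as ‘metro_area’ in the UK subset) rather than dropping rows:

# Drop columns in which every value is missing, keeping all rows
mobility_trends_UK.dropna(axis=1, how='all')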

A more informative approach is to specify the columns we are going to use in our analysis and drop missing values only for that subset of variables. We can drop rows that contain any missing values in the six mobility categories by passing the labels of those columns to the parameter subset, as shown below:

mobility_trends_UK.dropna(subset=mobility_trends_UK.iloc[:,9:15].columns)
country_region_code country_region sub_region_1 sub_region_2 metro_area iso_3166_2_code census_fips_code place_id date retail_and_recreation_percent_change_from_baseline grocery_and_pharmacy_percent_change_from_baseline parks_percent_change_from_baseline transit_stations_percent_change_from_baseline workplaces_percent_change_from_baseline residential_percent_change_from_baseline
1988242 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-15 -12.0 -7.0 -35.0 -12.0 -4.0 2.0
1988243 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-16 -7.0 -6.0 -28.0 -7.0 -3.0 1.0
1988244 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-17 10.0 1.0 24.0 -2.0 -14.0 2.0
1988245 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-18 7.0 -1.0 20.0 -3.0 -14.0 2.0
1988246 GB United Kingdom NaN NaN NaN NaN NaN ChIJqZHHQhE7WgIReiWIMkOg-MQ 2020-02-19 6.0 -2.0 8.0 -4.0 -14.0 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2179301 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-14 -33.0 10.0 10.0 -31.0 -38.0 11.0
2179302 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-15 -40.0 4.0 -20.0 -21.0 -22.0 7.0
2179303 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-16 -36.0 4.0 -4.0 -23.0 -21.0 6.0
2179304 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-17 -14.0 7.0 19.0 -25.0 -38.0 10.0
2179305 GB United Kingdom York NaN NaN GB-YOR NaN ChIJh-IigLwxeUgRAKFv7Z75DAM 2021-05-18 -11.0 9.0 47.0 -27.0 -38.0 10.0

137992 rows × 15 columns

mobility_trends_UK.shape
(191064, 15)

After removing the rows that contain missing values in the six mobility categories/columns, we are left with 137992 rows out of 191064 rows, a reduction of approximately 28%.

Replacing missing values

Let’s replace NaN values with the mean of the non-missing values. For example, we can replace the missing values in the columns ‘workplaces_percent_change_from_baseline’ and ‘residential_percent_change_from_baseline’ with the mean of each respective column, as computed by the pandas mean() method.

# Replace the NaNs in columns 'workplaces_percent_change_from_baseline' and 'residential_percent_change_from_baseline' by the mean of non-missing values
mobility_trends_UK[['workplaces_percent_change_from_baseline','residential_percent_change_from_baseline']].\
fillna(value=mobility_trends_UK[['workplaces_percent_change_from_baseline','residential_percent_change_from_baseline']].mean())
workplaces_percent_change_from_baseline residential_percent_change_from_baseline
1988242 -4.0 2.0
1988243 -3.0 1.0
1988244 -14.0 2.0
1988245 -14.0 2.0
1988246 -14.0 3.0
... ... ...
2179301 -38.0 11.0
2179302 -22.0 7.0
2179303 -21.0 6.0
2179304 -38.0 10.0
2179305 -38.0 10.0

191064 rows × 2 columns

We now have the same number of rows (191064) as the original mobility_trends_UK DataFrame, with each missing value replaced by the mean value of the respective column. More advanced (and more practical) approaches to replacing missing values take into account groupings in the data and replace missing values with the mean of a group of counties or of a time period; such operations require more advanced pandas functionality, such as groupby(), which we will use in the next lab.
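As a preview (a sketch only; we will cover groupby() properly in the next lab), group-based mean imputation could look like this:

# For illustration: replace missing workplace values with the mean of the
# corresponding sub_region_1 group rather than the overall column mean
mobility_trends_UK['workplaces_percent_change_from_baseline'].fillna(
    mobility_trends_UK.groupby('sub_region_1')['workplaces_percent_change_from_baseline'].transform('mean'))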

Compare mobility data formats

A related aggregate, privacy-preserving, and freely available mobility data set is Apple’s Mobility Trends Reports. The data are updated daily and reflect, in aggregate, requests for directions in Apple Maps for select countries/regions, sub-regions, and cities.

The Mobility Trends Reports data can be downloaded from the website https://covid19.apple.com/mobility. However, because the data file is not kept at a stable location (URL), we do not load the data directly from the website; instead, we access it from a COVID-19 mobility data aggregator on GitHub.

mobility_trends_covid_apple = pd.read_csv('https://raw.githubusercontent.com/ActiveConclusion/COVID19_mobility/master/apple_reports/applemobilitytrends.csv')
mobility_trends_covid_apple.head()
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (3) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
geo_type region transportation_type alternative_name sub-region country 2020-01-13 2020-01-14 2020-01-15 2020-01-16 2020-01-17 2020-01-18 2020-01-19 2020-01-20 2020-01-21 2020-01-22 2020-01-23 2020-01-24 2020-01-25 2020-01-26 2020-01-27 2020-01-28 2020-01-29 2020-01-30 2020-01-31 2020-02-01 2020-02-02 2020-02-03 2020-02-04 2020-02-05 2020-02-06 2020-02-07 2020-02-08 2020-02-09 2020-02-10 2020-02-11 2020-02-12 2020-02-13 2020-02-14 2020-02-15 ... 2021-03-12 2021-03-13 2021-03-14 2021-03-15 2021-03-16 2021-03-17 2021-03-18 2021-03-19 2021-03-20 2021-03-21 2021-03-22 2021-03-23 2021-03-24 2021-03-25 2021-03-26 2021-03-27 2021-03-28 2021-03-29 2021-03-30 2021-03-31 2021-04-01 2021-04-02 2021-04-03 2021-04-04 2021-04-05 2021-04-06 2021-04-07 2021-04-08 2021-04-09 2021-04-10 2021-04-11 2021-04-12 2021-04-13 2021-04-14 2021-04-15 2021-04-16 2021-04-17 2021-04-18 2021-04-19 2021-04-20
0 country/region Albania driving NaN NaN NaN 100.0 95.30 101.43 97.20 103.55 112.67 104.83 94.39 94.07 93.51 92.94 102.13 102.38 101.41 94.62 89.12 90.17 90.21 97.71 102.50 108.92 92.82 91.48 93.99 96.72 102.46 103.29 107.83 87.99 94.18 94.62 99.70 139.30 123.90 ... NaN 126.15 140.06 111.40 111.94 102.92 105.37 110.95 124.45 113.95 98.84 117.22 102.75 105.33 114.25 142.29 150.08 115.24 120.16 113.23 116.35 128.46 134.67 128.77 141.59 127.08 114.13 119.86 141.35 168.45 181.11 134.52 121.12 113.90 108.87 118.82 137.63 141.20 116.15 114.33
1 country/region Albania walking NaN NaN NaN 100.0 100.68 98.93 98.46 100.85 100.13 82.13 95.65 97.78 95.39 94.24 93.73 97.06 77.27 83.37 82.73 84.39 88.19 90.79 88.70 79.32 87.12 88.06 99.40 85.84 94.63 99.74 81.41 90.19 90.45 94.16 95.69 109.21 108.40 ... NaN 134.41 126.61 124.39 142.09 143.24 134.16 129.94 133.05 110.41 122.64 142.64 140.04 129.98 146.01 158.98 143.58 146.31 141.75 139.66 149.34 149.94 146.87 131.22 164.78 151.60 142.05 153.30 168.96 178.72 156.97 155.61 128.57 143.03 145.42 143.71 153.22 140.68 149.68 142.30
2 country/region Argentina driving NaN NaN NaN 100.0 97.07 102.45 111.21 118.45 124.01 95.44 95.13 95.42 97.66 99.42 113.34 118.23 91.31 93.37 91.12 92.35 96.74 111.24 123.96 89.01 91.66 89.18 94.49 95.98 111.12 121.53 89.23 96.42 96.97 101.68 104.90 122.91 127.62 ... NaN 123.02 74.72 74.72 75.30 82.27 85.50 104.47 106.11 74.13 73.24 93.80 73.71 74.27 101.48 108.56 66.71 72.20 80.54 103.47 109.11 94.97 112.11 78.70 72.48 74.34 76.43 76.15 82.91 80.38 54.24 67.02 68.35 69.81 71.46 79.26 81.24 52.83 60.19 63.25
3 country/region Argentina walking NaN NaN NaN 100.0 95.11 101.37 112.67 116.72 114.14 84.54 101.37 106.12 104.91 102.56 104.88 98.83 73.64 94.27 93.86 92.20 96.00 98.99 97.42 67.21 95.18 96.25 100.39 99.10 105.00 106.45 73.12 104.20 105.87 111.47 113.20 128.75 103.47 ... NaN 91.41 51.49 68.86 66.24 73.29 78.94 90.69 80.48 54.61 68.93 85.16 59.04 62.92 86.99 83.76 47.22 65.93 73.72 82.34 82.38 77.44 91.32 46.58 64.34 69.24 70.52 66.69 71.23 63.38 43.43 63.50 64.89 67.36 66.05 70.13 68.50 40.28 56.19 58.81
4 country/region Australia driving AU NaN NaN 100.0 102.98 104.21 108.63 109.08 89.00 99.35 103.53 106.80 107.40 115.65 109.66 88.59 85.18 95.91 99.40 99.92 103.04 103.73 85.73 93.39 95.83 98.63 102.40 107.46 106.55 86.33 98.78 100.08 103.28 106.19 116.62 115.36 93.51 ... NaN 98.87 102.82 107.66 110.07 111.52 120.61 118.89 97.19 102.72 101.59 110.29 115.88 125.08 125.53 103.02 106.90 103.76 108.94 125.71 114.77 107.45 89.61 100.99 108.20 118.36 123.80 124.67 129.30 105.68 112.85 117.10 120.98 124.67 131.33 128.89 106.49 106.19 109.87 113.70

5 rows × 470 columns

Activity: Discussing data formats

Compare the format of the data in the Apple Mobility Trends Reports to the format of the data in the Google COVID-19 Community Mobility Reports and in Our World in Data COVID-19 data. Focus in particular on similarities and differences in how time series data is represented across data sets.
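One way to make the comparison concrete is to reshape the ‘wide’ Apple data, where each date is a separate column, into a ‘long’ format similar to Google’s, where each row is one place on one date. A sketch using the pandas melt() method:

# Reshape the Apple data from wide to long format
id_columns = ['geo_type', 'region', 'transportation_type',
              'alternative_name', 'sub-region', 'country']
mobility_trends_covid_apple_long = mobility_trends_covid_apple.melt(
    id_vars=id_columns, var_name='date', value_name='mobility_change')
mobility_trends_covid_apple_long.head()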

Recording dependencies

# Install the watermark extension
!pip install watermark

# Load the watermark extension
%load_ext watermark

# Show packages that were imported
%watermark --iversions
Requirement already satisfied: watermark in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (2.2.0)
Requirement already satisfied: ipython in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from watermark) (7.19.0)
Requirement already satisfied: pexpect>4.3 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (4.8.0)
Requirement already satisfied: backcall in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (0.2.0)
Requirement already satisfied: prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (3.0.8)
Requirement already satisfied: setuptools>=18.5 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (50.3.2.post20201201)
Requirement already satisfied: traitlets>=4.2 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (5.0.5)
Requirement already satisfied: pygments in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (2.7.2)
Requirement already satisfied: jedi>=0.10 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (0.17.1)
Requirement already satisfied: pickleshare in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (0.7.5)
Requirement already satisfied: decorator in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (4.4.2)
Requirement already satisfied: appnope in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from ipython->watermark) (0.1.2)
Requirement already satisfied: parso<0.8.0,>=0.7.0 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from jedi>=0.10->ipython->watermark) (0.7.0)
Requirement already satisfied: ptyprocess>=0.5 in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from pexpect>4.3->ipython->watermark) (0.6.0)
Requirement already satisfied: wcwidth in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from prompt-toolkit!=3.0.0,!=3.0.1,<3.1.0,>=2.0.0->ipython->watermark) (0.2.5)
Requirement already satisfied: ipython-genutils in /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages (from traitlets>=4.2->ipython->watermark) (0.2.0)
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
matplotlib: 3.3.2
pandas    : 1.1.3
seaborn   : 0.11.0