Python for Data Analysis on the Cloud

Get Started with Jupyter & Colab notebooks

In this course, we will use the Jupyter notebook on the cloud.

The Jupyter notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualisations, and narrative text.

There are many free services that allow you to run Jupyter notebooks on the cloud. We will use Colab and Binder.

Colab is a free environment that runs Jupyter notebooks on Google Cloud and requires no installation or setup. You can view publicly shared notebooks without signing in to Google. To execute and change code interactively, you need to sign in with a Google account. You can find more information on how to create a Google account here. You can learn more about Colab and how to open a new notebook here.

MyBinder is a free, community-led infrastructure that opens Jupyter notebooks in an interactive and reproducible environment and requires no install or setup. Binder requires no account registration or log-in to execute and change code interactively.



Learning resources

The Jupyter Notebook. Jupyter Team.

Jupyter Notebook Tutorial. DataCamp.

Get started with Google Colaboratory. Jake VanderPlas, Coding TensorFlow.

A hands-on introduction to Python for beginning programmers. Jessica McKellar, PyCon 2014.

Charles Severance. Python for Everybody: Exploring Data In Python 3.

David Amos et al. Python Basics: A Practical Introduction to Python. The authors make available sample chapters.

Getting Started With the Open Science Framework (OSF). Center for Open Science.


Structure of Colab notebook

The Colab notebook has a simple structure that consists of three parts:

  • Menubar

  • Toolbar

  • Cells

For a demonstration of these features, see the example Colab notebook Overview of Colaboratory Features below.

Manage Jupyter Notebook Files

To open a new notebook:

File -> New notebook

To rename a notebook:

File -> Rename notebook

To access revision history:

File -> Revision history

To check whether the cells in your notebook are executable in linear order:

Runtime -> Restart and run all

To share your Colab notebook and collaborate, you can create a shareable link by clicking the Share button at the top right of your Colab notebook and specify the type of access (e.g., Anyone with the link can view the notebook). You can then copy and share the link with your collaborators.

Cells

The notebook has two types of cells: code and text. You can add new cells by using the + Code and + Text buttons that are in the toolbar above the notebook and also appear when you hover between a pair of cells.

Python Coding for Data Analysis

Below is a code cell, in which we type in the arithmetic expression 21 + 21.

The code is preceded by a comment. Commenting your code is good practice and part of a reproducible workflow. Comments in Python code cells start with a hash symbol # followed by a single space and some text. The text that follows the hash symbol on the same line is treated as a comment and is therefore not evaluated by the Python interpreter. Only the code (in this instance, “21 + 21”) is evaluated, and the output (in this instance, “42”) is displayed below the code cell.

To execute the cell, press Shift + Enter or click the Play icon on the left.

# Perform a basic arithmetic operation of addition
21 + 21
42

Python reads the code entered in the cell, evaluates it, and prints the result (42).

Create a toy data set and perform basic data analysis

Let’s create a list of the whole numbers (or integers) 4, 2, 8, 6.

Lists are one of the built-in data types in Python. Elements in a list are separated by comma , and are enclosed in square brackets []:

[4, 2, 8, 6]  # create a list
[4, 2, 8, 6]

The comment # create a list is an example of an inline comment. An inline comment sits on the same line as the code statement it refers to and is separated from that statement by at least two spaces. Like block comments, inline comments start with a hash symbol # followed by a single space and some text.

Let’s assign the list of numbers to a variable called even_numbers using the =, which is called the assignment operator.

even_numbers = [4, 2, 8, 6]

You can now apply built-in functions from the Python Standard Library to the variable even_numbers. A function is a block of code that:

  • takes input parameters

  • performs a specific task

  • returns an output.

Python has various built-in functions, including min(), max(), sorted(). Take the function min(). Using our example of even numbers above, the function min() will take as an input parameter the four numbers to compare, perform the comparison, and return the number with the lowest value.

We call a function by writing the function name followed by parentheses. In its simplest form, min() takes one input, the data in the form of a list or another Python data type we will discuss later in the course. When we call the function, we pass our list even_numbers as an argument inside the parentheses.

An argument is different from a parameter, although the two terms are often used interchangeably. A parameter is the generic variable listed inside the parentheses of a function, whereas an argument is the actual value or data you pass to the function when you call it.
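To make the distinction concrete, here is a minimal sketch built around a hypothetical function, lowest_value(), written only for illustration (Python's built-in min() already does this job). The name numbers in the definition is the parameter; the list even_numbers passed in the call is the argument.

```python
# Hypothetical helper for illustration; the built-in min() does the same thing.
def lowest_value(numbers):
    """Return the smallest element of a non-empty list of numbers."""
    lowest = numbers[0]
    for number in numbers[1:]:
        if number < lowest:
            lowest = number
    return lowest


even_numbers = [4, 2, 8, 6]

# `numbers` in the definition is the parameter;
# `even_numbers` in this call is the argument.
result = lowest_value(even_numbers)
print(result)
```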

# Find the number with the lowest value
min(even_numbers)
2

We can apply any other built-in Python function. For example, the max() function returns, intuitively, the number with the highest value.

# Find the number with the highest value
max(even_numbers)
8

The function sorted() returns a sorted list of numbers in increasing order. We assign the resulting sorted list to a variable named sorted_even_numbers:

# Sort numbers in increasing order
sorted_even_numbers = sorted(even_numbers)
sorted_even_numbers
[2, 4, 6, 8]

Functions often have multiple parameters. For example, in addition to your input data, the sorted() function takes two optional parameters. One of these parameters is reverse. If you pass reverse=True, the list will be sorted in descending order:

sorted(even_numbers, reverse=True)
[8, 6, 4, 2]

Our list of even numbers is now sorted in descending order.
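The other optional parameter of sorted() is key, which takes a function that is applied to each element before the elements are compared. As a small sketch, the list mixed_numbers below is a made-up example and not part of the course data:

```python
# Made-up list for illustration only
mixed_numbers = [4, -2, 8, -6]

# Default behaviour: sort by the values themselves
sorted_by_value = sorted(mixed_numbers)

# key=abs sorts by absolute value, ignoring the sign
sorted_by_size = sorted(mixed_numbers, key=abs)

# key and reverse can be combined
sorted_by_size_desc = sorted(mixed_numbers, key=abs, reverse=True)

print(sorted_by_value)
print(sorted_by_size)
print(sorted_by_size_desc)
```

Note that sorted() returns a new list each time and leaves mixed_numbers unchanged.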

Descriptive statistics with NumPy

Functions for scientific computing, data analysis, machine learning, and statistical modeling are not in the Python Standard Library but are part of Python libraries or packages. For example, the Python library NumPy for scientific computing includes functions for computing the mean (average of the numbers) and standard deviation (variation or dispersion of the numbers from the mean).

To use NumPy, we first import the library and, by convention, give it the alias np.

# Import the NumPy library
import numpy as np

Now that we have imported the NumPy module as np, you can view help on the module by typing the module alias followed by a question mark (?):

np?

We now prefix each function we use from NumPy with np., for example the functions mean() and std().

# Compute the mean of our list of numbers
np.mean(even_numbers)
5.0

# Compute the standard deviation of our list of numbers
np.std(even_numbers)
2.23606797749979
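By default, np.std() divides by the number of values, which gives the population standard deviation. If the numbers are a sample, np.std() accepts ddof=1 to divide by n - 1 instead. The sketch below cross-checks both versions with the standard library's statistics module rather than NumPy, so pstdev() and stdev() come from that module and not from the code above.

```python
import statistics

even_numbers = [4, 2, 8, 6]

# Population standard deviation: divide by n,
# the same formula as np.std(even_numbers)
population_sd = statistics.pstdev(even_numbers)

# Sample standard deviation: divide by n - 1,
# the same formula as np.std(even_numbers, ddof=1)
sample_sd = statistics.stdev(even_numbers)

print(population_sd)
print(sample_sd)
```

The sample version is always a little larger, and the gap shrinks as the number of values grows.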

Getting help

In Jupyter and Colab, you can access help information by using the help() function or a question mark ?. For example, to access help information about a function in the Python Standard Library, such as min(), you type in

help(min)

# Alternatively, you can simply type
min?
Help on built-in function min in module builtins:

min(...)
    min(iterable, *[, default=obj, key=func]) -> value
    min(arg1, arg2, *args, *[, key=func]) -> value
    
    With a single iterable argument, return its smallest item. The
    default keyword-only argument specifies an object to return if
    the provided iterable is empty.
    With two or more arguments, return the smallest argument.

Note

In Jupyter, you access help by pressing Shift + Tab when you are typing in a cell in edit mode. See this tutorial for the difference between edit mode and command mode.

# Get help about the NumPy function mean()
help(np.mean)
Help on function mean in module numpy:

mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
    Compute the arithmetic mean along the specified axis.
    
    Returns the average of the array elements.  The average is taken over
    the flattened array by default, otherwise over the specified axis.
    `float64` intermediate and return values are used for integer inputs.
    
    Parameters
    ----------
    a : array_like
        Array containing numbers whose mean is desired. If `a` is not an
        array, a conversion is attempted.
    axis : None or int or tuple of ints, optional
        Axis or axes along which the means are computed. The default is to
        compute the mean of the flattened array.
    
        .. versionadded:: 1.7.0
    
        If this is a tuple of ints, a mean is performed over multiple axes,
        instead of a single axis or all the axes as before.
    dtype : data-type, optional
        Type to use in computing the mean.  For integer inputs, the default
        is `float64`; for floating point inputs, it is the same as the
        input dtype.
    out : ndarray, optional
        Alternate output array in which to place the result.  The default
        is ``None``; if provided, it must have the same shape as the
        expected output, but the type will be cast if necessary.
        See `ufuncs-output-type` for more details.
    
    keepdims : bool, optional
        If this is set to True, the axes which are reduced are left
        in the result as dimensions with size one. With this option,
        the result will broadcast correctly against the input array.
    
        If the default value is passed, then `keepdims` will not be
        passed through to the `mean` method of sub-classes of
        `ndarray`, however any non-default value will be.  If the
        sub-class' method does not implement `keepdims` any
        exceptions will be raised.
    
    Returns
    -------
    m : ndarray, see dtype parameter above
        If `out=None`, returns a new array containing the mean values,
        otherwise a reference to the output array is returned.
    
    See Also
    --------
    average : Weighted average
    std, var, nanmean, nanstd, nanvar
    
    Notes
    -----
    The arithmetic mean is the sum of the elements along the axis divided
    by the number of elements.
    
    Note that for floating-point input, the mean is computed using the
    same precision the input has.  Depending on the input data, this can
    cause the results to be inaccurate, especially for `float32` (see
    example below).  Specifying a higher-precision accumulator using the
    `dtype` keyword can alleviate this issue.
    
    By default, `float16` results are computed using `float32` intermediates
    for extra precision.
    
    Examples
    --------
    >>> a = np.array([[1, 2], [3, 4]])
    >>> np.mean(a)
    2.5
    >>> np.mean(a, axis=0)
    array([2., 3.])
    >>> np.mean(a, axis=1)
    array([1.5, 3.5])
    
    In single precision, `mean` can be inaccurate:
    
    >>> a = np.zeros((2, 512*512), dtype=np.float32)
    >>> a[0, :] = 1.0
    >>> a[1, :] = 0.1
    >>> np.mean(a)
    0.54999924
    
    Computing the mean in float64 is more accurate:
    
    >>> np.mean(a, dtype=np.float64)
    0.55000000074505806 # may vary

Readability of your Python code

To write readable, consistent, and clean code, consult PEP 8, the style guide for Python code (PEP stands for Python Enhancement Proposal). The guide describes the conventions for writing readable Python code. The key rules to keep in mind when you write Python code are outlined below (for details and examples, check out the PEP 8 style guide).

  • Naming conventions

    • Variable names (e.g., even_numbers) should be lowercase, with words separated by underscores as necessary to improve readability.

    • Names to avoid (as, in some fonts, they are indistinguishable from the numerals one and zero): the characters ‘l’ (lowercase letter el), ‘O’ (uppercase letter oh), or ‘I’ (uppercase letter eye) as single character variable names.

  • Comments

    • PEP 8 distinguishes block comments, inline comments, and documentation strings (or docstrings for short):

      • Block comments in Python code cells apply to the lines of code that follow them and are indented to the same level as that code. Block comments are typically formed of complete sentences, each starting with a capitalized word and ending in a period. Each line of a block comment starts with a hash symbol # followed by a single space and some text.

      • Inline comments appear on the same line as a statement and are separated from the statement by at least two spaces. Each inline comment starts with a hash symbol # followed by a single space and some text.

      • Docstrings are used to document Python modules, functions, classes, or methods. Docstrings are surrounded by """triple double quotes""". You will likely read docstrings more often (for example, when accessing help information about a function) than write them. See PEP 8 and PEP 257 for details on docstrings.

  • Maximum line length

    • Limit all lines to a maximum of 79 characters. To split a long statement over multiple lines, wrap the expression in parentheses and break the line inside them. Another approach is to use the backslash symbol \, although PEP 8 prefers parentheses.

    • Long blocks of text (comments) should be limited to 72 characters.

    • When applicable and agreed upon, it would be okay to increase the line length limit up to 99 characters, provided that comments are still limited to 72 characters.

  • Indentation

    • Indentation in Python refers to the spaces at the beginning of a line of code. Indentation matters not only for readability but also because Python treats consecutive statements at the same indentation level as part of a single code block. Incorrect indentation will therefore either raise an IndentationError or, worse, run without an error and produce an unintended result.

    • Use 4 spaces per indentation level.

  • Tabs or Spaces?

    • Spaces are the preferred indentation method.

    • Python disallows mixing tabs and spaces for indentation.

  • Whitespace in Expressions and Statements

    • Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <>, <=, >=, in, not in, is, is not), Booleans (and, or, not). (e.g., sorted_even_numbers = sorted(even_numbers))

    • Don’t use spaces around the = sign when used to indicate a keyword argument (e.g., reverse=True)

As mentioned in PEP 8, section A Foolish Consistency is the Hobgoblin of Little Minds, style guide recommendations may not be applicable in some circumstances. In such circumstances, you could look for examples from the Python community and use your best judgment.
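As a compact illustration of several of the rules above, the sketch below combines a block comment, an inline comment, a docstring, lowercase names with underscores, four-space indentation, and a long expression wrapped in parentheses. The function value_range() is a hypothetical example written for this purpose, not part of any library.

```python
# Block comment: complete sentences that describe the code that follows.
# Each line starts with a hash symbol and a single space.
even_numbers = [4, 2, 8, 6]


def value_range(numbers):
    """Return the difference between the largest and smallest number."""
    return max(numbers) - min(numbers)  # inline comment, two spaces before #


# Parentheses let a long expression continue over several lines
# without a backslash.
summary = (
    "The numbers range from "
    + str(min(even_numbers))
    + " to "
    + str(max(even_numbers))
)

print(value_range(even_numbers))
print(summary)
print(value_range.__doc__)  # the docstring that help() would display
```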

Running command-line commands in Jupyter/Colab notebook

In addition to running Python code, we will also execute command-line commands in our Jupyter notebooks. Command-line commands are very useful for obtaining information (for example, you can check the version of your Jupyter notebook, Python, or packages), managing computer files, and installing Python packages.

To execute command-line commands, you would typically need to use a command-line interface (CLI) such as the Terminal (macOS) or Command Prompt (Windows), which can be challenging. Fortunately, the Jupyter notebook allows you to run command-line commands in code cells by prepending an exclamation mark (!) to the command. Anything that follows the mark on that line is executed not by Python but by your operating system’s command-line interface. You can think of the exclamation mark (!) as a gateway to the command line. (You can learn more about command-line interfaces from this tutorial by The Carpentries.)

As an example, you can determine the version of the Jupyter notebook you use by typing the line below, in which the exclamation mark (!) is followed by the command jupyter-notebook and the flag --version. Command-line flags specify options and modify how a command is executed. As output, the command prints the version of your active Jupyter notebook.

!jupyter-notebook --version
6.1.4
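To see roughly what the exclamation mark does, the sketch below runs a command-line command from plain Python using the standard library's subprocess module. This is an analogy rather than a description of how Jupyter implements !, and it asks for the Python version instead of the Jupyter version so that it also runs where Jupyter is not installed.

```python
import subprocess
import sys

# sys.executable is the path of the Python interpreter running this code.
# The list below corresponds to typing `python --version` in a terminal.
completed = subprocess.run(
    [sys.executable, "--version"],
    capture_output=True,
    text=True,
)

# Combine both output streams in case the version is written to stderr.
version_text = (completed.stdout + completed.stderr).strip()
print(version_text)
```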

Installing packages with pip

Command-line tools are particularly useful for installing packages that are not part of the Python Standard Library. To install packages, we will use Python’s package manager, pip. A list of all Python packages available for installation via pip can be found at https://pypi.org.

Because pip is a command-line tool, we will prepend an exclamation mark (!) to the package installation command every time we use pip. The pip install command supports flags. For example, we will use the flag -q (short for --quiet), which hides most of the output and warnings that may be confusing at first. To install statsmodels, the Python package for statistical modeling, we would type:

!pip install -q statsmodels

The command above installs the most recent version of the package. You can check which version was installed (along with other useful information about the package) by typing:

!pip show statsmodels
Name: statsmodels
Version: 0.12.2
Summary: Statistical computations and models for Python
Home-page: https://www.statsmodels.org/
Author: 
Author-email: 
License: BSD License
Location: /Users/valentindanchev/opt/anaconda3/lib/python3.8/site-packages
Requires: numpy, pandas, patsy, scipy
Required-by: dowhy

If you are interested in installing a previous version of statsmodels (for example, for the purpose of reproducing a research project which uses a previous version), you can find the available previous versions at https://pypi.org and install a version of choice by typing the version number, for example:

!pip install -q statsmodels==0.12.2
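You can also check an installed version from Python itself, which is handy when you want to record package versions inside a notebook. The sketch below uses importlib.metadata from the standard library (available from Python 3.8); the helper installed_version() is a hypothetical function written for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string like the one reported by `pip show`,
# or None when statsmodels is not installed in the current environment.
print(installed_version("statsmodels"))
```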

Jupyter magic commands

Jupyter magic commands are special commands that add to the Python syntax and provide capabilities that help researchers solve various research problems related to data analysis and workflow. Magic commands are prefixed in the Jupyter notebook by a single % character when they operate on one line of code (known as line magics) or by double %% characters when they operate on multiple lines of code (known as cell magics).

You can list all available magic commands by using the line magic %lsmagic as shown below.

%lsmagic
Available line magics:
%alias  %alias_magic  %autoawait  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %conda  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %pip  %popd  %pprint  %precision  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%markdown  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

Some of the popular magic commands we will use include:

  • Printing out the execution time of a Python command

%time np.mean(even_numbers)
CPU times: user 77 µs, sys: 41 µs, total: 118 µs
Wall time: 121 µs
5.0

  • Listing all variables already defined in the current notebook

%who
even_numbers	 np	 sorted_even_numbers	 
  • Displaying matplotlib/Seaborn graphs in the notebook (applies to older versions of Jupyter notebook, obsolete in newer versions)

%matplotlib inline

You can read more about the Jupyter magic commands here and here.
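Magic commands only work inside Jupyter and IPython. If you ever need a comparable timing in a plain Python script, the standard library's timeit module is one option. The sketch below is an illustration under that assumption; it times a simple mean calculation on our toy list and does not reproduce the exact output format of %time.

```python
import timeit

even_numbers = [4, 2, 8, 6]
number_of_runs = 10_000

# Run the statement many times and report the total time in seconds.
total_seconds = timeit.timeit(
    "sum(even_numbers) / len(even_numbers)",
    globals={"even_numbers": even_numbers},
    number=number_of_runs,
)

print(f"Average time per run: {total_seconds / number_of_runs:.2e} seconds")
```

Repeating the statement many times and averaging gives a more stable estimate than a single run.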

Loading real-world data with pandas

So far, we have used a toy data example. We will now use the pandas library to load a real-world data set. We will learn more about pandas in the next session; here we will use only its key data-loading functionality. Let’s first import the pandas library and, by convention, give it the alias pd.

# Import the pandas library
import pandas as pd

Data on Covid-19 by Our World in Data

We will load and explore the Data on COVID-19 by Our World in Data (OWID). Details about the data are available in this GitHub repository.

The data set is updated daily and includes data on confirmed cases, deaths, hospitalizations, testing, and vaccinations as well as other variables of potential interest.

The data on COVID-19 by Our World in Data is provided as a comma-separated values (CSV) file. We load the CSV file into Python using the read_csv() function from pandas. There is no need to download the file on your local computer or the cloud. We just specify the URL and use the code below to load the most recent online version of the data. We also assign the loaded data set to a variable called owid_covid.

owid_covid = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")

We can now perform various operations on the data object using so-called methods. Examples of pandas methods are head() and tail(): by default, head() displays the first five rows of the data and tail() displays the last five rows. You can display a custom number of rows by passing that number inside the parentheses.

Let’s display the top five rows using the method head():

# View the top five rows of the data set
owid_covid.head()
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
0 AFG Asia Afghanistan 2020-02-24 5.0 5.0 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN
1 AFG Asia Afghanistan 2020-02-25 5.0 0.0 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN
2 AFG Asia Afghanistan 2020-02-26 5.0 0.0 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN
3 AFG Asia Afghanistan 2020-02-27 5.0 0.0 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN
4 AFG Asia Afghanistan 2020-02-28 5.0 0.0 NaN NaN NaN NaN ... NaN NaN 37.746 0.5 64.83 0.511 NaN NaN NaN NaN

5 rows × 67 columns

# View the last five rows of the data set
owid_covid.tail()
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
189721 ZWE Africa Zimbabwe 2022-05-23 250702.0 60.0 181.571 5495.0 1.0 1.571 ... 1.6 30.7 36.791 1.7 61.49 0.571 NaN NaN NaN NaN
189722 ZWE Africa Zimbabwe 2022-05-24 250929.0 227.0 168.714 5496.0 1.0 1.714 ... 1.6 30.7 36.791 1.7 61.49 0.571 NaN NaN NaN NaN
189723 ZWE Africa Zimbabwe 2022-05-25 251228.0 299.0 174.429 5496.0 0.0 1.429 ... 1.6 30.7 36.791 1.7 61.49 0.571 NaN NaN NaN NaN
189724 ZWE Africa Zimbabwe 2022-05-26 251646.0 418.0 205.714 5498.0 2.0 1.571 ... 1.6 30.7 36.791 1.7 61.49 0.571 NaN NaN NaN NaN
189725 ZWE Africa Zimbabwe 2022-05-27 251959.0 313.0 212.857 5500.0 2.0 1.571 ... 1.6 30.7 36.791 1.7 61.49 0.571 NaN NaN NaN NaN

5 rows × 67 columns
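read_csv() accepts a file-like object as well as a URL or a file path. The sketch below uses an in-memory string, so it runs without an internet connection. The three rows and five columns are copied from the head() output above, and the names csv_text and covid_sample are made up for illustration. It also shows how to pass a custom number of rows to head() and tail().

```python
import io

import pandas as pd

# A tiny CSV excerpt, copied from the first rows shown above
csv_text = """iso_code,location,date,total_cases,new_cases
AFG,Afghanistan,2020-02-24,5.0,5.0
AFG,Afghanistan,2020-02-25,5.0,0.0
AFG,Afghanistan,2020-02-26,5.0,0.0
"""

# Wrap the string in StringIO so read_csv() can treat it like a file
covid_sample = pd.read_csv(io.StringIO(csv_text))

# Pass a number to head() or tail() to choose how many rows to display
first_two_rows = covid_sample.head(2)
last_row = covid_sample.tail(1)

print(first_two_rows)
print(last_row)
```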

Describe the Covid-19 data

In addition to pandas methods, we can use attributes to access information about the metadata. For example, the attribute shape gives the dimensions of a DataFrame.

# Number of rows and columns in the data set
owid_covid_shape = owid_covid.shape
owid_covid_shape
(189726, 67)

The returned object is called a tuple. Like lists, tuples contain a collection of data elements. But unlike lists, which are mutable, tuples are immutable, meaning that their elements cannot be changed. Also, whereas the elements of a list are enclosed in square brackets [], the elements of a tuple are enclosed in parentheses ().

To access a particular value in a tuple, we use square brackets. For example, to access the first element (i.e., number of rows) of the tuple owid_covid_shape, we type in

owid_covid_shape[0]
189726

To access the second element (i.e., number of columns) of the tuple, we type in

owid_covid.shape[1]
67

Note that indexing in Python starts from 0, so the first element has index 0, the second has index 1, and so on.
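The following sketch illustrates the tuple properties described above. It writes the shape out as a literal tuple so that it runs on its own; the variable names number_of_rows and number_of_columns are introduced here for illustration.

```python
# The shape reported above, written out as a literal tuple
owid_covid_shape = (189726, 67)

# Tuple unpacking assigns each element to its own variable
number_of_rows, number_of_columns = owid_covid_shape

# Negative indices count from the end, so -1 is the last element
last_element = owid_covid_shape[-1]

# Tuples are immutable: assigning to an element raises a TypeError
try:
    owid_covid_shape[0] = 0
except TypeError as error:
    error_message = str(error)

print(number_of_rows, number_of_columns, last_element)
print(error_message)
```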

As of 11 May 2021, the data set contained 87,310 rows and 59 columns. But this is a live data set in which the number of rows is updated daily. To display these updates, we can use the print() function. The code below combines fixed text in double quotes "" with the up-to-date number of rows owid_covid.shape[0] and number of columns owid_covid.shape[1], which are inserted into the sentence each time the cell is executed.

print(
    "In the most current data on COVID-19 by Our World in Data, the number of rows is",
    owid_covid.shape[0],
    "and the number of columns is",
    owid_covid.shape[1],
)
In the most current data on COVID-19 by Our World in Data, the number of rows is 189726 and the number of columns is 67
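An alternative way to build such a sentence is a formatted string literal, or f-string, which places the values directly inside the text. The sketch below uses the shape shown above as a stand-in for owid_covid.shape so that it runs on its own.

```python
# Stand-in for owid_covid.shape so the example runs on its own
number_of_rows, number_of_columns = (189726, 67)

# An f-string inserts the value of each expression inside {} into the text;
# the format specifier :, adds a thousands separator.
message = (
    f"The data set has {number_of_rows:,} rows "
    f"and {number_of_columns} columns."
)
print(message)
```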

In addition to the dimensions of the data set, we can access other metadata using attributes. For example, we can access the column labels of the data set using the attribute columns:

owid_covid.columns
Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'total_tests', 'new_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'new_vaccinations', 'new_vaccinations_smoothed',
       'total_vaccinations_per_hundred', 'people_vaccinated_per_hundred',
       'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred',
       'new_vaccinations_smoothed_per_million',
       'new_people_vaccinated_smoothed',
       'new_people_vaccinated_smoothed_per_hundred', 'stringency_index',
       'population', 'population_density', 'median_age', 'aged_65_older',
       'aged_70_older', 'gdp_per_capita', 'extreme_poverty',
       'cardiovasc_death_rate', 'diabetes_prevalence', 'female_smokers',
       'male_smokers', 'handwashing_facilities', 'hospital_beds_per_thousand',
       'life_expectancy', 'human_development_index',
       'excess_mortality_cumulative_absolute', 'excess_mortality_cumulative',
       'excess_mortality', 'excess_mortality_cumulative_per_million'],
      dtype='object')

Display a concise summary of the DataFrame using the method info()

owid_covid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189726 entries, 0 to 189725
Data columns (total 67 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   iso_code                                    189726 non-null  object 
 1   continent                                   178691 non-null  object 
 2   location                                    189726 non-null  object 
 3   date                                        189726 non-null  object 
 4   total_cases                                 182231 non-null  float64
 5   new_cases                                   182021 non-null  float64
 6   new_cases_smoothed                          180847 non-null  float64
 7   total_deaths                                163807 non-null  float64
 8   new_deaths                                  163798 non-null  float64
 9   new_deaths_smoothed                         162635 non-null  float64
 10  total_cases_per_million                     181390 non-null  float64
 11  new_cases_per_million                       181180 non-null  float64
 12  new_cases_smoothed_per_million              180011 non-null  float64
 13  total_deaths_per_million                    162979 non-null  float64
 14  new_deaths_per_million                      162970 non-null  float64
 15  new_deaths_smoothed_per_million             161812 non-null  float64
 16  reproduction_rate                           140710 non-null  float64
 17  icu_patients                                25287 non-null   float64
 18  icu_patients_per_million                    25287 non-null   float64
 19  hosp_patients                               26547 non-null   float64
 20  hosp_patients_per_million                   26547 non-null   float64
 21  weekly_icu_admissions                       6147 non-null    float64
 22  weekly_icu_admissions_per_million           6147 non-null    float64
 23  weekly_hosp_admissions                      12267 non-null   float64
 24  weekly_hosp_admissions_per_million          12267 non-null   float64
 25  total_tests                                 77125 non-null   float64
 26  new_tests                                   73556 non-null   float64
 27  total_tests_per_thousand                    77125 non-null   float64
 28  new_tests_per_thousand                      73556 non-null   float64
 29  new_tests_smoothed                          99918 non-null   float64
 30  new_tests_smoothed_per_thousand             99918 non-null   float64
 31  positive_rate                               92182 non-null   float64
 32  tests_per_case                              90545 non-null   float64
 33  tests_units                                 102647 non-null  object 
 34  total_vaccinations                          51818 non-null   float64
 35  people_vaccinated                           49369 non-null   float64
 36  people_fully_vaccinated                     46824 non-null   float64
 37  total_boosters                              23926 non-null   float64
 38  new_vaccinations                            42442 non-null   float64
 39  new_vaccinations_smoothed                   102364 non-null  float64
 40  total_vaccinations_per_hundred              51818 non-null   float64
 41  people_vaccinated_per_hundred               49369 non-null   float64
 42  people_fully_vaccinated_per_hundred         46824 non-null   float64
 43  total_boosters_per_hundred                  23926 non-null   float64
 44  new_vaccinations_smoothed_per_million       102364 non-null  float64
 45  new_people_vaccinated_smoothed              101351 non-null  float64
 46  new_people_vaccinated_smoothed_per_hundred  101351 non-null  float64
 47  stringency_index                            147714 non-null  float64
 48  population                                  188568 non-null  float64
 49  population_density                          169064 non-null  float64
 50  median_age                                  156690 non-null  float64
 51  aged_65_older                               155029 non-null  float64
 52  aged_70_older                               155868 non-null  float64
 53  gdp_per_capita                              155859 non-null  float64
 54  extreme_poverty                             101750 non-null  float64
 55  cardiovasc_death_rate                       156337 non-null  float64
 56  diabetes_prevalence                         163976 non-null  float64
 57  female_smokers                              118246 non-null  float64
 58  male_smokers                                116625 non-null  float64
 59  handwashing_facilities                      76812 non-null   float64
 60  hospital_beds_per_thousand                  138717 non-null  float64
 61  life_expectancy                             177427 non-null  float64
 62  human_development_index                     152298 non-null  float64
 63  excess_mortality_cumulative_absolute        6525 non-null    float64
 64  excess_mortality_cumulative                 6525 non-null    float64
 65  excess_mortality                            6525 non-null    float64
 66  excess_mortality_cumulative_per_million     6525 non-null    float64
dtypes: float64(62), object(5)
memory usage: 97.0+ MB

Let’s find the highest number of fully vaccinated people per hundred recorded for any country to date. We will use the variable people_fully_vaccinated_per_hundred and the pandas max() method.

owid_covid["people_fully_vaccinated_per_hundred"].max()
122.94

Once we know the highest number of fully vaccinated people per hundred, we determine the index of the observation (country) associated with that number using the pandas method idxmax().

index_highest_vaccination = owid_covid["people_fully_vaccinated_per_hundred"].idxmax()
index_highest_vaccination
66501

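Finally, we select the particular row (country in this case) by its index label using loc. The method idxmax() returns an index label rather than a position, so loc is the matching accessor; with a default integer index (0, 1, 2, ...) the label and the position coincide, so iloc would return the same row here.

owid_covid.loc[[index_highest_vaccination]]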
iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed ... female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
66501 GIB Europe Gibraltar 2022-04-21 17706.0 212.0 30.286 101.0 0.0 0.0 ... NaN NaN NaN NaN 79.93 NaN NaN NaN NaN NaN

1 rows × 67 columns

The above example is only for illustration of pandas functionality. We will unpack different approaches for data slicing and selecting with pandas in section Data Design and Data Wrangling.
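The three steps above (max(), idxmax(), then row selection) can be sketched on a small made-up table. The DataFrame `toy`, its values, and its index labels are hypothetical and are not taken from the OWID data; the index is deliberately not 0, 1, 2, ... to show that idxmax() returns a label, which is then passed to loc.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not OWID values)
toy = pd.DataFrame(
    {
        "location": ["A", "B", "C", "D"],
        "people_fully_vaccinated_per_hundred": [55.2, None, 81.7, 64.0],
    },
    index=[10, 20, 30, 40],  # deliberately not the default 0, 1, 2, ...
)

column = toy["people_fully_vaccinated_per_hundred"]

highest_value = column.max()     # missing values (NaN) are skipped by default
highest_label = column.idxmax()  # index label of the row holding the maximum

# loc selects by label; the double brackets return a one-row DataFrame
highest_row = toy.loc[[highest_label]]

print(highest_value, highest_label)
print(highest_row)
```

In this toy table, `toy.iloc[[highest_label]]` would raise an IndexError because there is no row at position 30, which is why selecting by label with loc is the safer habit.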

Hands-on exercise: Let’s practice your Python skills

  1. Open a new Colab notebook.

  2. Add a new Code cell and include the following:

    • Import the pandas library and add a comment describing your code

    • Load the Data on COVID-19 by Our World in Data and assign the data to a variable name. Use an informative variable name that clearly describes what the data is about; you can use underscores to separate words in your variable name, e.g. COVID19_Data_Our_World_in_Data. Add a comment.

  3. For the loaded data set:

    • Show the first 20 observations/rows

    • Show the last 5 observations/rows

  4. When ready, share your Colab notebook with the class:

    • Create a shareable link by clicking the Share button at the top right of your Colab notebook

    • Specify that Anyone with the link can view the notebook

    • Share the link with the class.

Before sharing your notebook, rerun your notebook from top to bottom using Restart and run all (under Runtime in the Colab menu bar) to ensure that your data analysis is computationally reproducible.
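As a hint for step 3, here is a minimal sketch of showing the first and last rows of a DataFrame with head() and tail(). It uses a small made-up table instead of the OWID data, so the name `toy_daily_cases` and its columns are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical table with 30 rows, standing in for a loaded data set
toy_daily_cases = pd.DataFrame(
    {
        "day": range(1, 31),
        "new_cases": [day * 10 for day in range(1, 31)],
    }
)

# Show the first 20 observations/rows
first_rows = toy_daily_cases.head(20)

# Show the last 5 observations/rows
last_rows = toy_daily_cases.tail(5)

print(first_rows)
print(last_rows)
```

In a notebook, a bare `toy_daily_cases.head(20)` on the last line of a cell displays the rows without print().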

Markdown Text Cells

Double click a text cell to see the Markdown syntax used to format the text.

Headings

Headings begin with the hash symbol ‘#’, followed by a space. There are six heading levels. To create the largest heading, use one hash symbol; to create the smallest heading, use six hash symbols.

# Header 1, Title example (Not shown below)
## Header 2, Subtitle example
### Header 3,
#### Header 4
##### etc.

Header 2, Subtitle example

Header 3,

Header 4

etc.
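The heading rule (one to six hash symbols, then a space, then the text) can also be expressed in a few lines of Python. This is only an illustrative sketch; the helper `markdown_heading` is a hypothetical name and is not part of Colab or Jupyter.

```python
def markdown_heading(text, level):
    """Return a Markdown heading line: `level` hash symbols, a space, then the text."""
    if not 1 <= level <= 6:
        raise ValueError("Markdown has six heading levels (1 to 6)")
    return "#" * level + " " + text


# Build one heading line for each of the six levels
heading_lines = [markdown_heading(f"Header {level}", level) for level in range(1, 7)]
print("\n".join(heading_lines))
```

Pasting the printed lines into a Text cell renders them as headings of decreasing size.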

Lists

* Item one (Note the space after *)
* Item two
* Item three
  * Sub-bullet
    * Sub-sub-bullet

  • Item one

  • Item two

  • Item three

    • Sub-bullet

      • Sub-sub-bullet

Bold and Italic Text

**This text is bold**
*This text is italic*
~~This was mistaken text~~
***This is both bold and italic text***

This text is bold

This text is italic

This was mistaken text

This is both bold and italic text

Hands-on exercise: Let’s practice your Markdown skills

  1. Create a new Colab notebook file.

  2. Add a new Text cell and include:

  • A notebook title (e.g. My Reproducible Research Workflow)

  • A bullet list with:

    • A bold word for Author: and then add italicised text for your name.

    • A bold word for Affiliation: and then add italicised text for your University and Department.

    • A bold word for Date: and then add text for today’s date.

  • Add another Text cell and include:

    • A list of at least two online datasets about Covid-19 you find interesting.

    • Add a hyperlink to each database in your list and include the name of the database in the title of the hyperlink.

  • Add another Text cell and write a short research question you would be interested in addressing using the Covid-19 datasets.

When you complete your exercise, download your notebook and make it available on an Open Science Framework (OSF) repository. You can download your Colab notebook from File and then Download .ipynb. You can learn how to create an OSF repository and deposit your notebook from the tutorial Getting Started With the Open Science Framework by the Center for Open Science.

This exercise draws on Earth Lab’s lesson Format Text In Jupyter Notebook With Markdown.