Discovering patterns in data¶
This lab will first introduce you to key concepts in machine learning for social science. Our focus will be on a particular branch of machine learning called unsupervised learning, which includes techniques for clustering and dimensionality reduction. We will then focus on hands-on data analysis with the scikit-learn library for machine learning in Python. Our research objective is to group UK counties with similar mobility trends using two popular techniques of unsupervised learning: k-means clustering and Principal Component Analysis (PCA).
Key themes¶
Definition of machine learning.
Supervised and unsupervised learning.
Introduction to unsupervised learning techniques, including clustering (k-means) and dimensionality reduction (Principal Component Analysis (PCA)).
Hands-on machine learning with scikit-learn.
Data-informed model parameter selection.
Learning resources¶
M Molina & F Garip. 2019. Machine learning for sociology. Annual Review of Sociology. Link to an open-access version of the article available at the Open Science Framework.
Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane. 2021. Chapter 7: Machine Learning. In Big Data and Social Science (2nd edition).
What is Machine Learning? OxfordSparks.
Additional resources¶
Kosuke Imai. 2018. Chapter 3.7.3: The k-means algorithm. In Quantitative Social Science. Princeton University Press.
Jake VanderPlas. 2016. In Depth: k-Means Clustering. In Python Data Science Handbook.
Sebastian Raschka. 2018. Python Machine Learning. Packt Publishing.
Machine Learning: What is it? What is it good for?¶
Field of study that gives computers the ability to learn [from data] without being explicitly programmed.
—Arthur Samuel, 1959
Data science tasks we can solve using machine learning¶
Pattern discovery using unsupervised machine learning
Prediction using supervised machine learning
Unsupervised and Supervised learning¶
Two types of machine learning are often distinguished in the literature: unsupervised learning and supervised learning.
Unsupervised learning: no outcome variable or labeled data are available, and the structure of the data is unknown. The goal is to explore the data and discover hidden structures and meaningful information without the guidance of an outcome variable or labels. To uncover such hidden structures, we use unsupervised learning techniques, including clustering (e.g., k-means) and dimensionality reduction (e.g., Principal Component Analysis (PCA)).
Unsupervised Learning, Machine Learning’s course by Andrew Ng
Supervised learning: we learn a model from labeled training data, or from a known outcome variable, that enables us to make predictions about unseen or future data. The learning is called supervised because the outcome variable or labels that guide the learning process (e.g., email labeled Spam or Ham, where ‘Ham’ is email that is not Spam) are already known.
Supervised learning, Machine learning’s course by Andrew Ng
In this lab, we will be focusing on unsupervised learning.
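The distinction between the two types of learning shows up directly in how estimators are fitted. The following is a minimal sketch, assuming scikit-learn is installed; the toy arrays `X` and `y` and the choice of `LogisticRegression` as the supervised example are illustrative assumptions and are not part of the lab's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy feature matrix: six observations, two features (made-up values)
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
# Labels exist only in the supervised setting (made-up: 0 = Ham, 1 = Spam)
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised: the estimator sees both the features and the known labels
classifier = LogisticRegression().fit(X, y)
predicted = classifier.predict(np.array([[0.9, 1.1], [5.0, 5.0]]))

# Unsupervised: the estimator sees only the features and infers the groups
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_labels = clusterer.labels_

print(predicted)
print(cluster_labels)
```

Note that the cluster labels returned by the unsupervised estimator are arbitrary identifiers: they mark which observations belong together, not what the groups mean.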
Research problem: clustering counties by mobility¶
Let’s formulate our simple research problem: to inform a public health intervention, we need to group a number of counties in the UK with similar mobility trends. We frame this problem as a clustering task and perform k-means clustering to sort the UK counties into clusters with similar mobility trends.
k-means clustering¶
Clustering is an exploratory data analysis (EDA) task that aims to group a set of observations into subgroups or clusters (without any prior information about cluster membership) such that observations assigned to the same cluster are more similar to each other than those in other clusters. To cluster observations in our mobility data, we will employ the k-means algorithm.
The k-means algorithm¶
The k-means algorithm is an iterative algorithm in which a set of operations is performed repeatedly until the results no longer change noticeably. The goal of the algorithm is to split the data into k similar groups, where each group is associated with its centroid, which is equal to the within-group mean. This is done by first assigning each observation to its closest cluster and then computing the centroid of each cluster based on the new cluster assignments. These two steps are iterated until the cluster assignments no longer change.
The k-means algorithm produces the prespecified number of clusters k and consists of the following steps:
1. Choose the initial centroids of k clusters.
2. Given the centroids, assign each observation to a cluster whose centroid is the closest to that observation.
3. Choose the new centroid of each cluster whose coordinate equals the within-cluster mean of the corresponding variable.
4. Repeat Steps 2 and 3 until cluster assignments no longer change.
—Kosuke Imai. 2018. Quantitative Social Science. Princeton University Press.
See also Jake VanderPlas’ Python Data Science Handbook.
On k-Means Advantages and Disadvantages, read here.
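The four steps above can be sketched in a few lines of NumPy. This is a teaching sketch under stated assumptions, not scikit-learn's implementation: the function name `kmeans_sketch`, the random choice of initial centroids, and the toy data are all hypothetical, and Euclidean distance is assumed as the measure of closeness.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means on a float array X (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each observation to its closest centroid (Euclidean)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once the cluster assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the within-cluster mean
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy data: two well-separated groups of 20 points each (made-up values)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

In practice you would rely on a tested library implementation; the sketch is only meant to make the assign-then-update loop concrete.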
Recent applications of k-means clustering in social sciences¶
Garip, F. 2012. Discovering diverse mechanisms of migration: The Mexico–US Stream 1970–2000. Population and Development Review, 38(3), 393-433. Open access version.
Bail, C. A. (2008). The configuration of symbolic boundaries against immigrants in Europe. American Sociological Review, 73(1), 37-59.
Let’s get coding with scikit-learn¶
Scikit-learn is a simple, efficient, and widely used library for machine learning in Python.
# Import libraries for today's lab
# Data analysis & visualisation
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_theme(font_scale=1.5)
%matplotlib inline
# Machine learning
from sklearn.cluster import KMeans  # For performing k-means
from sklearn.decomposition import PCA  # For performing PCA
from sklearn.preprocessing import StandardScaler  # For standardising data

# Suppress warnings to avoid potential confusion
import warnings

warnings.filterwarnings("ignore")
Load and process the mobility data, select UK data and compute mean mobility trends¶
# Load the mobility data
mobility_trends_complete = pd.read_csv(
    "https://www.gstatic.com/covid19/mobility/Global_Mobility_Report.csv",
    parse_dates=["date"],
)
mobility_trends_complete
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-15 | 0.0 | 4.0 | 5.0 | 0.0 | 2.0 | 1.0 |
1 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-16 | 1.0 | 4.0 | 4.0 | 1.0 | 2.0 | 1.0 |
2 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-17 | -1.0 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 |
3 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-18 | -2.0 | 1.0 | 5.0 | 0.0 | 2.0 | 1.0 |
4 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-19 | -2.0 | 0.0 | 4.0 | -1.0 | 2.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9900501 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-10 | NaN | NaN | NaN | NaN | 126.0 | NaN |
9900502 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-11 | NaN | NaN | NaN | NaN | 129.0 | NaN |
9900503 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-12 | NaN | NaN | NaN | NaN | 116.0 | NaN |
9900504 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-13 | NaN | NaN | NaN | NaN | 118.0 | NaN |
9900505 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2022-05-16 | NaN | NaN | NaN | NaN | 100.0 | NaN |
9900506 rows × 15 columns
Select a manageable subset of the dataset covering the period from 15 February 2020 to 30 June 2021 using the functions subperiod_mobility_trends() and rename_mobility_trends(), which we defined in the previous chapter, Exploratory Data Analysis and Visualisation.
# %load preprocess_mobility_trends.py
def subperiod_mobility_trends(data, start_date, end_date):
    """
    Add your mobility data in `data`.
    This function selects a subperiod of the mobility data based on
    a prespecified start date and end date (both inclusive).
    """
    # Keep only the rows whose date falls within the requested range
    mobility_trends = data[
        data["date"].isin(pd.date_range(start=start_date, end=end_date))
    ]
    return mobility_trends


def rename_mobility_trends(data):
    """
    This function renames the column headings of the six mobility categories.
    """
    mobility_trends_renamed = data.rename(
        columns={
            "retail_and_recreation_percent_change_from_baseline": "Retail_Recreation",
            "grocery_and_pharmacy_percent_change_from_baseline": "Grocery_Pharmacy",
            "parks_percent_change_from_baseline": "Parks",
            "transit_stations_percent_change_from_baseline": "Transit_stations",
            "workplaces_percent_change_from_baseline": "Workplaces",
            "residential_percent_change_from_baseline": "Residential",
        }
    )
    return mobility_trends_renamed


mobility_trends = subperiod_mobility_trends(
    mobility_trends_complete, "2020-02-15", "2021-06-30"
)
mobility_trends
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | retail_and_recreation_percent_change_from_baseline | grocery_and_pharmacy_percent_change_from_baseline | parks_percent_change_from_baseline | transit_stations_percent_change_from_baseline | workplaces_percent_change_from_baseline | residential_percent_change_from_baseline | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-15 | 0.0 | 4.0 | 5.0 | 0.0 | 2.0 | 1.0 |
1 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-16 | 1.0 | 4.0 | 4.0 | 1.0 | 2.0 | 1.0 |
2 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-17 | -1.0 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 |
3 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-18 | -2.0 | 1.0 | 5.0 | 0.0 | 2.0 | 1.0 |
4 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-19 | -2.0 | 0.0 | 4.0 | -1.0 | 2.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9900273 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-24 | NaN | NaN | NaN | NaN | 7.0 | NaN |
9900274 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-25 | NaN | NaN | NaN | NaN | 13.0 | NaN |
9900275 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-28 | NaN | NaN | NaN | NaN | -3.0 | NaN |
9900276 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-29 | NaN | NaN | NaN | NaN | 12.0 | NaN |
9900277 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-30 | NaN | NaN | NaN | NaN | 5.0 | NaN |
5970596 rows × 15 columns
mobility_trends = rename_mobility_trends(mobility_trends)
mobility_trends
country_region_code | country_region | sub_region_1 | sub_region_2 | metro_area | iso_3166_2_code | census_fips_code | place_id | date | Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-15 | 0.0 | 4.0 | 5.0 | 0.0 | 2.0 | 1.0 |
1 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-16 | 1.0 | 4.0 | 4.0 | 1.0 | 2.0 | 1.0 |
2 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-17 | -1.0 | 1.0 | 5.0 | 1.0 | 2.0 | 1.0 |
3 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-18 | -2.0 | 1.0 | 5.0 | 0.0 | 2.0 | 1.0 |
4 | AE | United Arab Emirates | NaN | NaN | NaN | NaN | NaN | ChIJvRKrsd9IXj4RpwoIwFYv0zM | 2020-02-19 | -2.0 | 0.0 | 4.0 | -1.0 | 2.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9900273 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-24 | NaN | NaN | NaN | NaN | 7.0 | NaN |
9900274 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-25 | NaN | NaN | NaN | NaN | 13.0 | NaN |
9900275 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-28 | NaN | NaN | NaN | NaN | -3.0 | NaN |
9900276 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-29 | NaN | NaN | NaN | NaN | 12.0 | NaN |
9900277 | ZW | Zimbabwe | Midlands Province | Kwekwe | NaN | NaN | NaN | ChIJRcIZ3-FJNBkRRsj55IcLpfU | 2021-06-30 | NaN | NaN | NaN | NaN | 5.0 | NaN |
5970596 rows × 15 columns
Select data for the United Kingdom and compute mean mobility trends for each county.
# Select data for the UK
mobility_trends_UK = mobility_trends[
    mobility_trends["country_region"] == "United Kingdom"
]
# Compute mean mobility trends for each UK county per mobility category
# and assign the result to the variable mobility_trends_UK_mean
mobility_trends_UK_mean = mobility_trends_UK.groupby("sub_region_1")[
    [
        "Retail_Recreation",
        "Grocery_Pharmacy",
        "Parks",
        "Transit_stations",
        "Workplaces",
        "Residential",
    ]
].mean()
mobility_trends_UK_mean
Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential | |
---|---|---|---|---|---|---|
sub_region_1 | ||||||
Aberdeen City | -50.046371 | -10.722567 | 20.557692 | -46.127016 | -42.489919 | 14.567010 |
Aberdeenshire | -28.253669 | -11.248447 | 22.474684 | -39.953878 | -37.207661 | 12.222222 |
Angus Council | -25.955975 | -6.125786 | 13.982143 | -31.150943 | -33.542339 | 10.831551 |
Antrim and Newtownabbey | -29.377358 | -7.465409 | -29.134328 | -53.752621 | -33.679435 | 12.859031 |
Ards and North Down | -27.262055 | 0.452830 | 6.838298 | -41.721311 | -35.991935 | 12.679039 |
... | ... | ... | ... | ... | ... | ... |
Windsor and Maidenhead | -42.714885 | -11.178197 | 0.379455 | -43.693920 | -43.711694 | 16.709220 |
Wokingham | -39.044025 | -16.285115 | 30.458101 | -51.299790 | -45.034274 | 18.237327 |
Worcestershire | -36.025497 | -9.990563 | 26.511954 | -34.033107 | -33.779758 | 12.112019 |
Wrexham Principal Area | -42.293501 | -10.448637 | -1.860140 | -38.511530 | -31.306452 | 11.113895 |
York | -41.892276 | -12.343621 | 2.055319 | -47.364729 | -44.381048 | 14.039666 |
151 rows × 6 columns
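To see what the group-wise mean does, here is a minimal sketch on a made-up DataFrame; the county names and values are hypothetical. It illustrates that pandas skips NaNs when averaging within a group, so a county ends up with a NaN mean only when a category has no observations at all for that county.

```python
import numpy as np
import pandas as pd

# Toy daily records for two made-up counties (hypothetical values)
toy = pd.DataFrame(
    {
        "sub_region_1": ["County A", "County A", "County B", "County B"],
        "Parks": [10.0, np.nan, np.nan, np.nan],
        "Workplaces": [-40.0, -20.0, -30.0, -10.0],
    }
)

# Mean per county and category; NaNs are skipped within each group
toy_mean = toy.groupby("sub_region_1")[["Parks", "Workplaces"]].mean()
print(toy_mean)
```

Here County A keeps a Parks mean based on its single available value, whereas County B, which has no Parks observations, gets NaN for that category.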
The k-means clustering algorithm in scikit-learn¶
The KMeans estimator class in scikit-learn allows you to set up the algorithm parameters before fitting the estimator to the data.
Parameters of the KMeans algorithm include:
n_clusters: Number of clusters k to form (same as the number of centroids to generate).
init (‘random’ or ‘k-means++’, default=‘k-means++’): Method of selecting the initial centroids. ‘random’ selects n_clusters observations (rows) at random from the data as the initial centroids. ‘k-means++’ selects initial cluster centers in a way that speeds up convergence.
n_init (default=10 in the scikit-learn version used here; ‘auto’ from version 1.4 onwards): Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of the n_init consecutive runs, where the best output is the one with the smallest sum of squared distances of samples to their closest cluster center.
max_iter (int, default=300): Maximum number of iterations of the k-means algorithm for a single run.
random_state (default=None): Determines random number generation for centroid initialization. Set it to an integer for computational reproducibility.
We instantiate the KMeans class with the following arguments:
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300, random_state=0)
kmeans
KMeans(n_clusters=3, random_state=0)
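Once an estimator configured like this is fitted, its results are stored in attributes ending with an underscore. The sketch below uses made-up data (`toy_X`, three artificial groups) rather than the mobility data, so the names and values are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated groups of 10 points each (made-up values)
rng = np.random.default_rng(0)
toy_X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(10, 2)),
    rng.normal(loc=4.0, scale=0.3, size=(10, 2)),
    rng.normal(loc=8.0, scale=0.3, size=(10, 2)),
])

# Same arguments as in the lab, applied to the toy data
toy_kmeans = KMeans(
    n_clusters=3, init="k-means++", n_init=10, max_iter=300, random_state=0
)
toy_kmeans.fit(toy_X)

print(toy_kmeans.labels_)           # cluster assignment of each observation
print(toy_kmeans.cluster_centers_)  # coordinates of the three centroids
print(toy_kmeans.inertia_)          # sum of squared distances to closest centroid
print(toy_kmeans.n_iter_)           # iterations used by the best of the n_init runs
```

The value of inertia_ is the quantity used to pick the best of the n_init runs, which is why it is a natural starting point for comparing different choices of n_clusters.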
Data preprocessing¶
We preprocess the data into the format expected by the scikit-learn library. As part of the data preprocessing, we first remove counties with one or more NaN (Not a Number) values using the Pandas method dropna(). Although some scikit-learn estimators, such as StandardScaler, handle NaNs, others, such as KMeans, raise an error when fit() receives missing values, so we remove NaNs at this stage to avoid unexpected downstream problems.
Tip
scikit-learn works on any numeric data stored as NumPy arrays, SciPy sparse matrices, or (nowadays) pandas DataFrames. If needed, you can convert a Pandas DataFrame into a NumPy array using the Pandas method to_numpy().
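The following minimal sketch shows why missing values are removed before clustering. The DataFrame `toy_means` and its county names are made up for illustration; the point is only that KMeans rejects input containing NaN, whereas the cleaned data can be fitted either as a DataFrame or as a NumPy array.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy county-level means with one missing value (hypothetical values)
toy_means = pd.DataFrame(
    {
        "Parks": [20.0, np.nan, -5.0, 18.0],
        "Workplaces": [-40.0, -35.0, -30.0, -42.0],
    },
    index=["County A", "County B", "County C", "County D"],
)

# Fitting on data that contains NaN raises a ValueError
try:
    KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_means)
    fit_failed = False
except ValueError as error:
    fit_failed = True
    print(f"fit() rejected the data: {error}")

# Drop every row (county) with at least one NaN, then convert to a NumPy array
toy_clean = toy_means.dropna()
toy_array = toy_clean.to_numpy()

toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_array).labels_
print(toy_clean.index.tolist(), toy_labels)
```

Dropping rows is the simplest option; whether it is appropriate depends on how many counties are lost and why their values are missing.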
# Drop NaNs from the DataFrame
mobility_trends_UK_mean_NaNdrop = mobility_trends_UK_mean.dropna()
mobility_trends_UK_mean_NaNdrop
Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential | |
---|---|---|---|---|---|---|
sub_region_1 | ||||||
Aberdeen City | -50.046371 | -10.722567 | 20.557692 | -46.127016 | -42.489919 | 14.567010 |
Aberdeenshire | -28.253669 | -11.248447 | 22.474684 | -39.953878 | -37.207661 | 12.222222 |
Angus Council | -25.955975 | -6.125786 | 13.982143 | -31.150943 | -33.542339 | 10.831551 |
Antrim and Newtownabbey | -29.377358 | -7.465409 | -29.134328 | -53.752621 | -33.679435 | 12.859031 |
Ards and North Down | -27.262055 | 0.452830 | 6.838298 | -41.721311 | -35.991935 | 12.679039 |
... | ... | ... | ... | ... | ... | ... |
Windsor and Maidenhead | -42.714885 | -11.178197 | 0.379455 | -43.693920 | -43.711694 | 16.709220 |
Wokingham | -39.044025 | -16.285115 | 30.458101 | -51.299790 | -45.034274 | 18.237327 |
Worcestershire | -36.025497 | -9.990563 | 26.511954 | -34.033107 | -33.779758 | 12.112019 |
Wrexham Principal Area | -42.293501 | -10.448637 | -1.860140 | -38.511530 | -31.306452 | 11.113895 |
York | -41.892276 | -12.343621 | 2.055319 | -47.364729 | -44.381048 | 14.039666 |
141 rows × 6 columns
Data standardisation¶
It is good practice to standardise our input features or variables before applying the k-means algorithm. Standardisation of input features in a data set is a common requirement for many statistical and machine learning estimators. By standardising individual features, all features are converted to the same scale so that the output of the clustering procedure is not influenced by how individual features are measured.
In our example data, the six features are measured on a similar scale, so one may argue that standardisation is not strictly necessary, but we will perform it so that the procedure becomes part of your data analysis workflow.
The sklearn.preprocessing module includes StandardScaler among other classes for data scaling. StandardScaler calculates the standard score, or z-score, of a sample observation x as z = (x - M) / SD, where M is the mean of the sample observations and SD is the standard deviation of the sample observations. In simple words, for each observation in a column, we subtract the mean and divide by the standard deviation of that column.
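The z-score formula can be checked by hand on a small example. The sketch below uses a made-up DataFrame (`toy`, hypothetical values) and compares a manual computation with StandardScaler. One detail worth noting is that StandardScaler divides by the population standard deviation (ddof=0), whereas the pandas method std() uses ddof=1 by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data: three observations, two features (hypothetical values)
toy = pd.DataFrame(
    {"Parks": [10.0, 20.0, 30.0], "Workplaces": [-50.0, -40.0, -30.0]}
)

# Manual z-scores: subtract the column mean, divide by the column SD.
# ddof=0 gives the population SD, which is what StandardScaler uses.
manual_z = (toy - toy.mean()) / toy.std(ddof=0)

# The same computation with scikit-learn
scaled = StandardScaler().fit_transform(toy)

print(manual_z)
print(scaled)
print(np.allclose(manual_z.to_numpy(), scaled))
```

After standardisation each column has mean 0 and standard deviation 1, so no single feature dominates the distances that k-means relies on.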
# Data standardisation
scaler = StandardScaler()  # Initialise the scaler using the default arguments
mobility_trends_UK_standardised = scaler.fit_transform(
    mobility_trends_UK_mean_NaNdrop
)  # Fit to input data (continuous variables) and return the standardised variables
mobility_trends_UK_standardised
array([[-2.41537629e+00, -5.87410490e-01, 2.10134074e-01,
-7.91288930e-01, -1.67445518e+00, 1.27693514e+00],
[ 1.31187764e+00, -6.92727365e-01, 2.89076997e-01,
-1.75449527e-01, -4.93614030e-01, 5.60796729e-02],
[ 1.70485729e+00, 3.33177282e-01, -6.06512500e-02,
7.02741514e-01, 3.25763531e-01, -6.67998047e-01],
[ 1.11969050e+00, 6.48938482e-02, -1.83621506e+00,
-1.55202810e+00, 2.95115745e-01, 3.87645364e-01],
[ 1.48147561e+00, 1.65066306e+00, -3.54839347e-01,
-3.51770707e-01, -2.21840285e-01, 2.93929581e-01],
[ 2.00258288e+00, 2.11421107e-01, 3.05692453e-01,
2.65069849e+00, 3.83734106e-01, -1.67810664e+00],
[ 1.09194483e+00, 6.58562770e-01, -4.56719269e-01,
4.51770033e-01, 9.42324863e-01, -8.67190528e-01],
[-2.21396518e+00, -1.85507507e+00, 4.28819439e-01,
-9.18325110e-01, -2.03772158e+00, 1.31997203e+00],
[-3.28335111e-01, -9.81312186e-01, -4.81345232e-01,
-8.88417676e-01, 5.17312190e-01, 6.06094340e-01],
[-7.31534167e-01, -7.40711678e-01, -7.39828984e-01,
-1.01490455e+00, -1.09824341e+00, 5.09791231e-01],
[-4.17089442e-01, -2.39916438e-01, -2.18071876e+00,
4.71429465e-01, 1.22085915e+00, -5.87127887e-01],
[ 6.34919974e-01, 6.12376786e-01, -9.35957607e-02,
1.92957377e+00, 1.28576034e+00, -1.41051636e+00],
[ 4.67831907e-01, -5.43432333e-02, -6.76448867e-01,
-1.72103240e-01, 6.05199222e-01, -2.57406512e-01],
[-9.44169824e-01, -8.01254388e-01, -3.82735021e-02,
-1.10237086e+00, -1.48290652e+00, 2.02436811e+00],
[-2.07332963e-01, 5.09093962e-01, -4.33669270e-01,
-7.48919361e-01, 3.03679097e-01, 3.32261862e-02],
[-1.08262990e+00, -1.54858287e+00, 6.84139397e-01,
-1.05465004e+00, -1.94172190e+00, 1.27498524e+00],
[-2.28479278e+00, -1.54696956e+00, 1.40169462e+00,
-1.82849460e+00, -1.80585029e+00, 1.84153895e+00],
[-4.13371668e-01, -1.19533632e+00, 2.14491224e-01,
-4.03248568e-01, -1.00633215e+00, 1.55218767e+00],
[ 9.14953837e-01, -4.32207224e-01, -7.37812544e-01,
-1.56417523e-01, 4.56918025e-01, -2.04746545e-01],
[-9.97292725e-01, -9.48753868e-01, 2.41068305e-01,
-4.28912502e-01, -7.31326728e-01, 1.25170028e+00],
[-2.34543767e+00, -1.84777037e+00, -9.86176808e-01,
-1.78656312e+00, -2.53573545e+00, 1.38310243e+00],
[-3.73703914e-01, 1.96974821e+00, 1.42562908e-01,
-3.20385557e-01, 6.54325820e-01, -6.92023258e-01],
[ 1.77154910e+00, 2.35265039e+00, -4.67063822e-01,
1.03247984e+00, 1.14649320e+00, -1.04451564e+00],
[ 2.80203585e-01, 9.05448548e-02, 6.35362397e-01,
7.94229726e-01, -1.33953253e-01, 9.48051851e-01],
[ 4.42795821e-02, -9.80157490e-02, -3.14236903e-01,
-5.89667840e-01, -7.02608759e-01, -1.27607217e+00],
[-2.31326176e-01, -5.18900999e-01, 4.54885264e-01,
8.00835190e-02, -4.13592167e-01, 4.57931896e-01],
[-5.46873897e-01, -3.50548290e-01, 4.54961929e-02,
-2.41026933e-01, -5.12150647e-01, 3.24134813e-01],
[ 5.01536367e-01, 2.45845230e+00, 1.11553059e+00,
1.15950961e+00, 2.57256716e-01, -8.16295581e-01],
[ 2.71938293e+00, -3.24968745e-01, 2.62289434e+00,
1.64105820e+00, -4.13853194e-01, -1.30424830e+00],
[ 4.13223226e-01, 1.26185094e-01, 6.32126041e-01,
7.13201592e-01, 7.01723227e-01, -4.52718023e-01],
[ 8.30740701e-01, 7.93206094e-01, 8.85967884e-01,
1.32457083e+00, 5.78600973e-02, -8.01723499e-01],
[ 2.78207349e-03, -3.91481883e-01, -1.42321175e+00,
-7.68578794e-01, 5.60128950e-01, -1.85592263e-01],
[ 9.99214987e-01, 1.14264592e+00, -5.03484910e-01,
4.16424882e-01, 7.87002143e-01, -1.14459949e+00],
[-5.63568276e-01, 1.58499239e-01, 6.15121148e-01,
-1.32519642e+00, -1.59643308e-01, -1.41011204e-01],
[-2.64686613e-01, 5.73815584e-01, 9.02861806e-01,
1.24289248e+00, 2.63312668e-01, -1.08285957e-01],
[ 7.36033354e-01, 1.51757097e+00, -9.55616532e-01,
7.97048121e-02, 8.03057720e-01, -7.86508306e-01],
[ 7.98585653e-01, 7.63343834e-01, 1.53908709e+00,
1.74000469e+00, 6.02532075e-01, -6.86214329e-01],
[ 1.93697825e-01, 5.74194792e-01, 2.32295345e+00,
1.17977073e+00, -4.22511086e-02, -3.27218433e-01],
[ 7.49658561e-01, 1.32402063e+00, 4.11615487e-01,
1.43306852e+00, 1.31550672e+00, -1.63551190e+00],
[-6.58641059e-01, -9.62056508e-01, -4.01955754e-01,
-2.56596972e-01, -8.50119888e-01, -3.04841967e-01],
[ 6.38522267e-01, 9.02912210e-01, -4.52383291e-01,
1.45879310e+00, -5.36430789e-01, 5.04308110e-01],
[ 2.25412405e+00, 1.26062345e+00, -2.70119627e+00,
-2.28236281e-01, -1.43287852e+00, 1.60023489e+00],
[ 6.18399411e-01, 6.69405557e-01, 2.13101597e+00,
1.34271879e+00, 9.37555002e-01, -6.43560376e-01],
[ 3.89600817e-01, 4.26595899e-01, 1.26236199e+00,
6.67435614e-01, 1.14903272e-01, -2.51113270e-01],
[-2.93314794e+00, -1.54457592e+00, -4.06682329e-01,
-2.27125929e+00, -3.40943264e+00, 2.59972243e+00],
[ 2.80645148e-01, -1.58584033e-01, 4.77542547e-01,
-7.50851189e-01, -3.29725724e-01, 6.80465213e-01],
[ 5.45997570e-01, -8.83124919e-01, -1.77862601e-01,
-8.02041658e-01, -4.92211395e-02, 2.49417221e-01],
[ 5.07990413e-01, 1.81230488e+00, -1.59848407e+00,
3.01359819e-01, 1.28981667e+00, -1.30511626e+00],
[ 6.41032672e-01, 7.25393078e-01, 5.87751098e-01,
1.37225816e-01, 4.42522634e-01, -2.64987231e-01],
[ 2.46243012e-01, -8.25605623e-01, -9.66875108e-01,
-3.21222129e-01, 7.15671917e-02, -1.88649381e-01],
[-2.16691003e+00, -1.47356459e+00, -6.04721112e-01,
-2.26470128e+00, -1.72079496e+00, 1.05846299e+00],
[-2.14333635e-01, -3.06264997e-01, 5.71263043e-01,
5.27578736e-01, -8.70086780e-02, 4.00978766e-01],
[-1.71155996e+00, -1.35866849e+00, 7.66843214e-01,
-1.35821786e+00, -2.53682878e+00, 2.33003813e+00],
[-2.72949807e-01, -6.52074448e-01, 4.39292077e-01,
-4.99890346e-01, -3.05828277e-01, 3.51667405e-01],
[ 1.05228158e+00, 7.64781929e-01, 2.59030425e-01,
1.53251597e+00, 5.51933518e-01, -1.39435184e+00],
[-3.40666694e-01, -1.02394470e-01, 7.10133590e-01,
1.57309470e-01, -1.23213130e-01, 8.55885418e-01],
[-7.16844000e-01, -1.84076537e-01, -2.56369869e+00,
9.33426133e-01, 1.55492416e+00, -1.24951587e+00],
[-8.68574473e-02, -7.16864763e-01, 4.07809474e-01,
7.89535817e-01, 1.54761863e+00, -8.79109984e-01],
[-6.44915163e-01, -4.09765378e-01, 3.00150506e-01,
-8.68784983e-01, -1.05038802e+00, 1.60479716e+00],
[ 1.09817702e+00, -9.47379435e-02, 1.29748466e+00,
4.36084315e-01, 2.60967426e-03, -1.26638571e+00],
[ 1.61298931e-01, 4.45696781e-01, -2.95432646e+00,
9.23988198e-01, -1.84744710e-01, -3.74400569e-01],
[ 9.30834348e-01, 9.88358987e-01, -5.37702293e-01,
1.22332637e+00, 1.10733781e+00, -1.41475499e+00],
[ 1.49008100e+00, 1.51715112e+00, 1.10412350e+00,
1.43369595e+00, 1.39978813e+00, -1.31015656e+00],
[ 2.52791134e-01, 2.42628706e-02, 4.89546495e-01,
1.35651975e-01, 1.00990087e-01, 3.26616517e-01],
[-4.28650440e-01, -4.15575521e-01, -1.49066157e-01,
6.49273481e-02, 1.27758176e+00, -1.24305429e+00],
[ 2.81747471e-01, -1.51138242e-02, 1.18620310e+00,
6.11550108e-01, 6.57643438e-01, -2.07875846e-01],
[-1.62860149e+00, -1.29204685e+00, 8.76055007e-01,
-8.61185510e-01, 2.72884070e-01, -3.00251579e-01],
[-2.89775591e-02, -4.58944935e-02, 1.86393205e-01,
2.83582437e-01, -2.33198855e-01, 5.95843693e-01],
[ 1.11291967e+00, 1.32099448e+00, 9.23389281e-01,
9.76095144e-01, 1.53256874e+00, -1.05584050e+00],
[ 5.49941708e-01, -1.06404494e-01, -2.03025104e+00,
3.69783876e-01, -5.08036517e-01, 1.04620831e+00],
[ 1.60547630e-01, -3.88542941e-01, -1.14117162e+00,
-2.44514578e+00, 1.28530964e+00, -2.04951200e-01],
[-2.91278134e-01, 2.03615734e-01, 4.31891105e-01,
-1.20716914e-01, -2.05275379e-01, 2.09543280e-01],
[-8.39879102e-02, -1.42133957e-01, 2.21155977e-01,
-3.75182043e-01, -1.24610121e-01, -1.79113271e-01],
[ 1.06734102e+00, 5.48559756e-01, -3.06610248e+00,
2.97031258e-01, 1.99696925e+00, -7.44713373e-01],
[ 1.16379315e+00, 1.37776129e+00, -1.00397095e+00,
8.44958686e-01, 5.73199329e-01, -1.48322085e-01],
[-6.74534146e-01, 5.35579285e-02, -1.02887208e+00,
-6.98515922e-01, 1.35336575e+00, -1.24003431e+00],
[ 5.87231749e-01, 1.03348521e+00, -3.23060161e-01,
7.06434607e-01, -6.90571122e-01, 6.32351230e-01],
[-3.80270812e-01, -1.41093100e+00, 1.79398017e-01,
-9.56948230e-01, -3.16037155e-01, 9.37975090e-01],
[-3.37908510e-02, -1.59015043e+00, -7.77171529e-01,
7.02532371e-01, -5.74271117e-01, 1.06204838e-01],
[ 7.96988228e-01, -5.49345061e-01, -1.91604644e-01,
4.00940523e-01, 9.45183094e-01, -1.96264956e+00],
[ 6.74002805e-01, 3.34517314e+00, -2.72192267e-01,
-1.00009998e+00, 6.06100628e-01, -4.82884217e-01],
[-5.10347549e-01, -4.48581330e-01, -3.15824134e-01,
-4.92510164e-01, -6.40438031e-03, 1.96856875e-01],
[ 1.25630113e+00, 2.15658096e+00, -2.47587397e-01,
7.96093374e-01, 1.03336681e+00, -3.38756851e-01],
[ 9.78453000e-01, 1.54057068e-01, 1.08356714e+00,
7.36095370e-01, 9.16510695e-01, -6.55529139e-01],
[ 1.44884682e+00, 2.05497753e+00, -1.00373159e+00,
1.75807659e+00, 3.99678778e-01, -5.97764058e-01],
[ 4.48111213e-01, 2.45009017e-01, 4.14817282e-02,
1.88084347e+00, 1.86942038e+00, -1.87495438e+00],
[ 6.39264986e-01, -1.73817839e-01, -2.48957355e+00,
-2.70478669e-01, 3.67707787e-01, -7.82936259e-02],
[ 6.25956022e-01, 7.08942028e-01, -4.92139556e-01,
1.55959998e+00, 1.60936490e+00, -1.37707138e+00],
[ 3.46468406e-03, -5.80221591e-02, 8.56699378e-01,
-1.57311556e+00, 7.02150836e-02, 4.21075187e-01],
[ 7.62039740e-01, 2.55262608e-01, 8.58026425e-01,
1.34394919e+00, -1.23652422e-01, -3.53544690e-01],
[-1.17816393e-01, -2.25268232e-01, -5.08726992e-01,
4.02392764e-01, 7.69069752e-01, -9.54373891e-02],
[ 5.71631132e-01, 1.12726832e+00, 1.37285954e+00,
1.29921707e+00, 1.54463075e-01, -6.50961458e-01],
[-3.03161070e+00, -1.35627621e+00, 1.28228457e+00,
-1.58863061e+00, -8.68305156e-01, -1.17706720e-01],
[-3.19895877e-01, -2.73070121e-01, 1.10344469e-01,
7.32221235e-01, 4.21126285e-01, -1.24111849e-01],
[-5.35541037e-01, -1.04817560e+00, 1.13107674e-01,
-5.19481536e-01, -1.07653918e+00, 1.45166915e+00],
[ 1.37749376e+00, 2.06967224e+00, 1.03342075e+00,
3.60129778e-01, 1.18125200e+00, -1.72294656e+00],
[-5.58719885e-01, -9.33086936e-01, 3.27320542e-01,
9.97825305e-02, -5.34177276e-01, -1.53553186e-01],
[-2.17756147e-01, -2.83057826e-01, -1.15057885e+00,
-9.02011964e-01, 1.22040845e+00, -7.45152709e-01],
[-9.66288743e-01, -9.95747909e-01, 7.60416248e-01,
5.80423774e-01, 1.24447538e-01, -7.22197418e-01],
[-8.48884276e-01, -1.14344048e+00, 1.10948869e-01,
3.84525108e-01, -8.42457942e-01, -7.20442253e-04],
[ 5.11217435e-01, -9.29693301e-02, 3.85360723e-01,
1.45695782e-01, 1.08564833e+00, -6.99833046e-01],
[-1.96303916e+00, -6.90117766e-01, 2.70115427e-01,
-2.31438314e+00, -1.90070795e+00, 2.53689337e+00],
[-6.12862155e-01, 3.95314916e-01, -6.28826401e-01,
1.00432558e+00, 1.21454931e+00, -1.14703650e+00],
[-6.40820778e-01, 1.76738105e+00, -4.34158845e-01,
-1.95880190e+00, -6.46852958e-01, 6.48396695e-01],
[-6.24754977e-02, 2.36955791e-01, -7.06181374e-01,
-5.99382187e-01, 6.53424414e-01, -2.30024611e-01],
[ 1.23489386e+00, -1.47549684e-01, 5.87986383e-02,
3.20010005e-01, 4.58338698e-01, -9.66638515e-01],
[ 5.26336158e-01, 2.17542370e-01, 6.01233186e-01,
6.01633925e-01, 1.25008009e+00, -4.60496924e-01],
[ 4.42732842e-01, -2.28751942e+00, -1.07182966e-01,
-1.24856175e+00, -6.95979555e-01, 1.02781782e+00],
[ 2.44440855e-01, 2.12100527e-01, 8.73219238e-01,
2.12948099e+00, 9.10159412e-01, -6.83801744e-01],
[ 9.20630362e-02, 1.79092743e-01, -6.04521843e-01,
3.93210020e-01, -5.10872875e-02, -9.01347151e-01],
[-1.90500018e-01, -3.77676214e-01, 1.40173167e+00,
-6.76644044e-01, -5.94877340e-01, 1.16431008e+00],
[ 1.22055326e+00, 5.76189492e-01, 1.13972337e+00,
1.74745081e-01, 3.19772526e-01, 3.11688572e-01],
[-2.23343653e-01, -1.15253437e-01, 3.92435213e-01,
-2.85070809e-01, 5.49514272e-01, -4.71387384e-01],
[-2.03419071e+00, -6.26516482e-01, -2.05213605e-01,
-1.01632511e+00, -1.22356368e+00, 2.09040279e-01],
[-2.48208585e-01, 6.37987567e-01, 8.25699195e-01,
-1.03990657e+00, -6.57219120e-01, 7.03108275e-01],
[ 1.33278925e-01, -2.50996324e-01, 2.49996081e-01,
6.94763834e-01, 4.59955378e-01, -1.38575503e-01],
[-1.25145410e+00, 3.17642873e-01, -2.16452239e-02,
-1.11261887e+00, -1.33810528e+00, 2.84212909e-01],
[ 1.05412429e-01, 9.62104753e-01, -9.51999530e-03,
2.31124272e-01, 9.29254484e-01, -5.84621864e-01],
[-2.82752595e-01, -1.00763038e-01, 1.49771477e-01,
-7.81127368e-01, 1.10600040e+00, -1.03354257e+00],
[ 6.23017701e-02, -4.69149676e-02, 6.56474606e-01,
9.25057384e-01, 4.09338098e-01, -3.12559651e-01],
[-7.51786133e-01, -9.03077530e-01, 4.36249525e-01,
-7.56957751e-01, -1.52692591e+00, 2.13425814e+00],
[-5.52618543e-01, -8.24666042e-01, 6.60124236e-01,
-8.96155963e-01, -6.08543226e-01, 1.29536595e-02],
[-6.58158306e-01, -1.23012960e+00, -8.82904700e-01,
-8.24210805e-01, 1.25756607e-03, 3.65410610e-01],
[-3.55771528e-01, -9.40035435e-01, -7.09654842e-01,
6.27766387e-01, -5.05732477e-02, 4.99329011e-02],
[ 6.14123605e-01, 2.08646620e+00, 1.48454871e+00,
1.20886733e+00, 8.52635020e-01, -1.24475213e+00],
[-6.55693853e-01, -2.18584595e-01, 2.47348392e-01,
-1.89314446e-01, -5.02205260e-01, 9.44491123e-02],
[ 1.58617457e+00, 7.97110292e-01, 1.18959191e-01,
-4.41061011e-01, -5.62120845e-01, 7.86911707e-01],
[-4.41617876e-01, -7.18726629e-01, 8.28438586e-01,
-3.40672419e-01, -2.99811857e-01, 2.46776904e-01],
[-4.38669427e-01, 2.17491065e-01, 2.25606682e-01,
-2.14202060e-02, -1.09095078e-01, 6.85098299e-01],
[-1.03488502e+00, -1.24419495e+00, -1.36924538e-01,
-5.88683983e-02, -1.11152747e+00, 1.95226460e+00],
[ 6.13047931e-01, 5.96422528e-01, -3.11757828e+00,
3.91746020e-01, 9.27502200e-02, 5.18426272e-02],
[-5.18330456e-02, 1.37692159e+00, -1.08294970e+00,
-4.04799959e-01, -3.09276614e-01, 4.24008514e-01],
[-5.58457752e-01, -5.02145516e-01, 1.42395728e-01,
-6.84205280e-01, 8.81843285e-03, -1.96627258e-02],
[-6.17285171e-02, -1.10134851e-01, 1.22496465e+00,
-9.72875943e-01, -1.35134053e-01, 4.23881862e-01],
[-4.19565064e-01, -3.28739566e-01, 3.05450342e-01,
-4.13423496e-01, 3.23716320e-02, 1.30353765e-01],
[ 3.06583623e-01, -9.08665447e-01, 2.17643348e-01,
3.84861943e-01, 6.21764978e-02, 1.89300542e-01],
[-1.16145602e+00, -6.78658516e-01, -6.20818622e-01,
-5.48560462e-01, -1.94758103e+00, 2.39231451e+00],
[-5.33620819e-01, -1.70141038e+00, 6.17839206e-01,
-1.30733091e+00, -2.24324202e+00, 3.18795067e+00],
[-1.73552203e-02, -4.40813569e-01, 4.55334390e-01,
4.15213547e-01, 2.72688684e-01, -1.29939186e-03],
[-1.08938585e+00, -5.32551106e-01, -7.13046565e-01,
-3.15592112e-02, 8.25592857e-01, -5.20990424e-01],
[-1.02076352e+00, -9.12055617e-01, -5.51805464e-01,
-9.14764652e-01, -2.09721434e+00, 1.00236397e+00]])
In the above cell, we printed out the full arrays. This may be necessary for research transparency, but it may not be practical in some settings, for example when you work with large data. In such settings, you could select and print only a few rows. For example, to print the first 3 rows, type:
mobility_trends_UK_standardised[0:3,]
array([[-2.41537629, -0.58741049, 0.21013407, -0.79128893, -1.67445518,
1.27693514],
[ 1.31187764, -0.69272736, 0.289077 , -0.17544953, -0.49361403,
0.05607967],
[ 1.70485729, 0.33317728, -0.06065125, 0.70274151, 0.32576353,
-0.66799805]])
We now fit the k-means estimator we already created (kmeans) to our data. With the settings used in this lab, this performs 10 runs of the k-means algorithm (each with a different centroid seed) on the data, with a maximum of 300 iterations per run:
kmeans.fit(mobility_trends_UK_standardised)
KMeans(n_clusters=3, random_state=0)
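The number of runs and the iteration cap come from the estimator's default arguments, which can differ between scikit-learn versions. A minimal sketch that sets them explicitly, using a small synthetic array (`toy_data` and `kmeans_explicit` are hypothetical names standing in for the standardised mobility data and the fitted estimator):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the standardised data: 60 rows, 6 columns,
# drawn around three different centres
rng = np.random.default_rng(0)
toy_data = np.vstack(
    [rng.normal(loc=centre, scale=0.5, size=(20, 6)) for centre in (-3, 0, 3)]
)

# Spell out the settings instead of relying on version-dependent defaults
kmeans_explicit = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0)
kmeans_explicit.fit(toy_data)

print(kmeans_explicit.labels_.shape)  # one label per row
print(kmeans_explicit.n_iter_)        # iterations used by the best run
```

Writing the arguments out this way may make results easier to reproduce across library versions.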
You can access an estimator’s learned parameters through attributes that end with an underscore ‘_’. For example, the attribute labels_ shows the cluster each observation or sample (in our example, county) belongs to. To access the cluster labels, type the name of your k-means object (which we called ‘kmeans’) followed by a ‘.’ and the labels_ attribute.
kmeans.labels_
array([2, 0, 1, 0, 1, 1, 1, 2, 0, 2, 0, 1, 0, 2, 0, 2, 2, 2, 0, 2, 2, 1,
1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
2, 0, 0, 1, 1, 0, 2, 0, 2, 0, 1, 0, 0, 1, 2, 1, 0, 1, 1, 0, 0, 1,
0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,
0, 1, 0, 1, 2, 0, 2, 1, 0, 0, 0, 0, 1, 2, 1, 0, 0, 1, 1, 2, 1, 0,
2, 1, 0, 2, 0, 0, 2, 1, 0, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0,
0, 0, 0, 0, 2, 2, 0, 0, 2], dtype=int32)
The cluster labels indicate that, for example, the first county, Aberdeen City, is assigned to cluster 2, the second county, Aberdeenshire, to cluster 0, and so on.
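To see how many counties fall into each cluster, the labels can be counted. The sketch below uses only the first ten labels from the output above (`first_labels` is an illustrative name):

```python
import numpy as np

# First ten cluster labels from the output above
first_labels = np.array([2, 0, 1, 0, 1, 1, 1, 2, 0, 2])

# np.unique returns the distinct labels and how often each occurs
cluster_ids, cluster_sizes = np.unique(first_labels, return_counts=True)
for cluster_id, size in zip(cluster_ids, cluster_sizes):
    print(f"cluster {cluster_id}: {size} counties")
```

Passing kmeans.labels_ instead of `first_labels` would give the cluster sizes for all 141 counties.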
You can also access the coordinates of the cluster centers using the cluster_centers_ attribute. This shows the mean of the points in each cluster for each of the six variables.
kmeans.cluster_centers_
array([[-0.06582254, -0.20255281, -0.37011903, -0.23552421, 0.08304705,
0.02092666],
[ 0.78930156, 0.86012023, 0.41426658, 0.911355 , 0.70390283,
-0.80759439],
[-1.32044187, -1.1068233 , 0.15879976, -1.11968451, -1.53739783,
1.4688833 ]])
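One way to see what cluster_centers_ holds is to recompute the centers by hand as the mean of the points assigned to each cluster. The sketch below does this on synthetic, well-separated data (`toy_data`, `toy_kmeans` and `manual_centres` are hypothetical names), assuming the algorithm has converged so that the assignments are stable:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three well-separated groups of 15 points in 2 dimensions
rng = np.random.default_rng(1)
toy_data = np.vstack(
    [rng.normal(loc=centre, scale=0.3, size=(15, 2)) for centre in (-4, 0, 4)]
)

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy_data)

# Recompute each centre as the column-wise mean of the points assigned to it
manual_centres = np.array(
    [toy_data[toy_kmeans.labels_ == j].mean(axis=0) for j in range(3)]
)

# Compare the hand-computed centres with the ones stored by the estimator
print(np.allclose(manual_centres, toy_kmeans.cluster_centers_))
```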
You can include the cluster assignment as a column in your original DataFrame. Let’s name the new column ‘clusters’.
mobility_trends_UK_mean_NaNdrop["clusters"] = kmeans.labels_
mobility_trends_UK_mean_NaNdrop
sub_region_1 | Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential | clusters
---|---|---|---|---|---|---|---
Aberdeen City | -50.046371 | -10.722567 | 20.557692 | -46.127016 | -42.489919 | 14.567010 | 2
Aberdeenshire | -28.253669 | -11.248447 | 22.474684 | -39.953878 | -37.207661 | 12.222222 | 0
Angus Council | -25.955975 | -6.125786 | 13.982143 | -31.150943 | -33.542339 | 10.831551 | 1
Antrim and Newtownabbey | -29.377358 | -7.465409 | -29.134328 | -53.752621 | -33.679435 | 12.859031 | 0
Ards and North Down | -27.262055 | 0.452830 | 6.838298 | -41.721311 | -35.991935 | 12.679039 | 1
... | ... | ... | ... | ... | ... | ... | ...
Windsor and Maidenhead | -42.714885 | -11.178197 | 0.379455 | -43.693920 | -43.711694 | 16.709220 | 2
Wokingham | -39.044025 | -16.285115 | 30.458101 | -51.299790 | -45.034274 | 18.237327 | 2
Worcestershire | -36.025497 | -9.990563 | 26.511954 | -34.033107 | -33.779758 | 12.112019 | 0
Wrexham Principal Area | -42.293501 | -10.448637 | -1.860140 | -38.511530 | -31.306452 | 11.113895 | 0
York | -41.892276 | -12.343621 | 2.055319 | -47.364729 | -44.381048 | 14.039666 | 2
141 rows × 7 columns
Choosing the optimal number of clusters¶
In the example above, our choice of the number of clusters, k, was arbitrary. Let’s find a more informative method of choosing the optimal k for our data. One such method is the Elbow method for choosing optimal k.
Using the Elbow method, we run the k-means algorithm with various values of k and plot each value of k against the sum of squared distances between each data point (county in the UK) and its cluster centroid. For k = 1, all data points are assigned to the same cluster, which gives the highest sum of squared distances. As k increases, the sum of squared distances decreases, and it reaches zero when k equals the number of data points because each data point is then assigned to its own cluster.
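The two extremes described above can be checked with a few lines of NumPy. This is a sketch on hypothetical random data; the helper `sum_of_squared_distances` is written here for illustration and is not part of scikit-learn:

```python
import numpy as np


def sum_of_squared_distances(data, labels):
    """Total squared distance from each row to the mean of its cluster."""
    total = 0.0
    for cluster_id in np.unique(labels):
        members = data[labels == cluster_id]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total


rng = np.random.default_rng(0)
toy_data = rng.normal(size=(8, 2))  # hypothetical data: 8 points, 2 variables

one_cluster = np.zeros(8, dtype=int)  # k = 1: every point in cluster 0
own_cluster = np.arange(8)            # k = 8: every point is its own cluster

print(sum_of_squared_distances(toy_data, one_cluster))  # total variation around the overall mean
print(sum_of_squared_distances(toy_data, own_cluster))  # 0.0
```

Any value of k in between gives a sum between these two extremes when the centroids are the cluster means.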
We perform multiple runs of the k-means clustering algorithm using a for loop.
Performing a for loop¶
A for loop is used to repeatedly execute a block of code, and is a perfect fit for repeatedly running the k-means algorithm. The for loop will iterate over a sequence of k values, and for each value of k will fit the k-means algorithm.
Let’s first look at a simple example of a for loop:
for number in range(1, 4):
    print(number)
1
2
3
In this example (and in for loops in general), there are two parts:
The for loop statement, which in this example is ‘for number in range(1, 4):’
  number is the variable name; we could have specified a different variable name;
  range(1, 4) specifies the set of values to loop or iterate over; range(1, 4) is the range of numbers 1, 2, 3. The first argument (1) is the starting point, and the second argument (4) is the endpoint (not included in the range);
  the word ‘in’ connects the two components in the for loop statement;
  the for loop statement ends in a colon ‘:’.
The loop body, which contains the code to be executed at each iteration of the for loop. Each line in the loop body is indented four spaces, and this indentation is how the interpreter knows whether a line is part of the loop. In our example, ‘print(number)’ is the loop body.
At each iteration of the for loop, the variable number is assigned the next number in the range from 1 to 3, and then the value of number is printed. The loop runs once for each number in the sequence from 1 to 3, so the loop body ‘print(number)’ executes 3 times.
This loop description draws on Real Python’s book Python Basics (pages 153–154) and on Kaggle’s Python tutorial.
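The same pattern can also collect one result per iteration in a list, which is how a loop is typically combined with a model that is fitted repeatedly. A small sketch (the names `squares` and `value` are illustrative):

```python
squares = []  # start with an empty list
for value in range(1, 4):  # value takes the values 1, 2, 3
    squares.append(value ** 2)  # store one result per iteration
print(squares)  # [1, 4, 9]
```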
Choosing k via a for loop¶
We are now ready to apply the for loop to the k-means algorithm.
In the code below, the for loop statement is ‘for k in K’, where k is the variable name and K is the set of values ranging from 1 to 30. The loop body contains the three statements that initialise the KMeans estimator, fit it, and store its output. Each line in the loop body is indented four spaces. The loop will run 30 times, so all three statements will be executed 30 times.
# Run the k-means algorithm for values of k between 1 and 30
Sum_of_squared_distances = []  # Initialise a list
K = range(1, 31)  # range with a starting point 1 and endpoint 31, which is not included in the range
for k in K:  # a for loop iterating over values of k ranging from 1 to 30
    kmeans = KMeans(n_clusters=k)  # Initialise the KMeans estimator for a value of k
    kmeans.fit(mobility_trends_UK_standardised)  # Fit the KMeans estimator with the fit() method
    # Store the sum of squared distances (available in kmeans.inertia_)
    # for each run using the list append() method.
    Sum_of_squared_distances.append(kmeans.inertia_)
Sum_of_squared_distances
[846.0,
547.7080256569716,
429.7112613888795,
365.4782968827096,
328.8362897853743,
301.72829951678796,
281.05116080411847,
262.8764097559398,
251.88350703120256,
239.34973620094246,
223.86683466112927,
214.68357391767273,
209.50534998708247,
197.3101329363527,
189.8774747310233,
183.84087640511768,
177.90242231870533,
171.01789694172618,
162.46276850451392,
159.95180340703382,
152.35221938560662,
145.9763201437044,
142.81737201523777,
137.44933254954873,
134.80252297181383,
129.93053030365422,
123.79731323466231,
119.02326607285883,
114.9870769137471,
111.87382179317166]
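The first value in the list, 846.0, can be checked by hand if we assume the data were standardised so that each column has mean 0 and population variance 1. With k = 1 the single centroid is the overall mean, so the sum of squared distances equals the number of rows times the number of columns, here 141 × 6 = 846. A sketch of this check on random data of the same shape (`toy_raw` is hypothetical and unrelated to the mobility data):

```python
import numpy as np

rng = np.random.default_rng(0)
toy_raw = rng.normal(loc=-30, scale=8, size=(141, 6))  # same shape as the mobility data

# Standardise each column: subtract the mean, divide by the population standard deviation
toy_standardised = (toy_raw - toy_raw.mean(axis=0)) / toy_raw.std(axis=0)

# With k = 1, the centroid is the column-wise mean (zero after standardising)
centroid = toy_standardised.mean(axis=0)
inertia_k1 = ((toy_standardised - centroid) ** 2).sum()
print(round(inertia_k1, 6))  # 846.0
```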
Elbow plot¶
Let’s plot k against the sum of squared distances. The plot below shows how the sum of squared distances varies with values of k between 1 and 30.
# Plot size
plt.figure(figsize=(8.2, 5.8))
# Generate the plot
grid = sns.lineplot(x=K, y=Sum_of_squared_distances)
# Add x and y labels
labels = grid.set(xlabel="Number of clusters, k", ylabel="Total squared distances")
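Since an elbow can be hard to spot by eye, it may help to look at how much the sum of squared distances drops each time k increases by one. The sketch below uses the first eight values from the list printed earlier, rounded to one decimal place (`first_values` is an illustrative name):

```python
# Sum of squared distances for k = 1..8, rounded from the output above
first_values = [846.0, 547.7, 429.7, 365.5, 328.8, 301.7, 281.1, 262.9]

# Drop in the sum of squared distances when moving from k to k + 1
for k, (current, following) in enumerate(zip(first_values, first_values[1:]), start=1):
    drop = current - following
    print(f"k = {k} -> {k + 1}: decrease of {drop:.1f}")
```

The decreases shrink steadily rather than breaking sharply at one value of k, so the choice of k from these numbers remains a judgment call.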
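For our data set, the elbow of the curve (where the curve “bends”) is not clear-cut, but the total squared distances decrease more slowly after k = 4. So we rerun our k-means algorithm with k = 4. Here we call fit_transform(), which fits the model and returns each county’s distance to each of the four cluster centers.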
k = 4
kmeans_k4 = KMeans(
    n_clusters=k, init="k-means++", n_init=10, max_iter=300, random_state=0
)
kmeans_k4.fit_transform(mobility_trends_UK_standardised)
array([[5.04566964, 4.50949325, 3.01211976, 1.22287598],
[2.47693818, 2.62868693, 1.62080111, 3.45425377],
[1.3003252 , 2.26931091, 2.42065505, 4.87648492],
[3.78652204, 1.69736822, 2.80123963, 4.08388795],
[2.43619727, 2.27100614, 2.6755362 , 4.50826352],
[2.35336613, 4.06071712, 4.12945515, 6.48256892],
[1.19812634, 1.69699685, 2.36632043, 4.95119228],
[5.63435577, 5.12630547, 3.47305719, 1.24210402],
[3.31527914, 2.26316156, 1.36184738, 2.64584341],
[3.91107752, 2.74699518, 1.73162013, 1.61362137],
[3.23295316, 1.57734068, 2.87257396, 4.69992362],
[1.42323139, 2.99252444, 3.2625403 , 5.80151349],
[1.9868273 , 1.08868625, 1.36774609, 3.78215571],
[4.81014859, 4.0108717 , 2.66323944, 0.7956978 ],
[2.42377329, 1.54752186, 1.22286252, 3.2322439 ],
[4.95676567, 4.57502553, 2.8127199 , 0.88207805],
[6.0778808 , 5.76623296, 3.96517641, 1.80031318],
[4.00324484, 3.53155206, 1.92038439, 1.39348625],
[2.21221937, 1.33299872, 1.57762967, 3.85001559],
[3.83037626, 3.38499741, 1.64938171, 1.2633768 ],
[6.49018296, 5.30264079, 4.25308026, 1.96176341],
[2.13250588, 2.57142549, 2.56811458, 4.62071668],
[2.16062392, 3.06827114, 4.01195094, 6.43820377],
[2.16336886, 2.86141265, 1.55576949, 3.30182236],
[2.58694331, 2.26023023, 1.66979223, 3.49667667],
[2.54594373, 2.69751661, 0.66519137, 2.43677741],
[2.75980801, 2.43637953, 0.5873986 , 2.189648 ],
[1.84810182, 3.7594953 , 3.49769804, 5.63411847],
[3.36500349, 5.42478723, 4.5098375 , 6.41226609],
[0.89475272, 2.49282263, 1.6441581 , 4.2259671 ],
[0.82299406, 3.10233895, 2.49617003, 4.85241859],
[3.0540805 , 1.11916236, 1.8278122 , 3.55587669],
[1.29114706, 1.85234 , 2.62858964, 5.1557304 ],
[2.95340139, 2.84462279, 1.35277911, 2.74014982],
[1.43635026, 3.04140168, 1.94487335, 4.08642567],
[1.88750728, 1.49787415, 2.70557177, 5.035078 ],
[1.27540103, 3.79336631, 3.03485238, 5.34532345],
[2.10744321, 4.26357273, 2.81376763, 4.65089272],
[1.22916785, 3.22630651, 3.41029138, 5.99173615],
[3.20253825, 2.59097533, 1.28578913, 2.34702054],
[2.14532941, 2.4787089 , 2.42601355, 4.19214776],
[4.98826038, 3.50358604, 4.56193468, 5.27819359],
[1.67390117, 4.13639275, 3.12392108, 5.36639476],
[1.26486256, 3.07494385, 1.78324735, 4.04594693],
[7.70847761, 6.70218335, 5.51864264, 2.90076983],
[2.74098553, 2.6120279 , 1.0038274 , 2.53682572],
[2.87919726, 2.19946653, 1.14378677, 2.85199774],
[2.58340653, 1.98121907, 3.53222376, 5.80422916],
[1.05929409, 2.29986322, 1.62101985, 4.1115384 ],
[2.77019801, 1.52054039, 1.35405766, 3.22529125],
[5.92350259, 4.76471804, 3.64893739, 1.632762 ],
[2.13467954, 2.69196149, 0.919198 , 2.94858772],
[6.07288086, 5.58202021, 3.97115996, 1.46266203],
[2.79554158, 2.68948577, 0.58751824, 2.26048492],
[0.89842303, 2.91376263, 2.94727422, 5.4733001 ],
[2.51033563, 2.87114994, 1.04103308, 2.63144655],
[3.69897039, 2.42715675, 3.69509245, 5.52416318],
[1.94330508, 2.86723277, 2.22832533, 4.63689781],
[3.99699334, 3.47763129, 1.9320706 , 1.2443638 ],
[1.56568879, 3.34530856, 2.33706415, 4.64591563],
[3.70312069, 1.8928401 , 3.47236901, 4.91732907],
[1.33156857, 2.35253708, 3.08158649, 5.65230424],
[1.50516965, 3.79954956, 3.7845444 , 6.33227007],
[1.81494689, 2.32581749, 0.82610413, 3.19904648],
[2.16612644, 2.32256385, 1.99274074, 4.35574941],
[1.35885683, 2.99160696, 1.67474588, 4.06017201],
[3.7527546 , 3.77331365, 2.0297521 , 2.75979523],
[2.21222118, 2.31396699, 0.80377903, 2.80661356],
[1.12327546, 3.30440273, 3.25704519, 5.83380732],
[3.58536247, 1.85828087, 2.61320207, 3.63128213],
[4.12389682, 2.57491558, 2.94590157, 4.20102445],
[2.15064991, 2.39589618, 0.64530149, 2.80374165],
[2.15938102, 2.15279659, 0.40980959, 2.90018924],
[3.9055487 , 2.22732609, 4.26824804, 6.26538848],
[1.83808668, 1.70862451, 2.80959415, 5.06294459],
[2.91013144, 1.93948426, 2.43370615, 4.45258233],
[2.2422943 , 2.29171174, 2.06990336, 3.72582045],
[3.77560688, 3.17735838, 1.5496857 , 1.79343351],
[3.26910562, 2.68487471, 1.89399813, 3.05899099],
[1.99787171, 2.59533878, 2.64601988, 5.12835041],
[3.3462147 , 3.36015143, 3.99890849, 5.7799169 ],
[2.76663745, 2.01683317, 0.63335189, 2.48998643],
[1.75662781, 2.62028536, 3.33402792, 5.67671653],
[0.93884667, 3.00363082, 2.2563041 , 4.79080907],
[2.27849558, 2.83721861, 3.82212792, 6.03725973],
[1.9742459 , 3.41599717, 3.62245039, 6.17410372],
[3.51604344, 1.06343546, 2.82994931, 4.36761373],
[1.59126789, 2.64219696, 3.25675522, 5.82599461],
[3.14594168, 3.05188033, 1.58218108, 2.81552348],
[1.20964787, 3.04416457, 2.12956719, 4.36785299],
[1.95147469, 1.57197865, 1.29177045, 3.68011088],
[1.13994117, 3.50176752, 2.69812049, 4.9377037 ],
[5.4068497 , 5.28949581, 3.56698561, 2.73488982],
[1.7800033 , 2.25370853, 1.13622088, 3.50381084],
[3.99400257, 3.43832344, 1.85154615, 1.20925628],
[1.90180548, 3.71230084, 3.80583123, 6.27537405],
[2.75442991, 2.84043816, 0.94135893, 2.50916025],
[2.96276092, 1.56083336, 2.14888028, 4.11439899],
[2.6067065 , 3.30191569, 1.67113275, 3.41168003],
[3.16743305, 3.0980149 , 1.42217789, 2.35168104],
[1.31272557, 2.2486955 , 1.68569781, 4.32472091],
[6.22696831, 5.3805621 , 4.07644653, 1.74208451],
[1.94738474, 2.22118897, 2.48769421, 4.81700582],
[4.06632173, 3.14254286, 2.9194489 , 3.4054936 ],
[2.32398421, 1.17895364, 1.38137454, 3.58144638],
[1.35635022, 2.16501713, 1.98342243, 4.5314781 ],
[0.98334406, 2.51141366, 1.9751961 , 4.5795432 ],
[4.51899203, 3.73359642, 2.54591335, 2.43381982],
[1.46008785, 3.49327828, 2.88111841, 5.20078181],
[1.75263745, 1.57317036, 1.54657994, 3.88408702],
[3.38842338, 3.7252572 , 1.75606989, 2.26193643],
[1.64889901, 2.98158215, 2.05832151, 4.22109929],
[1.89343999, 2.24161257, 0.93183205, 3.44686497],
[4.38759674, 3.66805042, 2.3297036 , 1.61077625],
[3.07350247, 3.12443954, 1.6452764 , 2.5612204 ],
[1.48812828, 2.23168941, 1.17636061, 3.68707324],
[3.80108261, 3.21136222, 1.97261914, 1.92152769],
[1.19751844, 1.8954884 , 1.89520261, 4.40973804],
[2.31066106, 2.32539551, 1.76530322, 4.04419728],
[1.26355828, 2.64443275, 1.48693168, 3.91849876],
[4.69564833, 4.21611107, 2.65913276, 1.07012616],
[3.21794734, 3.13735726, 1.14359739, 2.10922838],
[3.65356191, 2.36798452, 1.57421401, 2.36670423],
[2.70708072, 2.12178377, 1.37615431, 3.08978484],
[1.6824539 , 3.97332374, 3.63315124, 5.96153282],
[2.61632817, 2.54788451, 0.62779367, 2.36014248],
[2.66351031, 2.62405874, 2.28809987, 3.84636431],
[2.78823738, 3.04083381, 0.8450561 , 2.36798068],
[2.42425876, 2.38244569, 0.86040971, 2.64395037],
[4.4776361 , 3.84687199, 2.45234533, 1.36333137],
[3.86505674, 1.67654658, 3.56763307, 4.98620071],
[2.85933146, 1.65128535, 2.16534265, 3.59976031],
[2.74228878, 2.38805378, 0.60926247, 2.51372308],
[2.84256801, 3.21854111, 1.35780561, 2.74091495],
[2.46130284, 2.37171651, 0.33238111, 2.61227263],
[2.24577048, 2.45421304, 1.00646126, 3.14919793],
[5.20285954, 4.25485027, 3.21405765, 1.42658464],
[6.1704489 , 5.59614227, 4.1646318 , 2.14453473],
[1.83319912, 2.43144008, 0.84560399, 3.29084402],
[2.81392292, 2.09360832, 1.69170905, 3.53257811],
[4.70707215, 3.77226267, 2.59610041, 1.0732578 ]])
Let’s view the cluster to which each observation or sample (in our example, UK county) in our data set was assigned:
kmeans_k4.labels_
array([3, 2, 0, 1, 1, 0, 0, 3, 2, 3, 1, 0, 1, 3, 2, 3, 3, 3, 1, 3, 3, 0,
0, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 2, 0, 1, 0, 0, 0, 2, 0, 1, 0, 0,
3, 2, 2, 1, 0, 2, 3, 2, 3, 2, 0, 2, 1, 0, 3, 0, 1, 0, 0, 2, 2, 0,
2, 2, 0, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0,
2, 0, 2, 0, 3, 2, 3, 0, 2, 1, 2, 2, 0, 3, 0, 2, 1, 0, 0, 3, 0, 2,
2, 0, 2, 3, 2, 2, 3, 0, 2, 0, 3, 2, 2, 2, 0, 2, 2, 2, 2, 3, 1, 1,
2, 2, 2, 2, 3, 3, 2, 2, 3], dtype=int32)
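The labels and the distance array printed earlier are linked: each county is assigned to the cluster whose center is nearest. A quick check using the first five rows of the distance output above (`first_distances` is an illustrative name):

```python
import numpy as np

# First five rows of the fit_transform() output above:
# distance from each county to each of the four cluster centres
first_distances = np.array(
    [
        [5.04566964, 4.50949325, 3.01211976, 1.22287598],
        [2.47693818, 2.62868693, 1.62080111, 3.45425377],
        [1.3003252, 2.26931091, 2.42065505, 4.87648492],
        [3.78652204, 1.69736822, 2.80123963, 4.08388795],
        [2.43619727, 2.27100614, 2.6755362, 4.50826352],
    ]
)

# The nearest centre is the column with the smallest distance in each row
nearest = first_distances.argmin(axis=1)
print(nearest)  # [3 2 0 1 1]
```

These values match the first five entries of kmeans_k4.labels_.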
Here are the centers of the four detected clusters:
kmeans_k4.cluster_centers_
array([[ 0.77303297, 0.7952798 , 0.53804385, 0.9729365 , 0.71215567,
-0.83613574],
[ 0.46188499, 0.41371826, -1.65713325, -0.18743052, 0.55582752,
-0.21609699],
[-0.210301 , -0.34287992, 0.15765022, -0.24250778, -0.09133243,
0.11912676],
[-1.40669657, -1.12453327, 0.10615267, -1.1449252 , -1.62755955,
1.50369502]])
K-Means is an iterative algorithm: a set of operations is repeated until the sum of squared distances from each observation to its cluster centroid is (locally) minimised and the cluster assignment no longer changes. How many iterations did the algorithm need to converge in our case?
kmeans_k4.n_iter_
11
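To make the idea of iterating until the assignments stop changing concrete, here is a minimal sketch of the classic assign-then-update loop (Lloyd's algorithm) in NumPy. It is not scikit-learn's implementation: it starts from randomly chosen rows rather than k-means++, performs a single run, and the function `simple_kmeans` and the data `toy_data` are hypothetical.

```python
import numpy as np


def simple_kmeans(data, k, max_iter=300, seed=0):
    """Plain Lloyd's algorithm; returns labels, centroids and iterations used."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (not the k-means++ scheme)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each row goes to its nearest centroid
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments no longer change: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its rows
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centroid in place
                centroids[j] = data[labels == j].mean(axis=0)
    return labels, centroids, iteration


# Hypothetical data: two well-separated groups of 30 points each
rng = np.random.default_rng(1)
toy_data = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in (-3, 3)])

labels, centroids, n_iterations = simple_kmeans(toy_data, k=2)
print(n_iterations)  # number of passes until the assignments stopped changing
```

The returned iteration count plays a role similar to n_iter_, although the exact stopping rules in scikit-learn differ.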
As we did earlier, we add the cluster assignment as a column to our DataFrame. We name the column ‘clusters_k4’.
# Add the 4-cluster assignment to your DataFrame
mobility_trends_UK_mean_NaNdrop["clusters_k4"] = kmeans_k4.labels_
mobility_trends_UK_mean_NaNdrop
sub_region_1 | Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential | clusters | clusters_k4
---|---|---|---|---|---|---|---|---
Aberdeen City | -50.046371 | -10.722567 | 20.557692 | -46.127016 | -42.489919 | 14.567010 | 2 | 3
Aberdeenshire | -28.253669 | -11.248447 | 22.474684 | -39.953878 | -37.207661 | 12.222222 | 0 | 2
Angus Council | -25.955975 | -6.125786 | 13.982143 | -31.150943 | -33.542339 | 10.831551 | 1 | 0
Antrim and Newtownabbey | -29.377358 | -7.465409 | -29.134328 | -53.752621 | -33.679435 | 12.859031 | 0 | 1
Ards and North Down | -27.262055 | 0.452830 | 6.838298 | -41.721311 | -35.991935 | 12.679039 | 1 | 1
... | ... | ... | ... | ... | ... | ... | ... | ...
Windsor and Maidenhead | -42.714885 | -11.178197 | 0.379455 | -43.693920 | -43.711694 | 16.709220 | 2 | 3
Wokingham | -39.044025 | -16.285115 | 30.458101 | -51.299790 | -45.034274 | 18.237327 | 2 | 3
Worcestershire | -36.025497 | -9.990563 | 26.511954 | -34.033107 | -33.779758 | 12.112019 | 0 | 2
Wrexham Principal Area | -42.293501 | -10.448637 | -1.860140 | -38.511530 | -31.306452 | 11.113895 | 0 | 2
York | -41.892276 | -12.343621 | 2.055319 | -47.364729 | -44.381048 | 14.039666 | 2 | 3
141 rows × 8 columns
Our next step will be to assess the way in which clusters are similar or different with respect to each mobility category. To accomplish this, we plot the clusters against each mobility category using the Seaborn function catplot.
# Create a list of the six mobility categories
mobility_categories = [
    "Retail_Recreation",
    "Grocery_Pharmacy",
    "Parks",
    "Transit_stations",
    "Workplaces",
    "Residential",
]
# Use a for loop to plot the clusters across the six mobility categories
for mobility_category in mobility_categories:
    sns.catplot(
        x="clusters_k4",
        y=mobility_category,
        kind="swarm",
        data=mobility_trends_UK_mean_NaNdrop,
    )
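A numeric summary can complement the plots: the mean of each mobility category within each cluster. The sketch below uses a hypothetical miniature DataFrame (`toy_frame`, with made-up values and county names) rather than the real data:

```python
import pandas as pd

# Hypothetical miniature of the DataFrame: two mobility categories, four counties
toy_frame = pd.DataFrame(
    {
        "Workplaces": [-42.0, -44.0, -33.0, -31.0],
        "Residential": [14.0, 16.0, 11.0, 12.0],
        "clusters_k4": [3, 3, 2, 2],
    },
    index=["County A", "County B", "County C", "County D"],
)

# Mean of every mobility category within each cluster
cluster_means = toy_frame.groupby("clusters_k4").mean()
print(cluster_means)
```

On the full DataFrame, one would probably drop the earlier ‘clusters’ column first, since averaging cluster labels is not meaningful.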
Dimensionality Reduction via PCA¶
The above plots visualise each variable separately. A more informative approach would be to take into account all six dimensions simultaneously. However, there is a difficulty in visualising and perceiving multidimensional data beyond two or three dimensions. One solution is to use the dimensionality reduction technique Principal Component Analysis (PCA).
We can apply PCA to reduce the six mobility trends to just 2 dimensions, and then use those 2-dimensional approximations to visualise our clusters using a scatter plot.
The sklearn library is very consistent, so the workflow we used to run k-means applies to PCA too. We first initialise the PCA estimator using the default arguments except for n_components, where we specify that only 2 components are kept. Then we fit the estimator using the fit() method. Below we use the fit_transform() method to simultaneously fit the estimator to the data and apply the dimensionality-reduction transformation to the data.
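To give a sense of what the transformation does, the sketch below computes a 2-component projection with a singular value decomposition in NumPy. This is one standard way to obtain principal components and is meant as an illustration, not as scikit-learn's exact code path; component signs may differ from library output, and `toy_data` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_data = rng.normal(size=(50, 6))   # hypothetical data: 50 rows, 6 variables
toy_data[:, 1] += 2 * toy_data[:, 0]  # make two variables strongly correlated

# Centre each column, then take the singular value decomposition
centred = toy_data - toy_data.mean(axis=0)
U, singular_values, Vt = np.linalg.svd(centred, full_matrices=False)

# Project onto the first two principal directions (the first two rows of Vt)
projected = centred @ Vt[:2].T
print(projected.shape)  # (50, 2)

# Share of the total variance captured by each of the two components
explained_ratio = singular_values[:2] ** 2 / (singular_values ** 2).sum()
print(explained_ratio)
```

The two columns of `projected` are uncorrelated, and the first captures at least as much variance as the second.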
# We reuse our standardised data set
mobility_trends_UK_standardised
array([[-2.41537629e+00, -5.87410490e-01, 2.10134074e-01,
-7.91288930e-01, -1.67445518e+00, 1.27693514e+00],
[ 1.31187764e+00, -6.92727365e-01, 2.89076997e-01,
-1.75449527e-01, -4.93614030e-01, 5.60796729e-02],
[ 1.70485729e+00, 3.33177282e-01, -6.06512500e-02,
7.02741514e-01, 3.25763531e-01, -6.67998047e-01],
[ 1.11969050e+00, 6.48938482e-02, -1.83621506e+00,
-1.55202810e+00, 2.95115745e-01, 3.87645364e-01],
[ 1.48147561e+00, 1.65066306e+00, -3.54839347e-01,
-3.51770707e-01, -2.21840285e-01, 2.93929581e-01],
[ 2.00258288e+00, 2.11421107e-01, 3.05692453e-01,
2.65069849e+00, 3.83734106e-01, -1.67810664e+00],
[ 1.09194483e+00, 6.58562770e-01, -4.56719269e-01,
4.51770033e-01, 9.42324863e-01, -8.67190528e-01],
[-2.21396518e+00, -1.85507507e+00, 4.28819439e-01,
-9.18325110e-01, -2.03772158e+00, 1.31997203e+00],
[-3.28335111e-01, -9.81312186e-01, -4.81345232e-01,
-8.88417676e-01, 5.17312190e-01, 6.06094340e-01],
[-7.31534167e-01, -7.40711678e-01, -7.39828984e-01,
-1.01490455e+00, -1.09824341e+00, 5.09791231e-01],
[-4.17089442e-01, -2.39916438e-01, -2.18071876e+00,
4.71429465e-01, 1.22085915e+00, -5.87127887e-01],
[ 6.34919974e-01, 6.12376786e-01, -9.35957607e-02,
1.92957377e+00, 1.28576034e+00, -1.41051636e+00],
[ 4.67831907e-01, -5.43432333e-02, -6.76448867e-01,
-1.72103240e-01, 6.05199222e-01, -2.57406512e-01],
[-9.44169824e-01, -8.01254388e-01, -3.82735021e-02,
-1.10237086e+00, -1.48290652e+00, 2.02436811e+00],
[-2.07332963e-01, 5.09093962e-01, -4.33669270e-01,
-7.48919361e-01, 3.03679097e-01, 3.32261862e-02],
[-1.08262990e+00, -1.54858287e+00, 6.84139397e-01,
-1.05465004e+00, -1.94172190e+00, 1.27498524e+00],
[-2.28479278e+00, -1.54696956e+00, 1.40169462e+00,
-1.82849460e+00, -1.80585029e+00, 1.84153895e+00],
[-4.13371668e-01, -1.19533632e+00, 2.14491224e-01,
-4.03248568e-01, -1.00633215e+00, 1.55218767e+00],
[ 9.14953837e-01, -4.32207224e-01, -7.37812544e-01,
-1.56417523e-01, 4.56918025e-01, -2.04746545e-01],
[-9.97292725e-01, -9.48753868e-01, 2.41068305e-01,
-4.28912502e-01, -7.31326728e-01, 1.25170028e+00],
[-2.34543767e+00, -1.84777037e+00, -9.86176808e-01,
-1.78656312e+00, -2.53573545e+00, 1.38310243e+00],
[-3.73703914e-01, 1.96974821e+00, 1.42562908e-01,
-3.20385557e-01, 6.54325820e-01, -6.92023258e-01],
[ 1.77154910e+00, 2.35265039e+00, -4.67063822e-01,
1.03247984e+00, 1.14649320e+00, -1.04451564e+00],
[ 2.80203585e-01, 9.05448548e-02, 6.35362397e-01,
7.94229726e-01, -1.33953253e-01, 9.48051851e-01],
[ 4.42795821e-02, -9.80157490e-02, -3.14236903e-01,
-5.89667840e-01, -7.02608759e-01, -1.27607217e+00],
[-2.31326176e-01, -5.18900999e-01, 4.54885264e-01,
8.00835190e-02, -4.13592167e-01, 4.57931896e-01],
[-5.46873897e-01, -3.50548290e-01, 4.54961929e-02,
-2.41026933e-01, -5.12150647e-01, 3.24134813e-01],
[ 5.01536367e-01, 2.45845230e+00, 1.11553059e+00,
1.15950961e+00, 2.57256716e-01, -8.16295581e-01],
[ 2.71938293e+00, -3.24968745e-01, 2.62289434e+00,
1.64105820e+00, -4.13853194e-01, -1.30424830e+00],
[ 4.13223226e-01, 1.26185094e-01, 6.32126041e-01,
7.13201592e-01, 7.01723227e-01, -4.52718023e-01],
[ 8.30740701e-01, 7.93206094e-01, 8.85967884e-01,
1.32457083e+00, 5.78600973e-02, -8.01723499e-01],
[ 2.78207349e-03, -3.91481883e-01, -1.42321175e+00,
-7.68578794e-01, 5.60128950e-01, -1.85592263e-01],
[ 9.99214987e-01, 1.14264592e+00, -5.03484910e-01,
4.16424882e-01, 7.87002143e-01, -1.14459949e+00],
[-5.63568276e-01, 1.58499239e-01, 6.15121148e-01,
-1.32519642e+00, -1.59643308e-01, -1.41011204e-01],
[-2.64686613e-01, 5.73815584e-01, 9.02861806e-01,
1.24289248e+00, 2.63312668e-01, -1.08285957e-01],
[ 7.36033354e-01, 1.51757097e+00, -9.55616532e-01,
7.97048121e-02, 8.03057720e-01, -7.86508306e-01],
[ 7.98585653e-01, 7.63343834e-01, 1.53908709e+00,
1.74000469e+00, 6.02532075e-01, -6.86214329e-01],
[ 1.93697825e-01, 5.74194792e-01, 2.32295345e+00,
1.17977073e+00, -4.22511086e-02, -3.27218433e-01],
[ 7.49658561e-01, 1.32402063e+00, 4.11615487e-01,
1.43306852e+00, 1.31550672e+00, -1.63551190e+00],
[-6.58641059e-01, -9.62056508e-01, -4.01955754e-01,
-2.56596972e-01, -8.50119888e-01, -3.04841967e-01],
[ 6.38522267e-01, 9.02912210e-01, -4.52383291e-01,
1.45879310e+00, -5.36430789e-01, 5.04308110e-01],
[ 2.25412405e+00, 1.26062345e+00, -2.70119627e+00,
-2.28236281e-01, -1.43287852e+00, 1.60023489e+00],
[ 6.18399411e-01, 6.69405557e-01, 2.13101597e+00,
1.34271879e+00, 9.37555002e-01, -6.43560376e-01],
[ 3.89600817e-01, 4.26595899e-01, 1.26236199e+00,
6.67435614e-01, 1.14903272e-01, -2.51113270e-01],
[-2.93314794e+00, -1.54457592e+00, -4.06682329e-01,
-2.27125929e+00, -3.40943264e+00, 2.59972243e+00],
[ 2.80645148e-01, -1.58584033e-01, 4.77542547e-01,
-7.50851189e-01, -3.29725724e-01, 6.80465213e-01],
[ 5.45997570e-01, -8.83124919e-01, -1.77862601e-01,
-8.02041658e-01, -4.92211395e-02, 2.49417221e-01],
[ 5.07990413e-01, 1.81230488e+00, -1.59848407e+00,
3.01359819e-01, 1.28981667e+00, -1.30511626e+00],
[ 6.41032672e-01, 7.25393078e-01, 5.87751098e-01,
1.37225816e-01, 4.42522634e-01, -2.64987231e-01],
[ 2.46243012e-01, -8.25605623e-01, -9.66875108e-01,
-3.21222129e-01, 7.15671917e-02, -1.88649381e-01],
[-2.16691003e+00, -1.47356459e+00, -6.04721112e-01,
-2.26470128e+00, -1.72079496e+00, 1.05846299e+00],
[-2.14333635e-01, -3.06264997e-01, 5.71263043e-01,
5.27578736e-01, -8.70086780e-02, 4.00978766e-01],
[-1.71155996e+00, -1.35866849e+00, 7.66843214e-01,
-1.35821786e+00, -2.53682878e+00, 2.33003813e+00],
[-2.72949807e-01, -6.52074448e-01, 4.39292077e-01,
-4.99890346e-01, -3.05828277e-01, 3.51667405e-01],
[ 1.05228158e+00, 7.64781929e-01, 2.59030425e-01,
1.53251597e+00, 5.51933518e-01, -1.39435184e+00],
[-3.40666694e-01, -1.02394470e-01, 7.10133590e-01,
1.57309470e-01, -1.23213130e-01, 8.55885418e-01],
[-7.16844000e-01, -1.84076537e-01, -2.56369869e+00,
9.33426133e-01, 1.55492416e+00, -1.24951587e+00],
[-8.68574473e-02, -7.16864763e-01, 4.07809474e-01,
7.89535817e-01, 1.54761863e+00, -8.79109984e-01],
[-6.44915163e-01, -4.09765378e-01, 3.00150506e-01,
-8.68784983e-01, -1.05038802e+00, 1.60479716e+00],
[ 1.09817702e+00, -9.47379435e-02, 1.29748466e+00,
4.36084315e-01, 2.60967426e-03, -1.26638571e+00],
[ 1.61298931e-01, 4.45696781e-01, -2.95432646e+00,
9.23988198e-01, -1.84744710e-01, -3.74400569e-01],
[ 9.30834348e-01, 9.88358987e-01, -5.37702293e-01,
1.22332637e+00, 1.10733781e+00, -1.41475499e+00],
[ 1.49008100e+00, 1.51715112e+00, 1.10412350e+00,
1.43369595e+00, 1.39978813e+00, -1.31015656e+00],
[ 2.52791134e-01, 2.42628706e-02, 4.89546495e-01,
1.35651975e-01, 1.00990087e-01, 3.26616517e-01],
[-4.28650440e-01, -4.15575521e-01, -1.49066157e-01,
6.49273481e-02, 1.27758176e+00, -1.24305429e+00],
[ 2.81747471e-01, -1.51138242e-02, 1.18620310e+00,
6.11550108e-01, 6.57643438e-01, -2.07875846e-01],
[-1.62860149e+00, -1.29204685e+00, 8.76055007e-01,
-8.61185510e-01, 2.72884070e-01, -3.00251579e-01],
[-2.89775591e-02, -4.58944935e-02, 1.86393205e-01,
2.83582437e-01, -2.33198855e-01, 5.95843693e-01],
[ 1.11291967e+00, 1.32099448e+00, 9.23389281e-01,
9.76095144e-01, 1.53256874e+00, -1.05584050e+00],
[ 5.49941708e-01, -1.06404494e-01, -2.03025104e+00,
3.69783876e-01, -5.08036517e-01, 1.04620831e+00],
[ 1.60547630e-01, -3.88542941e-01, -1.14117162e+00,
-2.44514578e+00, 1.28530964e+00, -2.04951200e-01],
[-2.91278134e-01, 2.03615734e-01, 4.31891105e-01,
-1.20716914e-01, -2.05275379e-01, 2.09543280e-01],
[-8.39879102e-02, -1.42133957e-01, 2.21155977e-01,
-3.75182043e-01, -1.24610121e-01, -1.79113271e-01],
[ 1.06734102e+00, 5.48559756e-01, -3.06610248e+00,
2.97031258e-01, 1.99696925e+00, -7.44713373e-01],
[ 1.16379315e+00, 1.37776129e+00, -1.00397095e+00,
8.44958686e-01, 5.73199329e-01, -1.48322085e-01],
[-6.74534146e-01, 5.35579285e-02, -1.02887208e+00,
-6.98515922e-01, 1.35336575e+00, -1.24003431e+00],
[ 5.87231749e-01, 1.03348521e+00, -3.23060161e-01,
7.06434607e-01, -6.90571122e-01, 6.32351230e-01],
[-3.80270812e-01, -1.41093100e+00, 1.79398017e-01,
-9.56948230e-01, -3.16037155e-01, 9.37975090e-01],
[-3.37908510e-02, -1.59015043e+00, -7.77171529e-01,
7.02532371e-01, -5.74271117e-01, 1.06204838e-01],
[ 7.96988228e-01, -5.49345061e-01, -1.91604644e-01,
4.00940523e-01, 9.45183094e-01, -1.96264956e+00],
[ 6.74002805e-01, 3.34517314e+00, -2.72192267e-01,
-1.00009998e+00, 6.06100628e-01, -4.82884217e-01],
[-5.10347549e-01, -4.48581330e-01, -3.15824134e-01,
-4.92510164e-01, -6.40438031e-03, 1.96856875e-01],
[ 1.25630113e+00, 2.15658096e+00, -2.47587397e-01,
7.96093374e-01, 1.03336681e+00, -3.38756851e-01],
[ 9.78453000e-01, 1.54057068e-01, 1.08356714e+00,
7.36095370e-01, 9.16510695e-01, -6.55529139e-01],
[ 1.44884682e+00, 2.05497753e+00, -1.00373159e+00,
1.75807659e+00, 3.99678778e-01, -5.97764058e-01],
[ 4.48111213e-01, 2.45009017e-01, 4.14817282e-02,
1.88084347e+00, 1.86942038e+00, -1.87495438e+00],
[ 6.39264986e-01, -1.73817839e-01, -2.48957355e+00,
-2.70478669e-01, 3.67707787e-01, -7.82936259e-02],
[ 6.25956022e-01, 7.08942028e-01, -4.92139556e-01,
1.55959998e+00, 1.60936490e+00, -1.37707138e+00],
[ 3.46468406e-03, -5.80221591e-02, 8.56699378e-01,
-1.57311556e+00, 7.02150836e-02, 4.21075187e-01],
[ 7.62039740e-01, 2.55262608e-01, 8.58026425e-01,
1.34394919e+00, -1.23652422e-01, -3.53544690e-01],
[-1.17816393e-01, -2.25268232e-01, -5.08726992e-01,
4.02392764e-01, 7.69069752e-01, -9.54373891e-02],
[ 5.71631132e-01, 1.12726832e+00, 1.37285954e+00,
1.29921707e+00, 1.54463075e-01, -6.50961458e-01],
[-3.03161070e+00, -1.35627621e+00, 1.28228457e+00,
-1.58863061e+00, -8.68305156e-01, -1.17706720e-01],
[-3.19895877e-01, -2.73070121e-01, 1.10344469e-01,
7.32221235e-01, 4.21126285e-01, -1.24111849e-01],
[-5.35541037e-01, -1.04817560e+00, 1.13107674e-01,
-5.19481536e-01, -1.07653918e+00, 1.45166915e+00],
[ 1.37749376e+00, 2.06967224e+00, 1.03342075e+00,
3.60129778e-01, 1.18125200e+00, -1.72294656e+00],
[-5.58719885e-01, -9.33086936e-01, 3.27320542e-01,
9.97825305e-02, -5.34177276e-01, -1.53553186e-01],
[-2.17756147e-01, -2.83057826e-01, -1.15057885e+00,
-9.02011964e-01, 1.22040845e+00, -7.45152709e-01],
[-9.66288743e-01, -9.95747909e-01, 7.60416248e-01,
5.80423774e-01, 1.24447538e-01, -7.22197418e-01],
[-8.48884276e-01, -1.14344048e+00, 1.10948869e-01,
3.84525108e-01, -8.42457942e-01, -7.20442253e-04],
[ 5.11217435e-01, -9.29693301e-02, 3.85360723e-01,
1.45695782e-01, 1.08564833e+00, -6.99833046e-01],
[-1.96303916e+00, -6.90117766e-01, 2.70115427e-01,
-2.31438314e+00, -1.90070795e+00, 2.53689337e+00],
[-6.12862155e-01, 3.95314916e-01, -6.28826401e-01,
1.00432558e+00, 1.21454931e+00, -1.14703650e+00],
[-6.40820778e-01, 1.76738105e+00, -4.34158845e-01,
-1.95880190e+00, -6.46852958e-01, 6.48396695e-01],
[-6.24754977e-02, 2.36955791e-01, -7.06181374e-01,
-5.99382187e-01, 6.53424414e-01, -2.30024611e-01],
[ 1.23489386e+00, -1.47549684e-01, 5.87986383e-02,
3.20010005e-01, 4.58338698e-01, -9.66638515e-01],
[ 5.26336158e-01, 2.17542370e-01, 6.01233186e-01,
6.01633925e-01, 1.25008009e+00, -4.60496924e-01],
[ 4.42732842e-01, -2.28751942e+00, -1.07182966e-01,
-1.24856175e+00, -6.95979555e-01, 1.02781782e+00],
[ 2.44440855e-01, 2.12100527e-01, 8.73219238e-01,
2.12948099e+00, 9.10159412e-01, -6.83801744e-01],
[ 9.20630362e-02, 1.79092743e-01, -6.04521843e-01,
3.93210020e-01, -5.10872875e-02, -9.01347151e-01],
[-1.90500018e-01, -3.77676214e-01, 1.40173167e+00,
-6.76644044e-01, -5.94877340e-01, 1.16431008e+00],
[ 1.22055326e+00, 5.76189492e-01, 1.13972337e+00,
1.74745081e-01, 3.19772526e-01, 3.11688572e-01],
[-2.23343653e-01, -1.15253437e-01, 3.92435213e-01,
-2.85070809e-01, 5.49514272e-01, -4.71387384e-01],
[-2.03419071e+00, -6.26516482e-01, -2.05213605e-01,
-1.01632511e+00, -1.22356368e+00, 2.09040279e-01],
[-2.48208585e-01, 6.37987567e-01, 8.25699195e-01,
-1.03990657e+00, -6.57219120e-01, 7.03108275e-01],
[ 1.33278925e-01, -2.50996324e-01, 2.49996081e-01,
6.94763834e-01, 4.59955378e-01, -1.38575503e-01],
[-1.25145410e+00, 3.17642873e-01, -2.16452239e-02,
-1.11261887e+00, -1.33810528e+00, 2.84212909e-01],
[ 1.05412429e-01, 9.62104753e-01, -9.51999530e-03,
2.31124272e-01, 9.29254484e-01, -5.84621864e-01],
[-2.82752595e-01, -1.00763038e-01, 1.49771477e-01,
-7.81127368e-01, 1.10600040e+00, -1.03354257e+00],
[ 6.23017701e-02, -4.69149676e-02, 6.56474606e-01,
9.25057384e-01, 4.09338098e-01, -3.12559651e-01],
[-7.51786133e-01, -9.03077530e-01, 4.36249525e-01,
-7.56957751e-01, -1.52692591e+00, 2.13425814e+00],
[-5.52618543e-01, -8.24666042e-01, 6.60124236e-01,
-8.96155963e-01, -6.08543226e-01, 1.29536595e-02],
[-6.58158306e-01, -1.23012960e+00, -8.82904700e-01,
-8.24210805e-01, 1.25756607e-03, 3.65410610e-01],
[-3.55771528e-01, -9.40035435e-01, -7.09654842e-01,
6.27766387e-01, -5.05732477e-02, 4.99329011e-02],
[ 6.14123605e-01, 2.08646620e+00, 1.48454871e+00,
1.20886733e+00, 8.52635020e-01, -1.24475213e+00],
[-6.55693853e-01, -2.18584595e-01, 2.47348392e-01,
-1.89314446e-01, -5.02205260e-01, 9.44491123e-02],
[ 1.58617457e+00, 7.97110292e-01, 1.18959191e-01,
-4.41061011e-01, -5.62120845e-01, 7.86911707e-01],
[-4.41617876e-01, -7.18726629e-01, 8.28438586e-01,
-3.40672419e-01, -2.99811857e-01, 2.46776904e-01],
[-4.38669427e-01, 2.17491065e-01, 2.25606682e-01,
-2.14202060e-02, -1.09095078e-01, 6.85098299e-01],
[-1.03488502e+00, -1.24419495e+00, -1.36924538e-01,
-5.88683983e-02, -1.11152747e+00, 1.95226460e+00],
[ 6.13047931e-01, 5.96422528e-01, -3.11757828e+00,
3.91746020e-01, 9.27502200e-02, 5.18426272e-02],
[-5.18330456e-02, 1.37692159e+00, -1.08294970e+00,
-4.04799959e-01, -3.09276614e-01, 4.24008514e-01],
[-5.58457752e-01, -5.02145516e-01, 1.42395728e-01,
-6.84205280e-01, 8.81843285e-03, -1.96627258e-02],
[-6.17285171e-02, -1.10134851e-01, 1.22496465e+00,
-9.72875943e-01, -1.35134053e-01, 4.23881862e-01],
[-4.19565064e-01, -3.28739566e-01, 3.05450342e-01,
-4.13423496e-01, 3.23716320e-02, 1.30353765e-01],
[ 3.06583623e-01, -9.08665447e-01, 2.17643348e-01,
3.84861943e-01, 6.21764978e-02, 1.89300542e-01],
[-1.16145602e+00, -6.78658516e-01, -6.20818622e-01,
-5.48560462e-01, -1.94758103e+00, 2.39231451e+00],
[-5.33620819e-01, -1.70141038e+00, 6.17839206e-01,
-1.30733091e+00, -2.24324202e+00, 3.18795067e+00],
[-1.73552203e-02, -4.40813569e-01, 4.55334390e-01,
4.15213547e-01, 2.72688684e-01, -1.29939186e-03],
[-1.08938585e+00, -5.32551106e-01, -7.13046565e-01,
-3.15592112e-02, 8.25592857e-01, -5.20990424e-01],
[-1.02076352e+00, -9.12055617e-01, -5.51805464e-01,
-9.14764652e-01, -2.09721434e+00, 1.00236397e+00]])
# Initialise the Principal component analysis (PCA) algorithm with 2 components
pca = PCA(n_components=2)
# Apply the dimensionality reduction on the six mobility categories
pca_components = pca.fit_transform(mobility_trends_UK_standardised)
# Transformed values arranged as observations/samples in rows
# and number of components in columns
pca_components
array([[ 3.05048664e+00, -3.17134262e-01],
[ 4.05220779e-02, -3.23107889e-01],
[-1.66728899e+00, -9.58314439e-03],
[ 1.87640960e-01, 2.20371174e+00],
[-9.36968706e-01, 5.59535600e-01],
[-3.10038586e+00, -8.31210709e-01],
[-1.80731273e+00, 5.21006053e-01],
[ 3.72895574e+00, -6.37136213e-01],
[ 9.73241512e-01, 6.59415885e-01],
[ 1.82555443e+00, 7.43200181e-01],
[-7.88387434e-01, 2.10113385e+00],
[-2.64694215e+00, -1.56654726e-01],
[-5.21128998e-01, 7.71996986e-01],
[ 2.88282276e+00, 3.69933394e-02],
[ 7.95806245e-02, 6.71455150e-01],
[ 3.09657291e+00, -7.88220843e-01],
[ 4.17246911e+00, -1.29549204e+00],
[ 2.05759103e+00, -3.35013940e-01],
[-4.76628278e-01, 7.88229982e-01],
[ 1.95474734e+00, -3.13181233e-01],
[ 4.42040955e+00, 8.64391255e-01],
[-1.13652904e+00, 1.69484026e-01],
[-3.23532966e+00, 5.63305600e-01],
[ 6.64462577e-03, -8.03614441e-01],
[ 6.06777091e-03, 3.45992357e-01],
[ 6.94045764e-01, -5.52641275e-01],
[ 8.86133168e-01, -8.86819381e-02],
[-2.23501670e+00, -1.11539958e+00],
[-2.19232116e+00, -2.93389124e+00],
[-1.08759838e+00, -6.72046788e-01],
[-1.67095965e+00, -1.08452008e+00],
[ 1.34098351e-01, 1.59032516e+00],
[-2.00673439e+00, 5.90240968e-01],
[ 7.74526602e-01, -2.88043344e-01],
[-8.30288715e-01, -1.09432152e+00],
[-1.73796984e+00, 1.13179753e+00],
[-2.02516454e+00, -1.74505724e+00],
[-9.57214898e-01, -2.46917047e+00],
[-2.89027586e+00, -4.58575296e-01],
[ 1.05349976e+00, 2.49381531e-01],
[-7.99147025e-01, 9.72393052e-02],
[ 3.23777523e-03, 2.61323349e+00],
[-1.86889407e+00, -2.18463821e+00],
[-8.04253563e-01, -1.31575944e+00],
[ 5.75801802e+00, 3.29549538e-01],
[ 7.47404868e-01, -3.22291435e-01],
[ 6.09390139e-01, 3.01676098e-01],
[-2.33252485e+00, 1.77300318e+00],
[-9.71888291e-01, -4.70200550e-01],
[ 2.41040653e-01, 9.53640948e-01],
[ 3.86003082e+00, 7.50740681e-01],
[ 2.23720925e-01, -7.14428100e-01],
[ 4.20927731e+00, -8.65742676e-01],
[ 9.18627776e-01, -3.94862053e-01],
[-2.36462470e+00, -4.70201776e-01],
[ 5.90062753e-01, -7.48040735e-01],
[-1.35237809e+00, 2.39500247e+00],
[-1.15655267e+00, -4.46302266e-01],
[ 2.08749666e+00, -2.52827162e-01],
[-1.22947606e+00, -1.33521768e+00],
[-7.61608088e-01, 2.62135608e+00],
[-2.54409588e+00, 4.53659633e-01],
[-3.18022546e+00, -1.07995351e+00],
[-7.20799620e-02, -4.81291206e-01],
[-8.59679527e-01, 2.44635689e-01],
[-7.87912106e-01, -1.19741052e+00],
[ 1.35962335e+00, -7.33178259e-01],
[ 3.00396002e-01, -2.80062111e-01],
[-2.67717567e+00, -8.01910652e-01],
[ 3.61905954e-01, 1.79706169e+00],
[ 4.43242489e-01, 1.82238891e+00],
[ 2.96461767e-01, -4.00456765e-01],
[ 2.34515840e-01, -1.50004892e-01],
[-2.13867296e+00, 3.18389569e+00],
[-1.79419007e+00, 9.62754689e-01],
[-6.49185149e-01, 1.31426108e+00],
[-3.67981795e-01, 1.45779196e-01],
[ 1.75616656e+00, -9.69855787e-02],
[ 6.77395816e-01, 3.76009489e-01],
[-1.67671887e+00, 1.76192970e-01],
[-1.74816933e+00, 8.56887623e-01],
[ 7.18599097e-01, 3.75520941e-01],
[-2.43619115e+00, 3.70889175e-01],
[-1.55417418e+00, -1.06906504e+00],
[-2.72572155e+00, 7.77428437e-01],
[-2.88694212e+00, -2.34139233e-01],
[-3.18055273e-01, 2.49800819e+00],
[-2.66060406e+00, 3.63534553e-01],
[ 8.77899240e-01, -4.33989066e-01],
[-1.13158151e+00, -1.12833394e+00],
[-4.42362271e-01, 4.66876540e-01],
[-1.65391170e+00, -1.51175211e+00],
[ 2.95225548e+00, -1.12539691e+00],
[-3.22774726e-01, -2.58690709e-01],
[ 2.08733956e+00, -2.09796876e-01],
[-2.98045249e+00, -7.39130691e-01],
[ 7.67894239e-01, -4.91248794e-01],
[-3.29242285e-01, 1.44777214e+00],
[ 1.87126923e-01, -9.54311115e-01],
[ 1.07515504e+00, -4.14535492e-01],
[-1.09334242e+00, -2.64796296e-01],
[ 4.25783475e+00, -2.95174027e-02],
[-1.44701719e+00, 5.28405381e-01],
[ 1.02339970e+00, 9.35920489e-01],
[-2.29348699e-01, 9.21840129e-01],
[-1.29844792e+00, -6.28606110e-02],
[-1.38960255e+00, -5.36291856e-01],
[ 2.09824297e+00, 1.49265015e-01],
[-1.87165243e+00, -1.21773671e+00],
[-6.89792016e-01, 4.91731539e-01],
[ 1.37198238e+00, -1.28599091e+00],
[-8.50761161e-01, -1.02196425e+00],
[-2.09805119e-01, -2.52067293e-01],
[ 2.27652791e+00, 1.95688169e-01],
[ 9.48834294e-01, -5.75995451e-01],
[-5.40630961e-01, -3.67055405e-01],
[ 1.67585030e+00, 1.19838815e-01],
[-1.25665227e+00, 1.47519686e-01],
[-5.02861307e-01, 1.70139533e-01],
[-7.49134915e-01, -8.04724329e-01],
[ 2.76393315e+00, -5.10814314e-01],
[ 1.27038505e+00, -5.68920759e-01],
[ 1.32154816e+00, 9.36154333e-01],
[ 3.12556428e-01, 4.39868270e-01],
[-2.63579992e+00, -1.43324838e+00],
[ 7.45754614e-01, -2.85829995e-01],
[-1.98917555e-01, 2.01538157e-02],
[ 9.00949676e-01, -8.14746764e-01],
[ 4.90567982e-01, -2.15832748e-01],
[ 2.43937985e+00, -1.11579142e-01],
[-7.22385265e-01, 2.96453502e+00],
[-2.38463941e-02, 1.20484219e+00],
[ 7.38469276e-01, -2.08238125e-02],
[ 7.66860023e-01, -9.64093589e-01],
[ 5.48432138e-01, -2.22903710e-01],
[ 1.29598158e-01, -3.58566640e-01],
[ 3.07770134e+00, 4.07670688e-01],
[ 4.07090164e+00, -6.98024160e-01],
[-1.19942157e-01, -5.37613472e-01],
[ 7.43040484e-02, 7.29267596e-01],
[ 2.68592412e+00, 3.93415496e-01]])
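Before clustering on the two components, it can help to check how much of the variance in the six standardised variables they retain. The sketch below is illustrative only: it uses randomly generated data rather than the mobility data, and the names `X_demo` and `pca_demo` are placeholders that do not appear elsewhere in the lab. It reads the `explained_variance_ratio_` attribute of a fitted scikit-learn `PCA` object.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardised mobility data (141 rows, 6 columns).
# These values are random and are NOT the lab's data.
rng = np.random.default_rng(0)
latent = rng.normal(size=(141, 2))
mixing = rng.normal(size=(2, 6))
noise = 0.3 * rng.normal(size=(141, 6))
X_demo = StandardScaler().fit_transform(latent @ mixing + noise)

# Fit PCA with two components, as in the lab
pca_demo = PCA(n_components=2)
components_demo = pca_demo.fit_transform(X_demo)

# Share of the total variance captured by each component, and their sum
explained = pca_demo.explained_variance_ratio_
print(explained, explained.sum())
```

In the lab itself, the same attribute can be read from the fitted `pca` object. A low combined share would suggest that clusters found on two components may miss structure present in the six original variables.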
Now we can run the k-means algorithm on the two principal components:
k = 4
kmeans_k4_pca = KMeans(
n_clusters=k, init="k-means++", n_init=10, max_iter=300, random_state=0
)
kmeans_k4_pca.fit(pca_components)
KMeans(n_clusters=4, random_state=0)
# Cluster label assigned to each observation
kmeans_k4_pca.labels_
array([3, 0, 1, 2, 2, 1, 1, 3, 0, 0, 2, 1, 2, 3, 0, 3, 3, 3, 2, 3, 3, 1,
1, 0, 0, 0, 0, 1, 1, 1, 1, 2, 1, 0, 1, 2, 1, 1, 1, 0, 1, 2, 1, 1,
3, 0, 0, 2, 1, 0, 3, 0, 3, 0, 1, 0, 2, 1, 3, 1, 2, 1, 1, 0, 1, 1,
0, 0, 1, 2, 2, 0, 0, 2, 2, 2, 0, 0, 0, 1, 2, 0, 1, 1, 1, 1, 2, 1,
0, 1, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 1, 3, 1, 0, 2, 1, 1, 3, 1, 2,
0, 1, 0, 3, 0, 0, 0, 1, 0, 1, 3, 0, 0, 0, 1, 0, 0, 0, 0, 3, 2, 2,
0, 0, 0, 0, 3, 3, 0, 0, 3], dtype=int32)
# Add the 4-cluster assignment on the PCA components to your DataFrame
mobility_trends_UK_mean_NaNdrop["clusters_k4_pca"] = kmeans_k4_pca.labels_
mobility_trends_UK_mean_NaNdrop
sub_region_1 | Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential | clusters | clusters_k4 | clusters_k4_pca |
---|---|---|---|---|---|---|---|---|---|
Aberdeen City | -50.046371 | -10.722567 | 20.557692 | -46.127016 | -42.489919 | 14.567010 | 2 | 3 | 3 |
Aberdeenshire | -28.253669 | -11.248447 | 22.474684 | -39.953878 | -37.207661 | 12.222222 | 0 | 2 | 0 |
Angus Council | -25.955975 | -6.125786 | 13.982143 | -31.150943 | -33.542339 | 10.831551 | 1 | 0 | 1 |
Antrim and Newtownabbey | -29.377358 | -7.465409 | -29.134328 | -53.752621 | -33.679435 | 12.859031 | 0 | 1 | 2 |
Ards and North Down | -27.262055 | 0.452830 | 6.838298 | -41.721311 | -35.991935 | 12.679039 | 1 | 1 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Windsor and Maidenhead | -42.714885 | -11.178197 | 0.379455 | -43.693920 | -43.711694 | 16.709220 | 2 | 3 | 3 |
Wokingham | -39.044025 | -16.285115 | 30.458101 | -51.299790 | -45.034274 | 18.237327 | 2 | 3 | 3 |
Worcestershire | -36.025497 | -9.990563 | 26.511954 | -34.033107 | -33.779758 | 12.112019 | 0 | 2 | 0 |
Wrexham Principal Area | -42.293501 | -10.448637 | -1.860140 | -38.511530 | -31.306452 | 11.113895 | 0 | 2 | 0 |
York | -41.892276 | -12.343621 | 2.055319 | -47.364729 | -44.381048 | 14.039666 | 2 | 3 | 3 |
141 rows × 9 columns
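Note that cluster numbers are arbitrary: the same group of counties can be labelled `2` in one solution and `0` in another, so the columns `clusters_k4` and `clusters_k4_pca` cannot be compared value by value. A minimal sketch of one way to compare two clusterings is shown below. The label vectors are hypothetical, not the lab's results; `pd.crosstab` tabulates the overlap and `adjusted_rand_score` summarises agreement regardless of numbering.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors (not the lab's results): the second partition is
# the first one with the cluster numbers permuted (0 -> 1, 1 -> 2, 2 -> 0)
labels_a = [0, 0, 1, 1, 2, 2, 2]
labels_b = [1, 1, 2, 2, 0, 0, 0]

# Rows: clusters in the first solution; columns: clusters in the second
table = pd.crosstab(
    pd.Series(labels_a, name="solution_a"),
    pd.Series(labels_b, name="solution_b"),
)
print(table)

# Adjusted Rand index: 1.0 means identical partitions up to relabelling
ari = adjusted_rand_score(labels_a, labels_b)
print(ari)
```

Applied to the DataFrame above, the two inputs would be the `clusters_k4` and `clusters_k4_pca` columns.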
Visualising mobility clusters¶
Let’s plot the resulting clusters along the two principal components using a scatter plot.
# Set figure size
plt.figure(figsize=(11.7, 8.27))
# Scatterplot with the 1st principal component on the horizontal x axis
# and the 2nd principal component on the vertical y axis
grid = sns.scatterplot(
x=pca_components[:, 0],
y=pca_components[:, 1],
hue=kmeans_k4_pca.labels_,
alpha=0.8,
s=120,
)
# Add labels to the horizontal x axis and vertical y axis
labels = grid.set(xlabel="1st principal component", ylabel="2nd principal component")
# Plot the cluster centroids
sns.scatterplot(
x=kmeans_k4_pca.cluster_centers_[:, 0],
y=kmeans_k4_pca.cluster_centers_[:, 1],
hue=range(k),
s=220,
alpha=0.8,
ec="black",
legend=False,
)
# Add title 'Cluster' to the legend and locate it in the upper right of the plot
plt.legend(title="Cluster", loc="upper right")
In the figure above, we plot the 1st principal component against the 2nd principal component derived from the six mobility types. Each data point is a county in the UK. The larger dots with black edges mark the cluster centroids (a centroid is typically not itself a data point). Colour indicates cluster assignment.
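To see why a centroid is typically not a data point, recall that k-means places each centroid at the mean of the observations assigned to it. The toy example below is a sketch on four made-up points, not on the mobility data; the names `points` and `kmeans_demo` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four made-up points forming two obvious groups (left pair and right pair)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Sort centroids by their x coordinate because cluster numbering is arbitrary
order = np.argsort(kmeans_demo.cluster_centers_[:, 0])
centroids = kmeans_demo.cluster_centers_[order]
print(centroids)

# Check whether any centroid coincides with one of the observations
centroid_is_point = [bool((points == c).all(axis=1).any()) for c in centroids]
print(centroid_is_point)
```

Each centroid here sits halfway between the two points of its group, so neither centroid coincides with an observation.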
The figure above lacks county labels, which we need in order to interpret the results of k-means clustering.
Let’s add labels to data points so that we can associate each county with its name.
# Enlarge figure size
plt.figure(figsize=(32, 24))
# Scatterplot with the 1st principal component on the horizontal x axis
# and the 2nd principal component on the vertical y axis
grid = sns.scatterplot(
x=pca_components[:, 0],
y=pca_components[:, 1],
hue=kmeans_k4_pca.labels_,
alpha=0.9,
s=120,
)
# Add labels to the horizontal x axis and vertical y axis
labels = grid.set(xlabel="1st principal component", ylabel="2nd principal component")
# Plot the cluster centroids
sns.scatterplot(
x=kmeans_k4_pca.cluster_centers_[:, 0],
y=kmeans_k4_pca.cluster_centers_[:, 1],
hue=range(k),
s=240,
alpha=0.8,
ec="black",
legend=False,
)
# This for loop adds the county name next to each data point
for line in range(0, mobility_trends_UK_mean_NaNdrop.shape[0]):
grid.text(
pca_components[line, 0] + 0.1,
pca_components[line, 1], # where the labels should be positioned
mobility_trends_UK_mean_NaNdrop.index[line], # add labels to each data point
horizontalalignment="left",
size="small",
color="black",
weight=None,
)
# Add title 'Cluster' to the legend and locate it in the upper right of the plot
plt.legend(title="Cluster", loc="upper right");
Because PCA transforms our six variables into a two-dimensional space, we can no longer see how a particular cluster or county is positioned with respect to any particular mobility category.
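Part of that link can be recovered by inspecting the weights that each principal component places on the original variables, which scikit-learn stores in the `components_` attribute of a fitted `PCA` object. The sketch below uses random placeholder data, so the printed weights carry no substantive meaning; only the column names are taken from the lab, and `X_demo`, `pca_demo` and `loadings` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

columns = [
    "Retail_Recreation",
    "Grocery_Pharmacy",
    "Parks",
    "Transit_stations",
    "Workplaces",
    "Residential",
]

# Random placeholder with the same shape as the standardised UK data;
# these values are NOT the lab's data
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(141, 6))

pca_demo = PCA(n_components=2).fit(X_demo)

# One row per component, one column per original variable
loadings = pd.DataFrame(pca_demo.components_, columns=columns, index=["PC1", "PC2"])
print(loadings.round(2))
```

A large positive or negative weight indicates that the corresponding mobility category contributes strongly to that component.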
If you need to cluster counties with regard to a particular pair of variables, you can run k-means on that pair and plot the cluster assignment against those two variables. For example, below we run the k-means algorithm on two variables: retail and recreation mobility and workplaces mobility.
# We first fit k-means to two variables, retail_recreation and workplaces,
# using the standardised data. We set the number of clusters to k = 4,
# but keep in mind that we did not perform the elbow method
# on these two variables in particular.
k = 4
kmeans_k4_2vars = KMeans(
n_clusters=k, init="k-means++", n_init=10, max_iter=300, random_state=0
)
# 0 indicates the retail_recreation mobility variable
# and 4 indicates workplaces mobility variable
kmeans_k4_2vars.fit(mobility_trends_UK_standardised[:, [0, 4]])
KMeans(n_clusters=4, random_state=0)
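Since k = 4 was not chosen with the elbow method for this pair of variables, one could repeat that check on the two columns alone. The sketch below shows the general pattern on random placeholder data; `X_two_vars` stands in for `mobility_trends_UK_standardised[:, [0, 4]]`, and the range of candidate values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random placeholder for the two standardised columns (NOT the lab's data)
rng = np.random.default_rng(2)
X_two_vars = rng.normal(size=(141, 2))

# Fit k-means for a range of candidate k and record the inertia
# (within-cluster sum of squared distances) of each solution
k_values = range(1, 10)
inertias = []
for k_candidate in k_values:
    model = KMeans(
        n_clusters=k_candidate, init="k-means++", n_init=10, max_iter=300, random_state=0
    )
    model.fit(X_two_vars)
    inertias.append(model.inertia_)

for k_candidate, inertia in zip(k_values, inertias):
    print(k_candidate, round(inertia, 2))
```

Plotting `inertias` against `k_values` and looking for the point where the decrease levels off gives the elbow. On random data such as this placeholder, no clear elbow is expected.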
Plot the resulting clusters along the two mobility variables — retail and recreation mobility and workplaces mobility — using a scatter plot.
# Plot the clusters
plt.figure(figsize=(11.7, 8.27))
grid = sns.scatterplot(
x=mobility_trends_UK_standardised[:, 0],
y=mobility_trends_UK_standardised[:, 4],
hue=kmeans_k4_2vars.labels_,
alpha=0.8,
s=120,
)
# Plot the centers
sns.scatterplot(
x=kmeans_k4_2vars.cluster_centers_[:, 0],
y=kmeans_k4_2vars.cluster_centers_[:, 1],
hue=range(k),
s=220,
alpha=0.8,
ec="black",
legend=False,
)
grid.set(
xlabel="Retail and Recreation Mean Change Mobility",
ylabel="Workplaces Mean Change Mobility",
)
# Add title 'Cluster' to the legend and locate it in the upper right of the plot
plt.legend(title="Cluster", loc="upper right")
Let’s add UK county labels to the figure, as we did before.
# Enlarge figure size
plt.figure(figsize=(28, 22))
# Scatterplot with standardised retail and recreation mobility on the horizontal x axis
# and standardised workplaces mobility on the vertical y axis
grid = sns.scatterplot(
x=mobility_trends_UK_standardised[:, 0],
y=mobility_trends_UK_standardised[:, 4],
hue=kmeans_k4_2vars.labels_,
alpha=0.9,
s=120,
)
grid.set(
xlabel="Retail and Recreation Mean Change Mobility",
ylabel="Workplaces Mean Change Mobility",
)
# Plot the cluster centroids
sns.scatterplot(
x=kmeans_k4_2vars.cluster_centers_[:, 0],
y=kmeans_k4_2vars.cluster_centers_[:, 1],
hue=range(k),
s=240,
alpha=0.8,
ec="black",
legend=False,
)
# This for loop adds the county name next to each data point
for line in range(0, mobility_trends_UK_mean_NaNdrop.shape[0]):
grid.text(
mobility_trends_UK_standardised[line, 0] + 0.1,
mobility_trends_UK_standardised[
line, 4
], # where the labels should be positioned
mobility_trends_UK_mean_NaNdrop.index[
line
], # add labels to each data point iteratively
horizontalalignment="left",
size="small",
color="black",
weight=None,
)
# Add title 'Cluster' to the legend and locate it in the upper right of the plot
plt.legend(title="Cluster", loc="upper right");
Hands-on exercise¶
You would like to know whether mobility trends in the UK over the last year of the pandemic were similar to the mobility trends in other countries and, if so, in which countries in particular.
To find out, use k-means clustering to group the countries in the COVID-19 Community Mobility Reports data set according to their mobility across the mobility categories.
Write your Python code and Markdown below.
Below is a solution to the hands-on exercise.
# Compute mean mobility trends by country and remove NaN (Not a Number) values
mobility_trends_countries = (
mobility_trends.groupby("country_region")[
[
"Retail_Recreation",
"Grocery_Pharmacy",
"Parks",
"Transit_stations",
"Workplaces",
"Residential",
]
]
.mean()
.dropna()
)
mobility_trends_countries.head()
country_region | Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential |
---|---|---|---|---|---|---|
Afghanistan | 14.629630 | 37.950231 | 6.611236 | -4.188804 | -7.533917 | 4.368709 |
Angola | -12.054187 | -0.541706 | 6.636268 | -27.115071 | -11.890746 | 7.796813 |
Antigua and Barbuda | -18.242138 | -8.100629 | 33.349057 | -43.980952 | -33.560899 | 4.367725 |
Argentina | -41.844881 | -8.477221 | -59.743529 | -46.950210 | -12.739573 | 10.504889 |
Aruba | -19.974843 | -6.563941 | 12.723270 | -45.807128 | -21.149194 | 5.266667 |
# Data standardisation
scaler = StandardScaler()
StandardisedData = scaler.fit_transform(mobility_trends_countries)
StandardisedData
array([[ 2.60971986e+00, 2.72237961e+00, 3.47561843e-01,
1.37660526e+00, 1.40145901e+00, -6.80948462e-01],
[ 5.37783963e-01, -2.72062247e-02, 3.48512303e-01,
-1.55760852e-01, 9.02158250e-01, 6.76779321e-02],
[ 5.73040443e-02, -5.67161074e-01, 1.36277535e+00,
-1.28305749e+00, -1.58128178e+00, -6.81163370e-01],
[-1.77539368e+00, -5.94062123e-01, -2.17187512e+00,
-1.48151933e+00, 8.04881058e-01, 6.59065338e-01],
[-7.72363952e-02, -4.57391215e-01, 5.79630869e-01,
-1.40511702e+00, -1.58877118e-01, -4.84853223e-01],
[ 6.85588402e-01, 1.35974506e-01, -3.45209638e-01,
-3.95424093e-01, 9.96267366e-01, -1.50054232e-01],
[-9.59256359e-01, -4.91423251e-01, 1.70358047e-01,
-1.04091624e-02, -7.87004065e-01, -1.42716295e-02],
[-4.00453115e-01, -5.48873754e-01, -9.93222721e-01,
3.57482646e-01, 1.16189086e-01, 6.58432689e-01],
[ 2.80588000e-01, 6.12406852e-01, 7.55117751e-02,
1.14437068e+00, 1.53935251e+00, 5.75779225e-01],
[-7.71871615e-01, -1.49054175e+00, -9.97494429e-04,
-1.59206991e+00, -1.77993395e+00, 1.03694700e-01],
[ 3.08739143e-01, 1.67161537e-01, 6.71900417e-01,
1.07047510e+00, 3.02930773e-01, -1.84553224e+00],
[-6.62714327e-01, -1.62237222e-01, 1.59660009e+00,
-2.27766518e-01, -9.03535011e-01, 7.64825718e-01],
[-5.96197175e-02, 5.84356789e-02, 9.41737687e-02,
-1.06134513e+00, -1.23523870e+00, -6.04717583e-02],
[ 8.90963077e-01, 2.47462459e+00, 5.31298819e-01,
1.08774909e+00, 1.28879947e+00, -9.24507380e-01],
[-1.18471820e+00, -1.19493584e+00, -1.06310121e+00,
-1.06901514e+00, -1.54589556e-01, 1.40575016e+00],
[ 2.25368450e-01, 3.02058487e-01, 4.13225825e-01,
5.13928237e-01, 3.96301617e-01, -1.85771275e+00],
[ 1.27271809e+00, 1.73500103e+00, 9.79545504e-01,
1.10532642e+00, 7.39980250e-01, -1.71686413e-02],
[-4.61882443e-01, 7.37375413e-01, -1.09596198e+00,
-1.29264376e-01, 1.69589951e+00, 1.44929501e-01],
[ 3.46967054e-02, 2.40132651e-01, 5.35903503e-01,
5.28232291e-01, -1.03415179e-01, -1.22149250e+00],
[ 2.19300594e+00, 2.84270168e+00, 8.62167725e-01,
2.02791800e+00, 1.75426265e+00, -1.39616643e+00],
[ 1.82917067e-02, -1.04820437e+00, -7.09149643e-01,
-1.07118079e+00, -3.82246744e-01, 7.75248625e-01],
[ 1.08455597e+00, 6.31392933e-01, -5.08904929e-01,
2.19165550e+00, 8.57361761e-01, -1.14101040e+00],
[ 3.65005703e-01, 3.55186727e-02, 1.52413515e+00,
-8.26928312e-01, -4.13658797e-01, 3.04221019e-01],
[-1.44112442e+00, -1.07621542e+00, -1.08243975e+00,
-2.46797352e+00, -1.76142224e+00, 2.30220328e-01],
[-1.87870816e+00, -1.71853360e+00, -1.69105187e+00,
-1.18349566e+00, -2.79844659e-01, 2.02049532e+00],
[-1.20390316e+00, -8.43807783e-01, -1.08170059e+00,
-7.24153849e-01, 4.55320438e-01, 1.06722439e+00],
[-1.24096582e+00, -1.12887372e+00, -1.57856853e+00,
-1.08780832e+00, -6.01099888e-01, 8.50275728e-01],
[ 3.81318352e-01, 3.54428087e-01, 1.84889905e+00,
4.64224472e-01, -9.61585749e-02, -8.76848593e-01],
[-1.56975757e-01, 5.08740222e-01, 1.04396792e+00,
9.27085801e-01, 2.45472164e-01, -1.38746564e-01],
[ 1.44861882e+00, 1.44764869e+00, -1.62839254e-01,
1.69888816e+00, 1.15778143e+00, -8.06715212e-01],
[ 1.23934036e+00, -1.40082330e-01, 2.63231030e+00,
3.13547434e-02, -5.62324821e-01, -7.28117566e-02],
[-8.67416574e-01, -6.46741166e-01, -1.12970672e+00,
-6.97323561e-01, -1.50050416e+00, 5.58797657e-02],
[-2.58895312e-01, -5.75439484e-01, -9.21458086e-01,
-4.44944247e-01, -1.15329882e+00, 1.38601320e+00],
[ 6.02936229e-03, 2.81728908e+00, -5.35044981e-02,
1.58307991e-01, 8.63472906e-01, -1.00018395e+00],
[-5.06634907e-01, -5.65536060e-01, -1.05267817e+00,
-4.69905338e-01, -1.03188047e+00, 7.33236883e-02],
[ 6.11806991e-01, 3.62296169e-01, 1.33902110e+00,
8.62924345e-01, -3.14813743e-01, -5.69959274e-01],
[ 7.80561089e-01, -1.18055774e-01, -4.57306463e-01,
-2.09416726e-01, 6.06462989e-01, 6.41466978e-01],
[ 7.29302622e-01, 2.51623905e-01, 2.15083686e+00,
-4.85176822e-01, -1.47063067e-01, -4.70006546e-01],
[-6.57191514e-01, -7.80873179e-02, 1.26279338e+00,
4.06174460e-01, -7.28855645e-01, 2.53747128e-01],
[ 2.36962834e-01, -7.62992332e-01, 7.45473737e-01,
1.09461513e+00, 3.56426654e-01, -2.12691317e-01],
[-7.05638191e-01, -1.95306217e-01, -1.53334955e-02,
4.13918140e-01, -7.73970300e-01, -3.72016030e-01],
[-5.45730847e-01, -2.98970040e-02, 2.31379853e+00,
-8.51011365e-02, -1.28256400e-01, -4.78066133e-02],
[ 1.32254121e+00, 1.05570745e+00, -3.64732051e-01,
1.43533729e+00, 9.20304574e-01, 7.51091572e-01],
[-3.00954152e-01, 1.03619634e+00, 2.16237108e+00,
4.97592955e-01, -3.07546750e-01, -4.98008899e-01],
[-4.69571014e-01, -1.07233259e+00, -1.07699881e+00,
-8.04957091e-01, -5.57824834e-01, 4.23465773e-01],
[ 2.85764880e-01, -1.02597519e+00, -6.31108931e-01,
3.31947469e-01, -6.35487520e-01, -5.78897823e-01],
[-1.35510656e+00, -1.01540679e+00, -1.09023540e+00,
-9.45689787e-01, -1.11893544e+00, -4.14700614e-01],
[-1.02855317e-01, 1.22499269e+00, -6.44314566e-01,
2.56824887e-01, 6.30527743e-01, 6.71918254e-01],
[ 8.92127310e-01, 2.29386424e-01, 1.50473428e+00,
1.00388711e+00, 3.67339223e-02, -3.80945919e-01],
[-1.08182289e+00, 6.76489015e-01, -6.63625259e-01,
1.42759251e-01, 2.31669995e-01, 1.19269034e+00],
[ 2.62193985e-01, 1.41332519e-01, -3.62044131e-01,
-3.56423625e-01, -1.74059546e-01, -2.54434424e-01],
[ 1.14523025e+00, 2.05699951e+00, 4.26280866e-03,
1.44021928e+00, 5.83043304e-01, -5.23946861e-01],
[-1.13155958e+00, 1.69253780e-01, 2.75169920e-01,
-7.49139637e-01, -1.39161526e+00, 8.00672452e-01],
[-6.87765469e-01, -2.33985579e-02, 3.41621888e-01,
3.78260253e-02, -6.99813186e-01, 4.89300105e-01],
[-7.22326936e-01, -5.94639834e-01, 7.13854772e-01,
-3.70624953e-01, -7.44910147e-01, 3.74841029e-01],
[-8.39476322e-01, -1.15902052e+00, -1.03352859e+00,
-3.45466649e-01, -1.03308583e+00, 1.36618261e-01],
[ 7.37480150e-01, 6.61044393e-02, 8.75131152e-02,
1.50411332e-01, 9.19477021e-01, -3.74871551e-01],
[ 3.57109571e-02, 2.24336775e-01, -7.93705029e-01,
-1.62148327e+00, -8.16589366e-01, -6.25899086e-02],
[-2.11592974e-01, -3.03697124e-01, -4.32365023e-02,
2.92252461e+00, -7.39443974e-01, -1.11339656e+00],
[ 1.26647294e+00, 5.94257869e-03, -3.43534767e-01,
1.38010088e+00, 1.69702652e+00, 6.72628604e-01],
[-1.18456848e+00, -1.55512752e+00, -1.14150650e+00,
2.21895784e-01, -1.24664414e+00, 1.43569148e+00],
[-1.73578735e-01, -1.01294213e+00, -4.47074113e-01,
5.07691249e-01, -1.05599756e+00, -1.78675606e+00],
[ 9.37386549e-02, -4.19099473e-01, -7.09437029e-01,
-5.30462583e-01, 6.73218079e-01, -3.48199496e-01],
[ 2.21835069e-01, -2.73259256e-03, 1.12106553e+00,
1.23114455e-02, -1.09271001e+00, -2.78330668e-01],
[ 2.18286027e-01, 1.13549692e+00, 2.35285460e-01,
-1.10493340e+00, -1.05692457e+00, -1.36534621e+00],
[ 2.98101723e+00, 3.11024005e+00, 2.12204593e+00,
3.94831985e-01, 1.59577980e+00, -1.68582793e+00],
[-7.27894584e-01, 1.41651057e+00, 1.51666131e+00,
5.14749299e-01, -2.83665433e-01, -7.10842852e-01],
[-1.06061289e+00, -5.94832064e-01, 1.09639399e+00,
-1.71103732e-01, -1.24531026e+00, 8.47550774e-01],
[-1.05455079e+00, -3.66439793e-01, -8.97859471e-01,
-1.29984295e+00, -5.21713309e-01, 1.56948876e+00],
[ 1.16348395e+00, 4.75588900e-01, 1.15832278e+00,
4.59265128e-01, 1.39381727e+00, -1.37092735e+00],
[-1.36729495e-01, -2.74241777e-01, 3.27713121e-01,
3.61876437e-01, -4.61157768e-01, 3.92636550e-01],
[-2.76393329e-01, -1.01310328e+00, -1.55316146e+00,
-1.03734911e+00, -4.99124882e-01, -7.61345795e-01],
[-7.70003660e-01, -2.81684308e-01, -1.12465032e+00,
-2.87550191e-01, -1.17449018e-01, 5.73076839e-01],
[ 9.74925551e-03, -4.50831561e-01, -1.34377696e-01,
-5.48679872e-02, -5.06359301e-01, -1.64890400e+00],
[ 1.24065863e+00, 2.75304023e+00, 6.29501724e-01,
2.23676034e+00, 9.74103761e-01, -2.77509033e-01],
[-1.11776590e+00, 1.59406385e-01, -1.23565452e+00,
-3.21890653e-01, -5.45454061e-01, 9.98943208e-01],
[ 4.69402577e-01, -8.45542128e-02, -6.00500276e-01,
-1.13941863e-01, 1.52232371e+00, -1.06443750e+00],
[-1.90272391e+00, -1.92179648e+00, -1.19467045e+00,
-1.13599833e+00, -1.80738726e+00, 2.28047066e+00],
[ 2.88765451e-01, 3.13999559e-02, -6.45458333e-01,
-8.28018621e-01, 5.96789709e-01, -9.22651426e-01],
[-1.08659856e+00, -6.47769726e-01, -4.55224953e-01,
3.30054738e-01, -6.98351671e-01, 7.02804548e-01],
[-3.90295177e-02, 4.56995069e-02, 1.38454429e+00,
-8.28937753e-01, -7.20447101e-01, 3.95630030e-01],
[ 7.22462099e-01, -3.75039858e-01, -7.26638867e-01,
-4.57408124e-01, 1.56789462e+00, -5.06454188e-01],
[ 3.32725303e-01, 1.52992127e-01, -5.09452311e-01,
8.81417317e-01, 9.66447057e-01, -1.03471032e+00],
[ 2.44297544e+00, 2.03023347e-01, 3.21765590e-01,
2.34788654e+00, 3.90246625e-01, -1.55318205e+00],
[ 6.64022678e-01, -4.86691389e-02, -5.08901804e-01,
1.37873314e+00, 8.04864874e-01, 4.26128643e-01],
[-2.28607401e-01, -8.85450250e-02, 8.74516396e-01,
-4.14359983e-01, -5.50419381e-01, -8.75006572e-01],
[ 8.36895135e-01, 8.31491575e-02, 1.65905086e+00,
-4.04547987e-01, -6.04913782e-01, -1.22420188e-01],
[-5.73843041e-01, -1.19207072e+00, -1.21316319e+00,
-1.02636257e+00, -6.40534256e-01, -6.42782493e-01],
[ 9.74902967e-01, 7.55325962e-01, 1.73712948e-01,
1.45859689e+00, 1.09746131e+00, -7.84225137e-01],
[-2.35432806e+00, -1.97468330e+00, -1.98136722e+00,
-1.59429260e+00, -2.44408967e+00, 3.03471977e+00],
[ 2.36549057e+00, 2.28610650e+00, -5.37634567e-01,
1.52710614e+00, 3.21864322e+00, -1.04768289e+00],
[-5.52455912e-01, -8.55431804e-01, -1.27449222e+00,
-1.19894010e+00, 6.00776243e-01, 7.70940231e-01],
[-1.78662933e+00, -1.81856477e+00, -1.24490884e+00,
-1.68865348e+00, -6.07902102e-01, 2.25359431e+00],
[-1.58295775e+00, -1.11387360e+00, -8.54702332e-01,
-1.52631532e+00, -1.29484305e+00, 2.53025043e+00],
[ 6.94690480e-01, 1.72022752e-01, 8.95435053e-01,
3.41291496e-01, 3.79257596e-01, -4.55571767e-01],
[-7.15039222e-01, -1.06413572e-01, -4.92868154e-03,
-5.19526884e-01, -5.95455526e-01, 9.35535450e-01],
[-2.96899874e-01, -1.13370453e+00, -1.26899256e+00,
-1.59261294e+00, -5.55753864e-01, 4.31728817e-01],
[ 1.48571513e-01, 7.04614894e-01, -2.13492790e-01,
3.11214695e-01, 8.92491173e-01, 1.45981213e-01],
[-6.80774811e-02, -6.42804248e-02, -1.05594139e-01,
2.56951091e-01, -2.22742484e-01, -8.73205046e-01],
[ 3.08211423e-01, -1.60195428e-01, 6.52301211e-01,
1.32501737e+00, -2.68802818e-01, -1.10917433e+00],
[ 8.24337346e-01, -1.44237233e+00, 1.32557535e-01,
-1.06086149e-01, -1.04545042e-01, 2.37919140e+00],
[ 3.16141149e-01, 7.74592178e-02, -5.38805742e-01,
-7.79836271e-01, -2.33639636e-01, 6.92678163e-02],
[-1.07933951e-01, 1.50911972e-01, -5.62803762e-01,
4.47684124e-02, 5.25634699e-01, -9.09607672e-01],
[-9.46513642e-02, -4.50160387e-02, 2.89666314e-01,
3.72381505e-01, -6.94852381e-01, -6.59206118e-01],
[-5.08262713e-01, -1.39202530e-01, -7.66390821e-01,
-5.15550322e-01, -3.70069830e-01, 2.39850561e+00],
[-5.71632246e-01, -2.50361551e-01, 6.16595255e-01,
-1.75982070e-01, -3.67596173e-01, 3.09117696e-02],
[-9.03769214e-01, -1.08991709e+00, -1.39965133e-01,
-4.67027588e-01, -7.64202266e-01, 1.84469257e-01],
[-5.74441020e-02, -2.42177474e-01, -7.35944703e-01,
-2.40519263e-01, -2.32484464e-02, 1.20947028e+00],
[ 5.75324281e-01, 7.34283685e-01, 1.27604173e+00,
1.05176821e+00, 1.33183638e+00, -5.84849413e-01],
[-1.09601070e+00, -3.93182295e-01, 2.74186539e-01,
-2.20528053e-01, -4.80528307e-01, 1.12893681e-01],
[-1.11212020e+00, -1.16245567e+00, -7.38783737e-01,
-3.23554967e-01, -1.18232768e+00, 2.10921849e+00],
[ 8.50143823e-01, -1.13966637e-01, 1.66562682e+00,
-2.22157492e-01, -4.75398409e-01, -8.14568782e-03],
[-6.13501872e-01, 4.69708012e-02, 1.18117932e+00,
2.58239604e-01, -2.01121515e-01, 1.28358576e-01],
[ 5.09354707e-01, 1.24188035e-01, -2.51764528e-01,
4.80641941e-01, 1.77837273e+00, -9.73755711e-01],
[ 6.68980969e-01, -4.16822731e-01, -5.87897129e-01,
-1.83685952e-01, 9.99673596e-01, -1.15211319e+00],
[ 1.01969072e+00, -5.88470030e-02, -7.09130021e-02,
1.40541017e+00, 3.32660016e-01, -6.17531539e-01],
[ 4.35573964e-01, 2.85125220e-01, -8.51556783e-01,
-7.35506165e-01, 6.15003984e-01, -4.96975731e-01],
[-1.08256032e+00, -1.27497352e+00, -2.65358171e-02,
-1.87368367e+00, -1.48449982e+00, -5.66767330e-01],
[ 9.23056224e-01, 1.21560113e+00, 1.66357753e+00,
1.18002037e+00, 1.06280526e+00, 1.84262818e-01],
[-5.51726840e-01, 1.42828859e-01, -7.56255617e-01,
-1.15265003e+00, -1.00471762e+00, 2.76486683e-01],
[-7.54648060e-01, 6.26460624e-01, 8.85409218e-02,
-3.18897991e-03, -1.05399021e-01, 5.20897073e-02],
[-5.56232845e-01, -1.10094613e+00, -3.95866018e-01,
-4.32757885e-01, 1.52130683e+00, 9.19804874e-01],
[ 5.12025871e-02, 2.24005359e-03, 6.53180545e-01,
5.21450940e-01, -5.20969897e-01, -1.07163481e+00],
[ 2.65604181e-01, -1.86047121e-02, -1.01144650e+00,
-8.02645504e-01, 3.05018590e-01, 6.45865495e-01],
[-1.38442212e+00, -6.07023246e-01, 1.16367852e+00,
-8.92854726e-01, -1.84434336e+00, 1.15244882e+00],
[ 8.96489450e-01, 1.06708267e-01, 1.07430043e+00,
9.80791148e-01, -2.55278869e-01, -1.17378150e-01],
[-6.22636763e-01, -3.43112341e-01, -1.50835633e+00,
-9.00068762e-01, 1.33020572e+00, -1.29723100e-01],
[-8.70205981e-01, -3.66734412e-01, -8.65549414e-01,
-7.35106730e-01, -8.99572993e-02, 1.17479933e+00],
[ 8.77850377e-02, -1.16266411e-01, -6.59880105e-01,
5.61535775e-01, 1.84240262e+00, -2.47632362e+00],
[ 3.22928239e+00, 2.52842736e+00, 1.03820640e+00,
2.37401223e+00, 2.22079986e+00, -1.15385326e+00],
[ 1.84081097e+00, 9.05537356e-01, 6.46096859e-01,
1.26147187e+00, 1.26711600e+00, 2.22544422e-01],
[ 6.78955505e-01, 6.39393713e-01, -2.37046725e-01,
2.38439226e-01, 1.33892009e+00, 8.61031512e-01]])
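# Run PCA with two components. fit_transform returns the component scores,
# so pca_countries ends up holding a (countries x 2) array, not the PCA estimator.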
pca_countries = PCA(n_components=2)
pca_countries = pca_countries.fit_transform(StandardisedData)
pca_countries
array([[-3.99016687e+00, 7.45478364e-01],
[-6.06845801e-01, 2.89283308e-01],
[ 7.63659966e-01, -2.19230651e+00],
[ 2.32057062e+00, 2.10239002e+00],
[ 5.72102336e-01, -7.20155393e-01],
[-5.64747850e-01, 8.55920787e-01],
[ 9.41844934e-01, -6.74254911e-01],
[ 7.71024049e-01, 9.38877429e-01],
[-1.32368338e+00, 1.03287876e+00],
[ 2.47774589e+00, -1.19196393e+00],
[-1.68519673e+00, -6.15938441e-01],
[ 6.66457601e-01, -1.70202543e+00],
[ 8.98595586e-01, -8.52468676e-01],
[-3.02763069e+00, 3.62569086e-01],
[ 2.46416516e+00, 8.36796320e-01],
[-1.43099629e+00, -3.84012433e-01],
[-2.45986827e+00, -1.49673660e-01],
[-3.53389857e-01, 1.87222486e+00],
[-9.21811801e-01, -6.69920702e-01],
[-4.68033064e+00, 4.23018062e-01],
[ 1.57607279e+00, 3.90821641e-01],
[-2.37166071e+00, 8.79282529e-01],
[ 3.35456370e-03, -1.39772761e+00],
[ 3.34249607e+00, -3.69875237e-01],
[ 3.54260283e+00, 1.30182544e+00],
[ 1.80436162e+00, 1.17274771e+00],
[ 2.58774159e+00, 8.79736352e-01],
[-1.36658692e+00, -1.61384962e+00],
[-1.00894477e+00, -6.54182914e-01],
[-2.79744195e+00, 8.75387540e-01],
[-1.10997212e+00, -2.34111002e+00],
[ 1.94873016e+00, -8.37499479e-02],
[ 1.81075555e+00, 2.47568769e-01],
[-2.04783053e+00, 4.79725413e-01],
[ 1.44045285e+00, 1.67215156e-01],
[-1.30744201e+00, -1.25233176e+00],
[-9.74208903e-02, 8.64820168e-01],
[-9.90130797e-01, -1.83079072e+00],
[ 1.88557279e-01, -1.39924217e+00],
[-6.76024133e-01, -3.85374162e-01],
[ 4.17152904e-01, -5.40002802e-01],
[-3.23711846e-01, -1.94389918e+00],
[-1.71625520e+00, 1.14458868e+00],
[-1.23234464e+00, -1.92954160e+00],
[ 1.74984389e+00, 4.94598485e-01],
[ 4.03347440e-01, 1.37301861e-02],
[ 2.12305719e+00, -3.29723304e-02],
[-4.32965084e-01, 1.04893747e+00],
[-1.55831055e+00, -1.12525093e+00],
[ 6.89855941e-01, 8.39591676e-01],
[ 4.38759722e-02, 1.43670366e-01],
[-2.53534797e+00, 4.53710835e-01],
[ 1.55320122e+00, -9.78265640e-01],
[ 6.80341516e-01, -6.29878654e-01],
[ 1.00254362e+00, -1.01185907e+00],
[ 1.83430181e+00, 1.24819785e-01],
[-9.73263581e-01, 4.53666840e-01],
[ 1.11254053e+00, 7.84332061e-02],
[-1.14112634e+00, -4.97929405e-01],
[-1.52715296e+00, 1.52008806e+00],
[ 2.52856784e+00, 3.09175967e-01],
[ 2.05912854e-01, -6.16036147e-01],
[ 1.88781138e-01, 8.54548922e-01],
[-1.08495433e-01, -1.55613670e+00],
[-2.99333532e-01, -1.03978979e+00],
[-4.87791019e+00, -7.25434649e-01],
[-1.11014169e+00, -1.45684088e+00],
[ 1.33845540e+00, -1.53123283e+00],
[ 2.28182526e+00, 5.52463272e-01],
[-2.36185105e+00, -2.38414033e-01],
[ 2.64381591e-01, -4.58973626e-01],
[ 1.40771913e+00, 6.92734687e-01],
[ 1.20495805e+00, 8.50272455e-01],
[-1.48009628e-01, -5.08049920e-01],
[-3.48733778e+00, 3.01272427e-01],
[ 1.54303122e+00, 7.59552832e-01],
[-9.54089064e-01, 1.18778300e+00],
[ 4.17793767e+00, 6.04290791e-02],
[-1.81287250e-01, 6.71872745e-01],
[ 1.33705960e+00, 1.04492404e-03],
[ 3.87525832e-01, -1.47473627e+00],
[-5.68614641e-01, 1.40464193e+00],
[-1.22710308e+00, 8.41111305e-01],
[-3.10421021e+00, -4.77352886e-02],
[-9.06691968e-01, 1.04600550e+00],
[-3.03054318e-02, -1.20383735e+00],
[-5.49816064e-01, -1.64369898e+00],
[ 1.62798732e+00, 3.39296439e-01],
[-2.21603801e+00, 5.12930712e-01],
[ 5.37359541e+00, 3.95953743e-01],
[-4.33276444e+00, 2.40976363e+00],
[ 1.59329862e+00, 1.37487471e+00],
[ 3.84986210e+00, 7.81664593e-01],
[ 3.61989016e+00, 1.71004173e-01],
[-1.13535844e+00, -5.02391417e-01],
[ 1.19789730e+00, -2.44059859e-01],
[ 2.09546690e+00, 6.22797015e-01],
[-7.61788381e-01, 7.64710667e-01],
[-2.55044921e-01, -2.00917469e-01],
[-1.14432769e+00, -8.01725558e-01],
[ 1.19099093e+00, 2.57593788e-01],
[ 4.25489781e-01, 2.89514482e-01],
[-4.16586267e-01, 5.85427198e-01],
[-1.51139763e-01, -7.44410317e-01],
[ 1.78476026e+00, 7.64334221e-01],
[ 4.41815024e-01, -7.46788241e-01],
[ 1.54065386e+00, -4.19883445e-01],
[ 9.11480557e-01, 7.62340541e-01],
[-2.17469657e+00, -2.25271922e-01],
[ 9.50845494e-01, -5.67983115e-01],
[ 2.65995183e+00, 1.47999176e-01],
[-5.56177937e-01, -1.55153160e+00],
[-5.47530011e-02, -1.04896657e+00],
[-1.49455359e+00, 1.11511490e+00],
[-6.99086103e-01, 8.53762995e-01],
[-1.41131205e+00, 2.59040455e-01],
[-1.97530806e-01, 9.44954036e-01],
[ 2.29436837e+00, -1.14446541e+00],
[-2.33809453e+00, -5.01934493e-01],
[ 1.41538464e+00, -1.83732321e-02],
[ 1.12519467e-01, -1.44334893e-01],
[ 8.13817016e-01, 1.27008292e+00],
[-6.33102836e-01, -9.86910770e-01],
[ 6.41872747e-01, 1.07260314e+00],
[ 2.14135859e+00, -1.92968726e+00],
[-1.15813831e+00, -9.14146181e-01],
[ 7.11885073e-01, 1.84897121e+00],
[ 1.62332945e+00, 7.42788577e-01],
[-1.68141608e+00, 1.16900093e+00],
[-5.32821171e+00, 6.67290840e-01],
[-2.44390245e+00, 4.73591045e-01],
[-8.58207885e-01, 1.20053359e+00]])
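Before clustering on two components, it can help to check how much of the original variance they retain. The sketch below is a minimal illustration on synthetic data, not the lab's mobility dataset: the array `synthetic_standardised` and the names `pca_model` and `scores` are assumptions introduced here. It keeps the fitted estimator under its own name so that `explained_variance_ratio_` remains accessible after the transformation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardised mobility data (hypothetical values):
# 132 rows and 6 columns driven mostly by one shared latent factor plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(132, 1))
synthetic = latent @ rng.normal(size=(1, 6)) + 0.5 * rng.normal(size=(132, 6))
synthetic_standardised = StandardScaler().fit_transform(synthetic)

# Keep the fitted estimator under its own name so its attributes stay accessible
pca_model = PCA(n_components=2)
scores = pca_model.fit_transform(synthetic_standardised)

# Share of total variance captured by each component, and by both together
explained = pca_model.explained_variance_ratio_
print(explained, explained.sum())
```

If the two components together capture only a small share of the variance, clusters found in the reduced space may miss structure present in the original six variables.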
# Select optimal number of clusters, k
Sum_of_squared_differences_countries = []
K = range(1, 31)
for k in K:
    kmeans_countries = KMeans(n_clusters=k)
    kmeans_countries.fit(pca_countries)
    # inertia_ is the within-cluster sum of squared distances to the nearest centroid
    Sum_of_squared_differences_countries.append(kmeans_countries.inertia_)
Sum_of_squared_differences_countries
[593.6468345005125,
304.3305477620855,
206.67194353683982,
148.95634380079338,
121.28174974759105,
97.43999961101579,
80.83841686606813,
68.90853268223835,
60.22874329306373,
54.14883465310518,
48.573875243346166,
45.43947822211138,
40.28743116061935,
37.842849347016006,
34.45415389681908,
33.05362269834717,
31.83094041245218,
28.85520473381342,
26.095529452652656,
24.52245402734946,
23.495979871278994,
22.14922258419889,
19.607214460726738,
18.627583942264145,
17.74833201042812,
16.640232091054198,
16.03881696384903,
14.79872891026243,
13.496850551112397,
13.02564592378948]
# Plot the number of clusters against the within-cluster sum of squared distances (inertia)
# Set figure size and font scale
plt.figure(figsize=(11.7, 8.27))
sns.set(font_scale=1.5)
# Generate the plot
grid = sns.lineplot(x=K, y=Sum_of_squared_differences_countries)
# Add x and y labels
labels = grid.set(xlabel="Number of clusters, k", ylabel="Total squared distances")
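Reading the elbow off a plot is somewhat subjective, so one rough complement is to look at how much each additional cluster reduces the inertia in proportional terms. The sketch below does this for the first eight inertia values printed in the output above; the name `relative_drop` is introduced here for illustration, and the calculation is a heuristic rather than a formal selection rule.

```python
# First eight inertia values copied from the output above (k = 1, ..., 8)
inertia = [
    593.6468345005125,
    304.3305477620855,
    206.67194353683982,
    148.95634380079338,
    121.28174974759105,
    97.43999961101579,
    80.83841686606813,
    68.90853268223835,
]

# Proportional reduction in inertia achieved by going from k - 1 to k clusters
relative_drop = {
    k: (inertia[k - 2] - inertia[k - 1]) / inertia[k - 2]
    for k in range(2, len(inertia) + 1)
}

for k, drop in relative_drop.items():
    print(f"k={k}: inertia falls by {drop:.1%}")
```

For these values, the reduction at k = 4 is roughly 28%, while each further cluster up to k = 8 reduces inertia by less than 20%, which is broadly consistent with choosing k = 4.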
# k = 4 appears optimal so we specify n_clusters=4 and run the KMeans algorithm
kmeans_countries_k4 = KMeans(n_clusters=4)
kmeans_countries_k4.fit(pca_countries)
KMeans(n_clusters=4)
# Labels of clusters each country belongs to
kmeans_countries_k4.labels_
array([3, 0, 2, 1, 2, 0, 2, 0, 0, 1, 2, 2, 2, 3, 1, 2, 3, 0, 2, 3, 1, 3,
2, 1, 1, 1, 1, 2, 2, 3, 2, 1, 1, 3, 1, 2, 0, 2, 2, 2, 2, 2, 0, 2,
1, 2, 1, 0, 2, 0, 0, 3, 1, 2, 2, 1, 0, 1, 2, 0, 1, 2, 0, 2, 2, 3,
2, 2, 1, 3, 2, 1, 1, 2, 3, 1, 0, 1, 0, 1, 2, 0, 0, 3, 0, 2, 2, 1,
3, 1, 3, 1, 1, 1, 2, 1, 1, 0, 2, 2, 1, 0, 0, 2, 1, 2, 1, 1, 3, 2,
1, 2, 2, 0, 0, 0, 0, 1, 3, 1, 2, 0, 2, 0, 1, 2, 0, 1, 0, 3, 3, 0],
dtype=int32)
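k-means starts from randomly chosen centroids, so the integer labels above (and sometimes the cluster memberships themselves) can differ from one run to the next. A minimal sketch of one way to make a run repeatable is shown below, assuming synthetic points in place of the PCA scores; `points`, `fit_a` and `fit_b` are hypothetical names, and the values of `random_state` and `n_init` are arbitrary choices, not settings used in this lab.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic two-dimensional points standing in for the PCA scores (hypothetical)
rng = np.random.default_rng(42)
points = rng.normal(size=(132, 2))

# Fixing random_state seeds the centroid initialisation; n_init=10 repeats the
# algorithm from ten different starts and keeps the solution with lowest inertia
fit_a = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)
fit_b = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)

# With the same seed and data, the two fits should produce the same labels
same_labels = np.array_equal(fit_a.labels_, fit_b.labels_)
print(same_labels)
```

Even with a fixed seed, the label numbers carry no meaning beyond identifying group membership: cluster 1 is not "closer" to cluster 2 than to cluster 3.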
# Plot the clusters along the two principal components
sns.set(font_scale=1.3)
plt.figure(figsize=(20.7, 16.27))
grid = sns.scatterplot(
    x=pca_countries[:, 0], y=pca_countries[:, 1], hue=kmeans_countries_k4.labels_
)
for label in range(0, mobility_trends_countries.shape[0]):
    grid.text(
        pca_countries[label, 0],
        pca_countries[label, 1],
        mobility_trends_countries.index[label],
    )
# Add the cluster membership as a new column
mobility_trends_countries["clusters_countries_k4"] = kmeans_countries_k4.labels_
mobility_trends_countries
country_region | Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential | clusters_countries_k4 |
---|---|---|---|---|---|---|---|
Afghanistan | 14.629630 | 37.950231 | 6.611236 | -4.188804 | -7.533917 | 4.368709 | 3 |
Angola | -12.054187 | -0.541706 | 6.636268 | -27.115071 | -11.890746 | 7.796813 | 0 |
Antigua and Barbuda | -18.242138 | -8.100629 | 33.349057 | -43.980952 | -33.560899 | 4.367725 | 2 |
Argentina | -41.844881 | -8.477221 | -59.743529 | -46.950210 | -12.739573 | 10.504889 | 1 |
Aruba | -19.974843 | -6.563941 | 12.723270 | -45.807128 | -21.149194 | 5.266667 | 2 |
... | ... | ... | ... | ... | ... | ... | ... |
Venezuela | -30.187251 | -5.294821 | -25.338645 | -35.782869 | -20.547809 | 12.866534 | 1 |
Vietnam | -17.849583 | -1.788475 | -19.921904 | -16.383344 | -3.686304 | -3.852658 | 0 |
Yemen | 22.608782 | 35.235060 | 24.800839 | 10.733753 | -0.384462 | 2.203187 | 3 |
Zambia | 4.727092 | 12.515936 | 14.473795 | -5.911355 | -8.706175 | 8.505976 | 3 |
Zimbabwe | -10.236083 | 8.790144 | -8.785682 | -21.217305 | -8.079623 | 11.429731 | 0 |
132 rows × 7 columns
# Check in which cluster the United Kingdom was assigned
UK_cluster = mobility_trends_countries[
    mobility_trends_countries.index == "United Kingdom"
]
UK_cluster
country_region | Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential | clusters_countries_k4 |
---|---|---|---|---|---|---|---|
United Kingdom | -36.80968 | -8.658667 | 28.105415 | -38.142992 | -35.856338 | 12.764187 | 1 |
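# Access the UK cluster label. Use .iloc for positional access: the index holds
# country names, and integer lookup with [] on such a Series is deprecated in pandas.
UK_cluster.clusters_countries_k4.iloc[0]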
1
# Identify which other countries were assigned to the same cluster as
# the United Kingdom. These countries were found to be similar to
# the United Kingdom in terms of mobility trends since mid-February 2020.
mobility_trends_countries[
    mobility_trends_countries.clusters_countries_k4
    == UK_cluster.clusters_countries_k4.iloc[0]
]
country_region | Retail_Recreation | Grocery_Pharmacy | Parks | Transit_stations | Workplaces | Residential | clusters_countries_k4 |
---|---|---|---|---|---|---|---|
Argentina | -41.844881 | -8.477221 | -59.743529 | -46.950210 | -12.739573 | 10.504889 | 1 |
Barbados | -28.920833 | -21.027198 | -2.568820 | -48.604196 | -35.294311 | 7.961740 | 1 |
Bolivia | -34.237756 | -16.888959 | -30.541595 | -40.778590 | -21.111781 | 13.924102 | 1 |
Cambodia | -18.744566 | -14.834839 | -21.219523 | -40.810991 | -23.098286 | 11.036915 | 1 |
Cape Verde | -37.539932 | -15.226971 | -31.050916 | -61.708897 | -35.132780 | 8.541126 | 1 |
Chile | -43.175436 | -24.218896 | -47.080019 | -42.491373 | -22.204740 | 16.739138 | 1 |
Colombia | -34.484833 | -11.973455 | -31.031449 | -35.618999 | -15.789791 | 12.373928 | 1 |
Costa Rica | -34.962151 | -15.964143 | -44.117530 | -41.059761 | -25.007968 | 11.380478 | 1 |
Dominican Republic | -30.151327 | -9.214685 | -32.295794 | -35.217581 | -32.856045 | 7.742787 | 1 |
Ecuador | -22.314371 | -8.216520 | -26.811120 | -31.441646 | -29.826379 | 13.833723 | 1 |
El Salvador | -25.504932 | -8.077880 | -30.267082 | -31.815097 | -28.766900 | 7.822666 | 1 |
Guatemala | -25.027598 | -15.172614 | -30.907618 | -36.827924 | -24.630356 | 9.426033 | 1 |
Honduras | -36.432134 | -14.375700 | -31.256232 | -38.933476 | -29.526529 | 5.587909 | 1 |
Ireland | -33.553143 | 2.208573 | 4.704640 | -35.992820 | -31.905896 | 11.153335 | 1 |
Jamaica | -29.791493 | -16.386174 | -29.762737 | -29.953326 | -28.777418 | 8.112504 | 1 |
Jordan | -18.520229 | 2.979689 | -23.446470 | -49.044259 | -26.888299 | 7.200291 | 1 |
Kuwait | -34.235828 | -21.931346 | -32.606566 | -21.464818 | -30.640898 | 14.061209 | 1 |
Malaysia | -32.561370 | -5.290696 | -26.189600 | -44.232086 | -24.315252 | 14.673892 | 1 |
Mauritius | -22.539723 | -14.343453 | -43.448380 | -40.304823 | -24.118148 | 4.000554 | 1 |
Mexico | -28.896777 | -4.104189 | -32.162622 | -29.086818 | -20.787698 | 10.111131 | 1 |
Morocco | -33.375498 | 2.070717 | -35.086155 | -29.600598 | -24.522410 | 12.061255 | 1 |
Myanmar (Burma) | -43.484728 | -27.064409 | -34.006752 | -41.780749 | -35.533865 | 17.929615 | 1 |
Nepal | -32.974104 | -9.229084 | -14.531873 | -19.846614 | -25.856574 | 10.705179 | 1 |
Oman | -26.370485 | -16.848849 | -34.493798 | -40.140449 | -25.352067 | 4.543478 | 1 |
Panama | -49.300797 | -27.804781 | -54.726096 | -48.637450 | -41.089641 | 21.383466 | 1 |
Paraguay | -26.095047 | -12.136182 | -36.109029 | -42.722443 | -14.520564 | 11.017186 | 1 |
Peru | -41.989582 | -25.619250 | -35.329888 | -50.049217 | -25.067323 | 17.806543 | 1 |
Philippines | -39.366559 | -15.754154 | -25.052964 | -47.620418 | -31.061475 | 19.073404 | 1 |
Portugal | -28.188907 | -1.650543 | -2.672356 | -32.557503 | -24.958716 | 11.770899 | 1 |
Puerto Rico | -22.803820 | -16.031771 | -35.964184 | -48.612320 | -24.612285 | 9.463872 | 1 |
Rwanda | -8.363755 | -20.352866 | 0.948637 | -26.371871 | -20.675099 | 18.381676 | 1 |
Singapore | -25.525896 | -2.109562 | -22.727092 | -32.498008 | -22.992032 | 18.470120 | 1 |
Slovenia | -30.619501 | -15.418783 | -6.228830 | -31.772042 | -26.431177 | 8.331623 | 1 |
South Africa | -19.719944 | -3.551126 | -21.925228 | -28.383172 | -19.965717 | 13.025299 | 1 |
Sri Lanka | -33.302789 | -16.434263 | -22.000000 | -29.625498 | -30.079681 | 17.145418 | 1 |
The Bahamas | -32.922096 | -18.009420 | -3.241427 | -52.817518 | -32.716393 | 4.891566 | 1 |
Trinidad and Tobago | -26.085657 | 1.838645 | -22.460159 | -42.029880 | -28.529880 | 8.752988 | 1 |
United Kingdom | -36.809680 | -8.658667 | 28.105415 | -38.142992 | -35.856338 | 12.764187 | 1 |
Venezuela | -30.187251 | -5.294821 | -25.338645 | -35.782869 | -20.547809 | 12.866534 | 1 |
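A list of cluster members says which countries were grouped together but not what the clusters have in common. One way to characterise them is to average each mobility variable within each cluster. The sketch below illustrates the idea on six rows and three columns copied from the tables above; `sample` and `profile` are names introduced here, and because only a handful of rows are included, the resulting means are illustrative and are not the actual cluster averages.

```python
import pandas as pd

# A few rows copied from the tables above (three of the six mobility columns)
sample = pd.DataFrame(
    {
        "Retail_Recreation": [14.629630, -12.054187, -18.242138,
                              -41.844881, -36.809680, 22.608782],
        "Parks": [6.611236, 6.636268, 33.349057,
                  -59.743529, 28.105415, 24.800839],
        "Residential": [4.368709, 7.796813, 4.367725,
                        10.504889, 12.764187, 2.203187],
        "clusters_countries_k4": [3, 0, 2, 1, 1, 3],
    },
    index=pd.Index(
        ["Afghanistan", "Angola", "Antigua and Barbuda",
         "Argentina", "United Kingdom", "Yemen"],
        name="country_region",
    ),
)

# Mean mobility change per cluster, plus the number of countries in each cluster
profile = sample.groupby("clusters_countries_k4").mean()
profile["n_countries"] = sample.groupby("clusters_countries_k4").size()
print(profile)
```

Applied to the full `mobility_trends_countries` table, the same `groupby` call would give one row per cluster, which makes it easier to see, for example, whether a cluster is defined mainly by large falls in retail and transit activity.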