Discovering patterns in data

This lab will first introduce you to key concepts in machine learning for social science. Our focus will be on a particular branch of machine learning called unsupervised learning, which includes techniques for clustering and dimensionality reduction. We will then focus on hands-on data analysis with the scikit-learn library for machine learning in Python. Our research objective is to group UK counties with similar mobility trends using two popular techniques of unsupervised learning: k-means clustering and Principal Component Analysis (PCA).

Key themes

  • Definition of machine learning.

  • Supervised and unsupervised learning.

  • Introduction to unsupervised learning techniques, including clustering (k-means) and dimensionality reduction (Principal Component Analysis (PCA)).

  • Hands-on machine learning with scikit-learn.

  • Data-informed model parameter selection.

Learning resources

M Molina & F Garip. 2019. Machine learning for sociology. Annual Review of Sociology. Link to an open-access version of the article available at the Open Science Framework.

Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane. 2021. Chapter 7: Machine Learning. In Big Data and Social Science (2nd edition).

What is Machine Learning? OxfordSparks.

Additional resources

Kosuke Imai. 2018. Chapter 3.7.3: The k-means algorithm. In Quantitative Social Science. Princeton University Press.

Jake VanderPlas. 2016. In Depth: k-Means Clustering. In Python Data Science Handbook.

Sebastian Raschka. 2018. Python Machine Learning. Packt Publishing.


Machine Learning: What is it? What is it good for?

Field of study that gives computers the ability to learn [from data] without being explicitly programmed.

—Arthur Samuel, 1959

Data science tasks we can solve using machine learning

  1. Pattern discovery using unsupervised machine learning

  2. Prediction using supervised machine learning

Unsupervised and Supervised learning

Two types of machine learning are often distinguished in the literature: unsupervised learning and supervised learning.

  1. Unsupervised learning — no outcome variable / labeled data are available, and the structure of data is unknown. The goal of unsupervised learning is to explore the structure of data and discover hidden structures and meaningful information without the guidance of outcome variable / labeled data. To uncover such hidden structures in data, we use unsupervised learning techniques, including clustering (e.g., k-Means) and dimensionality reduction (e.g., Principal Component Analysis (PCA)).

Unsupervised Learning, Machine Learning’s course by Andrew Ng

  2. Supervised learning — learn a model from labeled training data (data with a known outcome variable) that enables us to make predictions about unseen or future data. The learning is called supervised because the outcome variable or labels (e.g., email Spam or Ham, where ‘Ham’ is e-mail that is not Spam) that guide the learning process are already known.

Supervised learning, Machine learning’s course by Andrew Ng

In this lab, we will be focusing on unsupervised learning.

Research problem: clustering counties by mobility

Let’s formulate our simple research problem: to inform a public health intervention, we need to group a number of counties in the UK with similar mobility trends. We frame this problem as a clustering task and perform k-means clustering to sort the UK counties into clusters with similar mobility trends.

k-means clustering

Clustering is an exploratory data analysis (EDA) task that aims to group a set of observations into subgroups or clusters (without any prior information about cluster membership) such that observations assigned to the same cluster are more similar to each other than those in other clusters. To cluster observations in our mobility data, we will employ the k-means algorithm.

The k-means algorithm

The k-means algorithm is an iterative algorithm in which a set of operations is performed repeatedly until the results no longer change noticeably. The goal of the algorithm is to split the data into k groups of similar observations, where each group is associated with its centroid, which is equal to the within-group mean. This is done by first assigning each observation to its closest cluster and then computing the centroid of each cluster based on these new cluster assignments. These two steps are iterated until the cluster assignments no longer change.

The k-means algorithm produces the prespecified number of clusters k and consists of the following steps:

  1. Choose the initial centroids of k clusters.

  2. Given the centroids, assign each observation to a cluster whose centroid is the closest to that observation.

  3. Choose the new centroid of each cluster whose coordinate equals the within-cluster mean of the corresponding variable.

  4. Repeat Steps 2 and 3 until cluster assignments no longer change.

—Kosuke Imai. 2018. Quantitative Social Science. Princeton University Press.

See also Jake VanderPlas’ Python Data Science Handbook.

On k-Means Advantages and Disadvantages, read here.
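The four steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration only, not the implementation used by scikit-learn: the function name simple_kmeans, the toy data, and the choice of picking random observations as initial centroids are all assumptions made for the example.

```python
import numpy as np


def simple_kmeans(X, k, max_iter=300, seed=0):
    """Toy k-means following the four steps above (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: choose k distinct observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once the cluster assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned observations
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy data: two groups of 20 points in two dimensions (made up for illustration)
toy_rng = np.random.default_rng(1)
toy_X = np.vstack([toy_rng.normal(0, 0.5, size=(20, 2)),
                   toy_rng.normal(5, 0.5, size=(20, 2))])
toy_labels, toy_centroids = simple_kmeans(toy_X, k=2)
print(toy_labels)
print(toy_centroids)
```

In practice you would rely on the KMeans estimator from scikit-learn, which adds smarter initialisation and repeated runs on top of this basic loop.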

Recent applications of k-means clustering in social sciences

Garip, F. 2012. Discovering diverse mechanisms of migration: The Mexico–US Stream 1970–2000. Population and Development Review, 38(3), 393-433. Open access version.

Bail, C. A. (2008). The configuration of symbolic boundaries against immigrants in Europe. American Sociological Review, 73(1), 37-59.

Let’s get coding with scikit-learn

Scikit-learn is a simple, efficient, and widely used library for machine learning in Python.

# Import libraries for today's lab

from sklearn.preprocessing import StandardScaler  # For standardising data
from sklearn.cluster import KMeans # For performing k-means
from sklearn.decomposition import PCA # For performing PCA

# Data analysis & visualisation
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme(font_scale=1.5)
%matplotlib inline

The k-means clustering algorithm in scikit-learn

The KMeans estimator class in scikit-learn allows you to set up the algorithm parameters before fitting the estimator to the data.

Parameters of the KMeans algorithm include:

  • n_clusters — Number of clusters k to form (same as the number of centroids to generate).

  • init (‘random’ or ‘k-means++’, default=’k-means++’) Method of selection of initial centroids. ‘random’ selects n_clusters observations (rows) at random from data for the initial centroids. ‘k-means++’ selects initial cluster centers in a way that speeds up convergence.

  • n_init (default = 10 in older scikit-learn releases; ‘auto’ from version 1.4) — Number of times the k-means algorithm is run with different centroid seeds. The final result is the best output of the n_init runs, where the best output is the one with the lowest sum of squared distances of samples to their closest cluster center.

  • max_iter (int, default=300) — Maximum number of iterations of the k-means algorithm for a single run.

  • random_state (default=None) For computational reproducibility, determines random number generation for centroid initialization.

We instantiate the KMeans class with the following arguments:

kmeans = KMeans(n_clusters=3,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)

kmeans
KMeans(n_clusters=3, random_state=0)
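The printed representation above omits several of the arguments we passed. A likely reason is that recent scikit-learn versions display only the parameters that differ from their defaults. To see every parameter of an estimator, you can call its get_params() method. The sketch below uses a separate, illustrative object name (example_kmeans) so that it does not interfere with the kmeans object used in the lab.

```python
from sklearn.cluster import KMeans

# Illustrative estimator with the same arguments as in the lab
example_kmeans = KMeans(n_clusters=3,
                        init='k-means++',
                        n_init=10,
                        max_iter=300,
                        random_state=0)

# get_params() returns a dictionary with all parameters, including defaults
params = example_kmeans.get_params()
for name in ['n_clusters', 'init', 'n_init', 'max_iter', 'random_state']:
    print(name, '=', params[name])
```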

Data preprocessing

We preprocess the data into the format expected by the scikit-learn library. As part of the data preprocessing, we first remove counties with one or more NaN (Not a Number) values using the Pandas method dropna(). Although some scikit-learn estimators, such as StandardScaler, ignore NaNs when they are fitted, others, such as KMeans, raise an error if the input contains NaNs, so we remove NaNs at this stage to avoid unexpected downstream problems.

Tip

scikit-learn works on any numeric data stored as NumPy arrays, SciPy sparse matrices, or (nowadays) Pandas DataFrames. If needed, you can convert a Pandas DataFrame into a NumPy array using the Pandas method to_numpy().

# Drop NaNs from the DataFrame
mobility_trends_UK_mean_NaNdrop = mobility_trends_UK_mean.dropna()
mobility_trends_UK_mean_NaNdrop
sub_region_1             Retail_Recreation  Grocery_Pharmacy       Parks  Transit_stations  Workplaces  Residential
Aberdeen City                   -52.264192        -12.564045   18.693023        -46.984716  -43.305677    15.123043
Aberdeenshire                   -31.193622        -13.107865   17.706422        -40.653759  -38.587336    12.817768
Angus Council                   -28.943052         -7.624146   12.408537        -32.275626  -34.491266    11.363636
Antrim and Newtownabbey         -32.066059         -8.863326  -29.134328        -55.330296  -34.794760    13.557692
Ards and North Down             -29.938497         -1.697039    2.662037        -42.958333  -37.139738    13.314286
...                                    ...               ...         ...               ...         ...          ...
Windsor and Maidenhead          -45.835991        -12.453303   -2.466970        -45.020501  -44.469432    17.438961
Wokingham                       -41.587699        -17.482916   28.176471        -51.943052  -45.740175    18.969697
Worcestershire                  -38.803991        -11.399360   23.924497        -36.042028  -34.537983    12.685216
Wrexham Principal Area          -44.521640        -11.731207   -4.040146        -40.788155  -32.085153    11.788030
York                            -44.825991        -14.640625   -4.085648        -49.212581  -45.432314    14.630385

141 rows × 6 columns
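To make the two preprocessing steps concrete, here is a small sketch on a made-up DataFrame. The county names, values, and variable names (toy, toy_complete, toy_array) are invented for illustration and are not part of the lab data.

```python
import numpy as np
import pandas as pd

# Made-up data: 'County B' has a missing value in the Parks column
toy = pd.DataFrame(
    {'Parks': [18.7, np.nan, 12.4], 'Workplaces': [-43.3, -38.6, -34.5]},
    index=['County A', 'County B', 'County C'],
)

toy_complete = toy.dropna()          # drop every row that contains a NaN
toy_array = toy_complete.to_numpy()  # convert the DataFrame to a NumPy array

print(toy_complete.index.tolist())
print(toy_array.shape)
```

Note that dropna() removes the whole row, so a county with a single missing feature is excluded from the analysis entirely.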

Data standardisation

It is a good practice to standardise our input features or variables before applying the k-means algorithm. Standardisation of input features in a data set is a common requirement for many statistical and machine learning estimators. By standardising individual features, all features are converted to the same scale so that the output of the clustering procedure is not influenced by how individual features are measured.

In our example data, the six features are measured on a similar scale so one may argue that standardisation is not strictly necessary but we will perform it so that the procedure is part of your data analysis workflow.

The sklearn.preprocessing module includes StandardScaler among other methods for data scaling. The StandardScaler method calculates a standard score or z-score of a sample observation x as z = (x - M) / SD where M is the mean of the sample observations and SD is the standard deviation of the sample observations. In simple words, for each observation in a column, we subtract the mean and divide by the standard deviation of that column.
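As a quick check of the formula, the sketch below computes z-scores by hand on a tiny made-up array and compares them with the output of StandardScaler. The array values and variable names are assumptions for illustration. StandardScaler uses the population standard deviation, which matches the default behaviour of NumPy's std() (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: three observations (rows) and two features (columns)
toy_X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

# z = (x - M) / SD, computed column by column
manual_z = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

# The same computation with scikit-learn
sklearn_z = StandardScaler().fit_transform(toy_X)

print(np.allclose(manual_z, sklearn_z))
print(sklearn_z.mean(axis=0))  # column means after standardisation
print(sklearn_z.std(axis=0))   # column standard deviations after standardisation
```

After standardisation, each column has a mean of approximately 0 and a standard deviation of approximately 1, regardless of its original scale.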

# Data standardisation
scaler = StandardScaler() # Initialising the scaler using the default arguments 
mobility_trends_UK_standardised = scaler.fit_transform(mobility_trends_UK_mean_NaNdrop) # Fit to input data (continuous variable) and return the standardised variables 
mobility_trends_UK_standardised
array([[-2.49843261e+00, -6.52348780e-01,  3.08966775e-01,
        -7.74189911e-01, -1.66343960e+00,  1.26238427e+00],
       [ 1.38534410e+00, -7.69349838e-01,  2.64734313e-01,
        -1.08346777e-01, -5.94368434e-01,  6.76195967e-02],
       [ 1.80017432e+00,  4.10453600e-01,  2.72133058e-02,
         7.72802976e-01,  3.33710002e-01, -6.86019257e-01],
       [ 1.22453441e+00,  1.43848278e-01, -1.83528509e+00,
        -1.65191607e+00,  2.64945129e-01,  4.51103517e-01],
       [ 1.61669173e+00,  1.68565038e+00, -4.09753181e-01,
        -3.50724849e-01, -2.66374829e-01,  3.24952139e-01],
       [ 1.72099145e+00, -5.12048798e-02,  2.01210895e-01,
         2.24593779e+00,  2.43732268e-01, -1.67760996e+00],
       [ 1.19417468e+00,  6.64638339e-01, -4.37407370e-01,
         6.26503688e-01,  9.87718367e-01, -8.36644024e-01],
       [-2.42805112e+00, -1.97723230e+00,  4.14151998e-01,
        -9.66018370e-01, -2.10769036e+00,  1.37179090e+00],
       [-2.20770304e-01, -9.25995883e-01, -4.07036708e-01,
        -8.30180439e-01,  5.17247902e-01,  5.58668604e-01],
       [-6.56565795e-01, -8.01320694e-01, -7.94619574e-01,
        -1.06479528e+00, -1.14200107e+00,  5.12794654e-01],
       [-3.29399441e-01, -1.56082709e-01, -2.21034468e+00,
         5.25563567e-01,  1.28306102e+00, -6.12253162e-01],
       [ 2.97464445e-01,  6.15308425e-01, -1.86588500e-01,
         1.82476833e+00,  1.16680386e+00, -1.43535480e+00],
       [ 6.10266585e-01,  4.53415326e-02, -6.30872044e-01,
        -9.08579433e-02,  6.95343977e-01, -2.40603308e-01],
       [-9.13856595e-01, -7.47613266e-01,  9.54113258e-02,
        -1.03693200e+00, -1.43686181e+00,  2.11724111e+00],
       [-3.55851166e-01,  5.36895095e-01, -3.85117290e-01,
        -7.61422968e-01,  3.42120094e-01,  6.48095266e-02],
       [-1.21294121e+00, -1.54619023e+00,  7.11595012e-01,
        -1.10944829e+00, -2.02754713e+00,  1.25823898e+00],
       [-2.31626130e+00, -1.55897048e+00,  1.51525103e+00,
        -1.89824262e+00, -1.78950569e+00,  1.84198948e+00],
       [-3.63683145e-01, -1.18998945e+00,  3.19708351e-01,
        -3.70825140e-01, -1.00322735e+00,  1.56778305e+00],
       [ 9.95706001e-01, -3.70249116e-01, -7.57062960e-01,
        -9.97221469e-02,  5.17742613e-01, -1.91848217e-01],
       [-1.04546664e+00, -1.00595752e+00,  2.63832740e-01,
        -4.46138118e-01, -7.68736679e-01,  1.23955122e+00],
       [-2.37227305e+00, -1.89050022e+00, -1.02329156e+00,
        -1.88622938e+00, -2.51376143e+00,  1.38514857e+00],
       [-3.71386306e-01,  1.92530112e+00,  1.92513961e-02,
        -2.56402935e-01,  6.41420443e-01, -6.77102473e-01],
       [ 1.68890913e+00,  2.39088026e+00, -5.53193337e-01,
         9.58102855e-01,  1.08270294e+00, -9.68372902e-01],
       [ 3.96283597e-01,  1.56264428e-01,  7.32317813e-01,
         8.38992098e-01, -1.15487876e-01,  9.15094655e-01],
       [-3.69658546e-01, -3.53639079e-01, -3.76301976e-01,
        -6.11098945e-01, -6.52256419e-01, -1.13494868e+00],
       [-1.42308341e-01, -4.92511172e-01,  5.38929572e-01,
        -2.28160934e-02, -3.79211935e-01,  4.25670959e-01],
       [-5.37789118e-01, -2.84319767e-01,  6.85387095e-02,
        -2.65219848e-01, -5.07491331e-01,  2.97219359e-01],
       [ 1.03065263e-01,  2.23993460e+00,  7.25278653e-01,
         1.05334304e+00,  1.82823049e-01, -7.69498868e-01],
       [ 2.59940804e+00, -5.02292007e-01,  2.27907677e+00,
         1.54082753e+00, -3.38502377e-01, -1.33355748e+00],
       [ 5.25242280e-01,  2.09033584e-01,  6.29923549e-01,
         7.74642648e-01,  6.81364117e-01, -4.88010861e-01],
       [ 7.58613405e-01,  6.92649697e-01,  7.35710019e-01,
         1.21313144e+00,  3.98189517e-02, -7.87599139e-01],
       [ 2.71665659e-03, -4.30529363e-01, -1.40637686e+00,
        -7.89453017e-01,  6.34989196e-01, -1.61201105e-01],
       [ 8.22300251e-01,  1.01031557e+00, -6.35070060e-01,
         4.18953551e-01,  8.00335211e-01, -1.14272683e+00],
       [-4.40463747e-01,  2.62820227e-01,  7.01144550e-01,
        -1.38048239e+00, -1.07077784e-01, -1.69584637e-01],
       [-3.58763202e-01,  6.85869588e-01,  9.92414134e-01,
         1.33221204e+00,  3.33158245e-01, -1.12614076e-01],
       [ 8.21880382e-01,  1.54989730e+00, -1.00948560e+00,
         1.21882943e-01,  7.63119427e-01, -7.22340063e-01],
       [ 6.41921784e-01,  6.74930524e-01,  1.49723149e+00,
         1.74863593e+00,  5.70498252e-01, -6.82729467e-01],
       [ 1.69177080e-01,  5.20470614e-01,  2.21415871e+00,
         1.16033766e+00, -4.73634265e-02, -3.51368143e-01],
       [ 7.05156899e-01,  1.23183323e+00,  2.35433663e-01,
         1.23781431e+00,  1.30186005e+00, -1.61541359e+00],
       [-5.90249011e-01, -9.76972256e-01, -3.39483746e-01,
        -2.72454331e-01, -8.62996680e-01, -3.36788235e-01],
       [ 5.15560085e-01,  9.83360992e-01, -4.11318684e-01,
         1.50206340e+00, -5.11256932e-01,  4.70406332e-01],
       [ 2.56507205e+00,  1.39797147e+00, -2.77698616e+00,
        -2.18299439e-01, -1.42399932e+00,  1.53120784e+00],
       [ 4.99838282e-01,  7.63149869e-01,  2.01849037e+00,
         1.40407802e+00,  9.45358844e-01, -6.64492190e-01],
       [ 3.85694866e-01,  3.97205051e-01,  1.32369833e+00,
         6.71551290e-01,  6.56626909e-02, -2.75123183e-01],
       [-2.99396368e+00, -1.57241714e+00, -4.46555554e-01,
        -2.36834591e+00, -3.49176895e+00,  2.58589179e+00],
       [ 3.47341070e-01, -1.12951007e-01,  5.56196817e-01,
        -7.26491527e-01, -3.14795849e-01,  6.70782852e-01],
       [ 5.98090395e-01, -8.07893514e-01, -1.42414639e-01,
        -7.60943822e-01, -1.55561897e-02,  2.42173904e-01],
       [ 5.22094170e-01,  1.75769262e+00, -1.57646845e+00,
         3.33797358e-01,  1.25882017e+00, -1.16573187e+00],
       [ 7.53163698e-01,  8.09032865e-01,  6.29515046e-01,
         2.18665267e-01,  4.03734726e-01, -3.12646502e-01],
       [ 3.05022081e-01, -8.15244763e-01, -8.88838682e-01,
        -2.06571736e-01,  8.63543422e-02, -1.55346066e-01],
       [-2.12518578e+00, -1.45278823e+00, -6.17953788e-01,
        -2.33071952e+00, -1.74262752e+00,  1.02548409e+00],
       [-2.45784485e-01, -2.84838372e-01,  6.30476850e-01,
         5.79029488e-01,  4.65130290e-04,  4.09360789e-01],
       [-1.66019597e+00, -1.34124903e+00,  8.91555661e-01,
        -1.38223300e+00, -2.54572823e+00,  2.33390308e+00],
       [-1.52857345e-01, -5.66193815e-01,  5.13392550e-01,
        -4.71697881e-01, -2.51814353e-01,  3.10318376e-01],
       [ 5.60302217e-01,  2.38434358e-01, -2.54364769e-01,
         1.20294269e+00,  4.00907560e-01, -1.32111312e+00],
       [-3.24001935e-01, -1.17564311e-01,  8.09809636e-01,
         1.33328164e-01, -1.51161460e-01,  8.41790869e-01],
       [-6.62775145e-01, -7.09738174e-03, -2.62729318e+00,
         1.04855158e+00,  1.56863988e+00, -1.27582918e+00],
       [-1.44237369e-01, -7.63786016e-01,  3.93005769e-01,
         9.07682615e-01,  1.57147772e+00, -8.53604285e-01],
       [-5.93941903e-01, -3.35117372e-01,  4.00390409e-01,
        -8.90563516e-01, -1.01686485e+00,  1.60401873e+00],
       [ 8.52530793e-01, -3.67446106e-01,  9.47008348e-01,
         2.97729577e-01, -1.26866237e-01, -1.26461846e+00],
       [ 1.90235411e-01,  5.62379427e-01, -3.05256783e+00,
         9.11422583e-01, -1.96004116e-01, -4.55452361e-01],
       [ 8.28981878e-01,  9.09505319e-01, -6.08987614e-01,
         1.25256532e+00,  1.13972095e+00, -1.36631435e+00],
       [ 1.18212768e+00,  1.36513589e+00,  9.83704547e-01,
         1.17744189e+00,  1.26723026e+00, -1.29542248e+00],
       [ 3.06276500e-01,  2.02249301e-02,  5.68865448e-01,
         1.83604344e-01,  8.70438543e-02,  3.06588999e-01],
       [-2.08471842e-01, -3.34396987e-01, -1.38502311e-01,
         5.93831416e-02,  1.28547127e+00, -1.27659199e+00],
       [ 3.37161414e-01,  5.49291405e-02,  1.25509087e+00,
         6.53068302e-01,  7.21842957e-01, -2.35225508e-01],
       [-1.56546558e+00, -1.26407130e+00,  8.61239370e-01,
        -9.11996981e-01,  2.52135944e-01, -3.29401058e-01],
       [-1.93640219e-02,  3.97907377e-02,  2.46485307e-01,
         3.62153701e-01, -1.52298951e-01,  6.00762672e-01],
       [ 1.05424854e+00,  1.25151786e+00,  8.54668093e-01,
         1.02522369e+00,  1.46140971e+00, -1.08937685e+00],
       [ 6.92980708e-01, -6.78677124e-02, -2.04653117e+00,
         4.22242896e-01, -5.35992498e-01,  1.13095801e+00],
       [ 3.01663132e-01, -3.57997033e-01, -1.08715575e+00,
        -2.45810527e+00,  1.27910333e+00, -2.39033293e-01],
       [-2.29110681e-01,  2.81350711e-01,  5.17041028e-01,
        -7.97753349e-02, -1.57057578e-01,  1.68782233e-01],
       [ 5.92613830e-02, -3.83973456e-02,  2.87276396e-01,
        -3.10889771e-01, -8.51279709e-02, -2.22682419e-01],
       [ 1.14014081e+00,  5.78552176e-01, -3.17425787e+00,
         4.09630578e-01,  1.99544532e+00, -5.57521500e-01],
       [ 1.29675182e+00,  1.48373605e+00, -9.29224750e-01,
         8.54976539e-01,  5.88981043e-01, -9.20637256e-02],
       [-6.22047887e-01,  7.27861981e-02, -1.00213645e+00,
        -7.00571408e-01,  1.30087063e+00, -1.23665020e+00],
       [ 6.80384649e-01,  1.03776024e+00, -2.69743257e-01,
         8.71745555e-01, -7.06173192e-01,  6.44597166e-01],
       [-2.91893589e-01, -1.40131998e+00,  2.48892653e-01,
        -9.76560624e-01, -2.91605106e-01,  8.95726935e-01],
       [-1.99240161e-01, -1.65671781e+00, -8.12095315e-01,
         5.54312336e-01, -5.35796396e-01,  7.76500483e-02],
       [ 7.77794175e-01, -5.41778275e-01, -1.42855415e-01,
         4.29139291e-01,  1.02772237e+00, -1.96805814e+00],
       [ 7.86611416e-01,  3.47739497e+00, -2.73195848e-01,
        -9.09958271e-01,  5.87496909e-01, -4.65916044e-01],
       [-4.85446486e-01, -3.88872282e-01, -2.60831897e-01,
        -4.80882902e-01,  2.00630253e-02,  1.90660379e-01],
       [ 1.26484180e+00,  2.13995761e+00, -2.69873047e-01,
         8.35562329e-01,  9.85739521e-01, -2.62201557e-01],
       [ 8.94434626e-01,  5.46505355e-02,  1.01002514e+00,
         7.82274452e-01,  8.96511417e-01, -6.76428844e-01],
       [ 1.49954837e+00,  2.15269977e+00, -9.28964168e-01,
         1.59070544e+00,  4.01980164e-01, -6.13444729e-01],
       [ 5.13696797e-01,  3.50173353e-01,  9.24042593e-03,
         1.85136094e+00,  1.91233382e+00, -1.93083044e+00],
       [ 7.36347950e-01, -4.73803288e-02, -2.54659347e+00,
        -2.20970456e-01,  2.85776660e-01, -1.20527586e-01],
       [ 8.00886950e-01,  7.82426834e-01, -4.65740623e-01,
         1.77421841e+00,  1.65310509e+00, -1.40009099e+00],
       [-6.15051153e-02, -8.22065301e-02,  8.94198917e-01,
        -1.61512916e+00,  4.03461894e-02,  4.08362949e-01],
       [ 5.86789425e-01,  1.91762545e-01,  7.40288246e-01,
         1.19873876e+00, -1.18060551e-02, -3.38860947e-01],
       [-9.76734296e-02, -1.59719404e-01, -4.41908685e-01,
         5.52217031e-01,  7.95549935e-01, -1.04973548e-01],
       [ 5.18889015e-01,  1.11861653e+00,  1.18620210e+00,
         1.38395388e+00,  1.11788513e-01, -6.86426305e-01],
       [-3.07381021e+00, -1.41012986e+00,  1.32453741e+00,
        -1.64619110e+00, -8.47603754e-01, -1.58504581e-01],
       [-3.11327625e-01, -1.84680242e-01,  1.76051523e-01,
         8.42576746e-01,  4.11249464e-01, -1.54355658e-01],
       [-5.39368701e-01, -1.07453292e+00,  1.35394013e-01,
        -5.07297040e-01, -1.04735587e+00,  1.45717321e+00],
       [ 9.50780056e-01,  1.62389988e+00,  5.58377041e-01,
         4.40787045e-01,  1.09437138e+00, -1.65813833e+00],
       [-7.00563323e-01, -1.06126659e+00,  2.41948107e-01,
         6.67811371e-02, -5.68148734e-01, -1.25443886e-01],
       [ 4.45223431e-03, -2.16551450e-01, -1.08883610e+00,
        -9.03250225e-01,  1.31373313e+00, -7.87916212e-01],
       [-9.09620586e-01, -9.85557750e-01,  8.08949859e-01,
         5.94250764e-01,  1.44731093e-02, -7.47443600e-01],
       [-8.81992315e-01, -1.08356378e+00,  1.32924848e-01,
         3.51707796e-01, -7.88789982e-01, -2.95934825e-02],
       [ 2.81509437e-01, -4.00144198e-01,  2.19292245e-01,
         2.40550009e-01,  1.02778998e+00, -6.24031652e-01],
       [-2.00007062e+00, -6.81404341e-01,  3.38562793e-01,
        -2.40642564e+00, -1.90238516e+00,  2.54365967e+00],
       [-6.84608315e-01,  4.77595014e-01, -5.88641094e-01,
         1.12593368e+00,  1.23111634e+00, -1.15364970e+00],
       [-5.65920412e-01,  1.74053971e+00, -3.08872935e-01,
        -1.96419271e+00, -6.61649173e-01,  6.29690687e-01],
       [-1.70171697e-02,  3.01357950e-01, -7.23110275e-01,
        -5.58983723e-01,  6.18663722e-01, -2.29191456e-01],
       [ 1.16895742e+00, -1.79606708e-01, -2.69230026e-02,
         4.22307574e-01,  4.81372089e-01, -9.08216462e-01],
       [ 5.63775182e-01,  1.84690491e-01,  6.11529009e-01,
         7.62043821e-01,  1.19309416e+00, -4.83542979e-01],
       [ 6.72407145e-01, -2.28794512e+00,  1.84117501e-02,
        -1.27123446e+00, -6.56702060e-01,  1.03276673e+00],
       [ 2.39621118e-01,  1.32769873e-01,  8.16376482e-01,
         2.02000727e+00,  9.17896123e-01, -6.86019257e-01],
       [-6.40424579e-02,  1.60511111e-01, -4.94346348e-01,
         3.90204782e-01, -1.06169175e-01, -9.02330206e-01],
       [-3.44779140e-02, -3.92820410e-01,  1.57764030e+00,
        -5.98289922e-01, -6.02163363e-01,  1.14240254e+00],
       [ 1.35876745e+00,  6.57410172e-01,  1.25160975e+00,
         1.53073997e-01,  2.61202063e-01,  2.72646009e-01],
       [-1.19568095e-01, -3.69939425e-02,  4.71556949e-01,
        -2.50525319e-01,  5.94639517e-01, -5.02084688e-01],
       [-2.01711838e+00, -6.05667704e-01, -1.68801361e-01,
        -1.03313936e+00, -1.12627968e+00,  1.71040382e-01],
       [-2.76495992e-01,  7.27537503e-01,  9.18542293e-01,
        -1.05416804e+00, -6.20093422e-01,  6.96458798e-01],
       [ 8.07199790e-02, -2.06632216e-01,  3.50041182e-01,
         6.26161277e-01,  5.21808402e-01, -1.41119953e-01],
       [-1.36815447e+00,  2.07559109e-01, -1.04228246e-01,
        -1.18930047e+00, -1.37792468e+00,  2.53912011e-01],
       [ 2.82644883e-01,  9.52048010e-01,  7.68737946e-02,
         3.32946818e-01,  9.62488089e-01, -6.10080854e-01],
       [-1.85088554e-01, -1.15644752e-02,  2.37115889e-01,
        -7.70526744e-01,  1.14293860e+00, -1.06548458e+00],
       [ 4.02548874e-02, -4.60119564e-02,  6.76374959e-01,
         1.03071820e+00,  4.51568133e-01, -3.30061478e-01],
       [-7.64171885e-01, -9.56731742e-01,  5.35619067e-01,
        -7.50653714e-01, -1.52038368e+00,  2.15529889e+00],
       [-5.39634731e-01, -8.77203921e-01,  6.54206456e-01,
        -9.08520833e-01, -5.60728064e-01, -2.37557699e-03],
       [-5.47754972e-01, -1.18390329e+00, -8.38029318e-01,
        -8.14368616e-01,  7.99230950e-02,  3.31766365e-01],
       [-3.10429358e-01, -8.51282519e-01, -6.34055819e-01,
         7.23385634e-01,  5.22168573e-03,  3.19110813e-02],
       [ 4.36860835e-01,  2.05419303e+00,  1.47127587e+00,
         1.16690068e+00,  7.50256933e-01, -1.24017495e+00],
       [-5.78514417e-01, -1.48332935e-01,  3.43385103e-01,
        -1.16474087e-01, -4.95289986e-01,  4.79077838e-02],
       [ 1.51172457e+00,  7.85857417e-01,  1.46377429e-01,
        -4.38718042e-01, -5.25603560e-01,  8.46975243e-01],
       [-2.78061714e-01, -7.02551227e-01,  9.28798884e-01,
        -2.98088649e-01, -2.30260902e-01,  2.15281599e-01],
       [-4.86692947e-01,  2.77971037e-01,  2.41681787e-01,
        -2.92355291e-02, -1.33518433e-01,  6.74621059e-01],
       [-9.75157417e-01, -1.28278424e+00, -5.40121013e-02,
        -3.07301323e-01, -1.16279174e+00,  2.01398036e+00],
       [ 7.40845734e-01,  7.44690419e-01, -3.23029934e+00,
         3.99308559e-01,  6.40923328e-02,  2.68639483e-02],
       [ 1.15346895e-01,  1.52245264e+00, -1.01520856e+00,
        -3.66618467e-01, -3.31676723e-01,  3.98823052e-01],
       [-4.54069436e-01, -4.34319587e-01,  2.57671043e-01,
        -6.60323197e-01,  2.36916642e-02, -5.10287756e-02],
       [ 3.72745246e-02, -6.35042192e-02,  1.31879891e+00,
        -9.72265570e-01, -1.14948999e-01,  4.00101180e-01],
       [-3.41748013e-01, -2.66390886e-01,  3.46414332e-01,
        -4.12779885e-01,  3.25197532e-02,  1.01563370e-01],
       [ 3.72302411e-01, -9.61753268e-01,  1.95991553e-01,
         3.91618263e-01,  7.64010938e-02,  1.62065935e-01],
       [-1.31357154e+00, -6.28523021e-01, -6.39702852e-01,
        -5.67608354e-01, -1.92712073e+00,  2.46266529e+00],
       [-5.30516522e-01, -1.71062697e+00,  7.34139797e-01,
        -1.29567091e+00, -2.21504272e+00,  3.25600636e+00],
       [-1.74169030e-02, -4.01770804e-01,  5.43510351e-01,
         3.76680856e-01,  3.23125151e-01, -1.07868607e-03],
       [-1.07130734e+00, -4.73166612e-01, -7.10233303e-01,
        -1.22481589e-01,  8.78881876e-01, -4.66066987e-01],
       [-1.12740604e+00, -1.09911786e+00, -7.12273309e-01,
        -1.00850024e+00, -2.14528842e+00,  1.00705281e+00]])

We now fit the KMeans estimator we already instantiated (kmeans) to our standardised data. Because we set n_init=10 and max_iter=300, this performs 10 runs of the k-means algorithm (each with a different centroid seed) with a maximum of 300 iterations per run:

kmeans.fit(mobility_trends_UK_standardised)
KMeans(n_clusters=3, random_state=0)

You can access an estimator’s learned attributes, whose names end with an underscore ‘_’. For example, the attribute labels_ contains the cluster to which each observation or sample (in our example, county) is assigned. To access the cluster labels, type the name of your k-means object (which we called ‘kmeans’) followed by a ‘.’ and the labels_ attribute.

kmeans.labels_
array([2, 0, 1, 0, 1, 1, 1, 2, 0, 0, 0, 1, 0, 2, 0, 2, 2, 2, 0, 2, 2, 1,
       1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       2, 0, 0, 1, 1, 0, 2, 0, 2, 0, 1, 0, 1, 1, 2, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 0, 1, 2, 0, 2, 1, 0, 0, 0, 0, 1, 2, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 2, 0, 0, 2, 1, 0, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0,
       0, 0, 0, 0, 2, 2, 0, 0, 2], dtype=int32)

The cluster labels indicate that, for example, the first county, Aberdeen City, is assigned to cluster 2, the second county, Aberdeenshire, to cluster 0, and so on.
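A common first check is how many observations fall into each cluster. The sketch below counts cluster sizes with NumPy, using only the first eight labels from the output above so that it runs on its own. The variable names are illustrative; with the fitted estimator you could pass kmeans.labels_ instead.

```python
import numpy as np

# First eight cluster labels copied from the output above
example_labels = np.array([2, 0, 1, 0, 1, 1, 1, 2])

# np.unique with return_counts=True gives each label and how often it occurs
cluster_ids, cluster_sizes = np.unique(example_labels, return_counts=True)
for cluster_id, size in zip(cluster_ids, cluster_sizes):
    print(f"cluster {cluster_id}: {size} observations")
```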

You can also access the coordinates of cluster centers using the cluster_centers_ attribute. This will show the means of the points in each cluster for each of the six variables.

kmeans.cluster_centers_
array([[-0.03601477, -0.25426318, -0.22566658, -0.31474479, -0.00550488,
         0.10455302],
       [ 0.69076554,  0.77912648,  0.20135277,  0.88575717,  0.71213695,
        -0.78596621],
       [-1.55607267, -1.11419698,  0.19192262, -1.18963518, -1.699088  ,
         1.57980498]])
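Because we clustered standardised data, the cluster centers are expressed in z-scores rather than in the original units. The sketch below, on made-up data, illustrates two points under stated assumptions: each center should match the mean of the points assigned to that cluster, and the scaler's inverse_transform() method can map the centers back to the original scale. All variable names and values here are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data: two well-separated groups of 15 observations, two features
rng = np.random.default_rng(0)
raw = np.vstack([rng.normal(-40, 3, size=(15, 2)),
                 rng.normal(-20, 3, size=(15, 2))])

toy_scaler = StandardScaler()
standardised = toy_scaler.fit_transform(raw)
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(standardised)

# Each center should match the mean of the standardised points with that label
for j in range(2):
    member_mean = standardised[toy_kmeans.labels_ == j].mean(axis=0)
    print(np.allclose(member_mean, toy_kmeans.cluster_centers_[j], atol=1e-6))

# Map the centers back to the original units of the data
centres_original_units = toy_scaler.inverse_transform(toy_kmeans.cluster_centers_)
print(centres_original_units)
```

Centers in the original units are usually easier to interpret, for example as an average percentage change in mobility.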

You can include the cluster assignment as a column in your original DataFrame. Let’s name the new column ‘clusters’.

# Work on an explicit copy so that pandas does not raise a SettingWithCopyWarning
mobility_trends_UK_mean_NaNdrop = mobility_trends_UK_mean_NaNdrop.copy()
mobility_trends_UK_mean_NaNdrop['clusters'] = kmeans.labels_
mobility_trends_UK_mean_NaNdrop
sub_region_1             Retail_Recreation  Grocery_Pharmacy       Parks  Transit_stations  Workplaces  Residential  clusters
Aberdeen City                   -52.264192        -12.564045   18.693023        -46.984716  -43.305677    15.123043         2
Aberdeenshire                   -31.193622        -13.107865   17.706422        -40.653759  -38.587336    12.817768         0
Angus Council                   -28.943052         -7.624146   12.408537        -32.275626  -34.491266    11.363636         1
Antrim and Newtownabbey         -32.066059         -8.863326  -29.134328        -55.330296  -34.794760    13.557692         0
Ards and North Down             -29.938497         -1.697039    2.662037        -42.958333  -37.139738    13.314286         1
...                                    ...               ...         ...               ...         ...          ...       ...
Windsor and Maidenhead          -45.835991        -12.453303   -2.466970        -45.020501  -44.469432    17.438961         2
Wokingham                       -41.587699        -17.482916   28.176471        -51.943052  -45.740175    18.969697         2
Worcestershire                  -38.803991        -11.399360   23.924497        -36.042028  -34.537983    12.685216         0
Wrexham Principal Area          -44.521640        -11.731207   -4.040146        -40.788155  -32.085153    11.788030         0
York                            -44.825991        -14.640625   -4.085648        -49.212581  -45.432314    14.630385         2

141 rows × 7 columns
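Once the cluster labels sit in a DataFrame column, one way to describe each cluster is to average the features within it. The sketch below does this with the Pandas groupby() method on four rows, using rounded values for two of the six features taken from the table above. The names toy_df and cluster_profiles are illustrative.

```python
import pandas as pd

# Four counties with rounded values from the table above
toy_df = pd.DataFrame(
    {'Workplaces': [-43.3, -38.6, -34.5, -34.8],
     'Residential': [15.1, 12.8, 11.4, 13.6],
     'clusters': [2, 0, 1, 0]},
    index=['Aberdeen City', 'Aberdeenshire', 'Angus Council', 'Antrim and Newtownabbey'],
)

# Mean of each feature within each cluster
cluster_profiles = toy_df.groupby('clusters').mean()
print(cluster_profiles)
```

Applied to the full DataFrame, the same call would give the average mobility change per cluster in the original units.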

Choosing the optimal number of clusters

In the example above, our choice of the number of clusters, k, was arbitrary. Let’s find a more informative method of choosing the optimal k for our data. One such method is the Elbow method for choosing optimal k.

Using the Elbow method, we run the k-means algorithm with various values of k and plot each value of k against the sum of squared distances between each data point (county in the UK) and its cluster centroid. For k = 1, all data points are assigned to the same cluster, which gives the largest sum of squared distances. As k increases towards the number of data points, the sum of squared distances approaches zero because each data point ends up in its own cluster. The ‘elbow’ is the value of k beyond which adding further clusters reduces the sum of squared distances only marginally.
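In scikit-learn, the sum of squared distances of a fitted KMeans estimator is stored in its inertia_ attribute. The sketch below, on made-up data, computes the quantity by hand to show what it measures, and then checks the two extreme cases described above (k = 1 and k equal to the number of data points). Variable names and data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: 12 observations with two features
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(12, 2))

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(toy_X)

# Sum of squared distances between each point and the center of its cluster
assigned_centres = toy_kmeans.cluster_centers_[toy_kmeans.labels_]
manual_ssd = ((toy_X - assigned_centres) ** 2).sum()
print(np.isclose(manual_ssd, toy_kmeans.inertia_))

# Extreme cases: a single cluster, and one cluster per data point
one_cluster = KMeans(n_clusters=1, n_init=10, random_state=0).fit(toy_X)
own_cluster = KMeans(n_clusters=len(toy_X), n_init=10, random_state=0).fit(toy_X)
print(one_cluster.inertia_, toy_kmeans.inertia_, own_cluster.inertia_)
```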

We perform multiple runs of the k-means clustering algorithm using a for loop.

Performing a for loop

A for loop is used to repeatedly execute a block of code, and is a perfect fit for repeatedly running the k-means algorithm. The for loop will iterate over a sequence of k values, and for each value of k will fit the k-means algorithm.

Let’s first look at a simple example of a for loop:

for number in range (1,4):
    print(number)
1
2
3

In this example (and in for loops in general), there are two parts:

  • for loop statement, which in this example is ‘for number in range (1,4):’

    • number is the variable name; we could have specified a different variable name;

    • range (1,4) specifies the set of values to loop or iterate over; range (1,4) is the range of numbers 1, 2, 3. The first argument (1) is the starting point, and the second argument (4) is the endpoint (not included in the range)

    • the word ‘in’ connects the two components in the for loop statement

    • the for loop statement ends in a colon ‘:’.

  • the loop body, which contains the code to be executed at each iteration of the for loop. Each line in the loop body is indented four spaces, and this indentation is how the interpreter knows whether a line is part of the loop. In our example, ‘print(number)’ is the loop body.

At each iteration of the for loop, the variable number is assigned the next number in the range from 1 to 3, and then the value of number is printed. The loop runs once for each number in the sequence from 1 to 3, so the loop body ‘print(number)’ executes 3 times.
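A loop body can also store one result per iteration instead of printing it. The sketch below is a small illustration of that pattern with an arbitrary calculation (squaring the number); the variable name squares is hypothetical.

```python
squares = []                      # start with an empty list
for number in range(1, 4):        # number takes the values 1, 2, 3 in turn
    squares.append(number ** 2)   # store one result per iteration
print(squares)                    # prints [1, 4, 9]
```

Collecting results in a list this way is useful whenever we want to compare the outcomes of repeated runs afterwards.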

This loop description draws on Real Python’s book Python Basics (pages 153–154) and on Kaggle’s Python tutorial.

Choosing k via for loop

We are now ready to apply the for loop to the k-means algorithm. In the code below, the for loop statement is ‘for k in K:’ where k is the variable name and K is the set of values ranging from 1 to 30. The loop body contains three lines of code that initialise the KMeans estimator, fit it to the data, and store the output. Each of the three lines in the loop body is indented four spaces. The loop will run 30 times, so all three lines will be executed 30 times.

# Run the k-means algorithm for values of k between 1 and 30

Sum_of_squared_distances = [] # Initialise a list

K = range(1,31) # range with a starting point 1 and endpoint 31, which is not included in the range
for k in K: # a for loop iterating over values of k ranging from 1 to 30
    kmeans = KMeans(n_clusters=k) # Initialise the KMeans estimator for a value of k
    kmeans.fit(mobility_trends_UK_standardised) # Fit the KMeans estimator to the data using the fit() method
    Sum_of_squared_distances.append(kmeans.inertia_) # Store the sum of squared distances (kmeans.inertia_) for each run using the list append() method
Sum_of_squared_distances
[845.9999999999998,
 546.9704597354307,
 439.3738733064078,
 367.85346505386616,
 332.54960654461104,
 306.1505168140231,
 287.190769961636,
 267.8459516425577,
 251.83066180489777,
 236.16710276426585,
 231.20748382284657,
 215.52703771662337,
 208.86030621637417,
 205.19668098978758,
 194.09347668977807,
 189.5736347321809,
 179.25001090494993,
 172.4717846005316,
 169.82837203694214,
 157.5266686276365,
 154.18453638634855,
 151.5352698509692,
 143.7723393526196,
 139.45624944572606,
 136.95912549105273,
 134.6455878317143,
 123.68017150386541,
 123.08064873633516,
 118.242380138667,
 117.53912607494914]
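Because k-means starts from randomly chosen centroids, rerunning a loop like the one above without a fixed seed can give slightly different values. The following is a sketch of one way to make such a loop repeatable, assuming synthetic data from make_blobs as a stand-in for the standardised mobility data; the helper name elbow_values and the choice of seed are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the standardised mobility data (assumption)
X, _ = make_blobs(n_samples=150, centers=4, n_features=6, random_state=0)

def elbow_values(data, k_values, seed=0):
    """Return the sum of squared distances (inertia_) for each k in k_values."""
    values = []
    for k in k_values:
        # Fixing random_state makes the centroid initialisation repeatable
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed)
        kmeans.fit(data)
        values.append(kmeans.inertia_)
    return values

first_run = elbow_values(X, range(1, 11))
second_run = elbow_values(X, range(1, 11))

# With the same seed, both runs should give matching values
print(np.allclose(first_run, second_run))
```

The same random_state argument is used later in this lab when the algorithm is rerun with a chosen k.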

Elbow plot

Let’s plot k against the sum of squared distances. The plot below shows how the sum of squared distances varies with values of k between 1 and 30.

# Plot size
plt.figure(figsize=(8.2,5.8))

# Generate the plot
grid = sns.lineplot(x= K, y = Sum_of_squared_distances)   

# Add x and y labels
labels = grid.set(xlabel='Number of clusters, k', ylabel='Total squared distances')
[Figure: line plot of total squared distances against the number of clusters, k]

For our data set, the elbow of the curve (where the curve “bends”) is not clear-cut, but the total squared distances decrease more slowly after k = 4. So we rerun our k-means algorithm with k = 4.
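When the bend is hard to see, it can help to look at how much each additional cluster reduces the sum of squared distances. The sketch below uses the first eight values of Sum_of_squared_distances copied from the output above (k = 1 to 8); the variable names values and drops are illustrative.

```python
# First eight values of Sum_of_squared_distances, copied from the output above (k = 1..8)
values = [845.9999999999998, 546.9704597354307, 439.3738733064078,
          367.85346505386616, 332.54960654461104, 306.1505168140231,
          287.190769961636, 267.8459516425577]

# drops[i] is the reduction gained by moving from k = i + 1 to k = i + 2 clusters
drops = [values[i - 1] - values[i] for i in range(1, len(values))]

for k, drop in enumerate(drops, start=2):
    share = drop / values[k - 2]   # reduction as a share of the previous value
    print(f"k = {k}: drop = {drop:.1f} ({share:.0%} of the previous value)")
```

For these values, adding a fourth cluster lowers the total by about 72, whereas adding a fifth lowers it by only about 35, which is consistent with choosing k = 4.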

k = 4
kmeans_k4 = KMeans(n_clusters=k,
       init = 'k-means++',
       n_init=10,
       max_iter=300,
       random_state=0)

kmeans_k4.fit_transform(mobility_trends_UK_standardised)
array([[4.98019842, 3.0442409 , 1.17884131, 4.67341584],
       [2.50277416, 1.69943562, 3.66775794, 2.73598691],
       [1.28206283, 2.58234939, 5.17044226, 2.42380991],
       [3.76026654, 2.92111313, 4.33233442, 1.78038984],
       [2.49674367, 2.82893433, 4.75380754, 2.32502142],
       [2.09484569, 3.69870745, 6.16908843, 3.68843192],
       [1.08207851, 2.54667097, 5.24220245, 1.84340667],
       [5.76541539, 3.67063154, 1.34714234, 5.40801872],
       [3.04925128, 1.23501207, 2.87424362, 2.25584949],
       [3.83484031, 1.77771343, 1.82885527, 2.81161473],
       [3.0454869 , 3.02251859, 4.98335598, 1.60682124],
       [1.3943672 , 3.1715135 , 5.74118565, 2.90549834],
       [1.71855819, 1.54490594, 4.11935051, 1.16579998],
       [4.66724959, 2.6121972 , 0.96801755, 4.09864072],
       [2.35197866, 1.29542401, 3.33837074, 1.7252348 ],
       [4.97181712, 2.85888013, 0.89002067, 4.73894702],
       [6.05874052, 3.97906477, 1.76237025, 5.91882854],
       [3.87361422, 1.86522486, 1.61204309, 3.62856927],
       [2.04218298, 1.70022883, 4.14808509, 1.3366287 ],
       [3.77750313, 1.65788472, 1.34600128, 3.51598659],
       [6.42808864, 4.28060293, 1.98302866, 5.39177282],
       [2.03292741, 2.58797771, 4.70259283, 2.50180332],
       [2.19574476, 4.01256747, 6.49730877, 2.95433222],
       [2.0268262 , 1.66911012, 3.57529747, 2.97899773],
       [2.6911159 , 1.55173316, 3.33306976, 2.38613887],
       [2.40566884, 0.58282437, 2.63461603, 2.77181923],
       [2.59703968, 0.57017589, 2.37913711, 2.492484  ],
       [1.7146001 , 3.16630357, 5.32721082, 3.38060784],
       [3.22120313, 4.20617388, 6.3081662 , 5.11920602],
       [0.69937813, 1.79956028, 4.48056446, 2.56837667],
       [0.76942933, 2.35043744, 4.82906392, 2.95524105],
       [2.92461696, 1.88448925, 3.75379157, 1.21041823],
       [1.25825966, 2.59170928, 5.20586034, 1.70449921],
       [2.83121999, 1.42101339, 3.00522044, 2.93196451],
       [1.47825252, 2.17346662, 4.32071716, 3.25635459],
       [1.83303609, 2.82438455, 5.22282087, 1.49593282],
       [1.36677855, 2.97186609, 5.3515894 , 3.79845415],
       [2.08823966, 2.71294803, 4.66026391, 4.21983017],
       [1.18923238, 3.30372506, 5.97716757, 3.01576586],
       [3.03678872, 1.2617642 , 2.53921989, 2.67668344],
       [2.01990006, 2.49888413, 4.35227079, 2.55336364],
       [5.02503645, 4.82617507, 5.75593364, 3.62355982],
       [1.7143735 , 3.12562875, 5.46304566, 4.12143408],
       [1.33267908, 1.82902543, 4.16591312, 3.20194011],
       [7.71136196, 5.59485269, 2.88104587, 6.84729299],
       [2.60893044, 0.99146227, 2.77763588, 2.69490919],
       [2.68101589, 1.07306425, 3.10698154, 2.19919085],
       [2.41543587, 3.51050286, 5.85677167, 1.8654938 ],
       [0.94600529, 1.80420465, 4.38285512, 2.42538105],
       [2.51854261, 1.32881044, 3.44761633, 1.58513121],
       [5.80593777, 3.62060256, 1.64489342, 4.80691594],
       [2.00822647, 1.01882979, 3.15968603, 2.8134998 ],
       [5.9759439 , 3.92163193, 1.36910125, 5.6718435 ],
       [2.55655324, 0.44858371, 2.56793592, 2.72081263],
       [1.06062584, 2.41712595, 5.00575296, 2.32954646],
       [2.4424392 , 1.05811991, 2.75391227, 3.01774311],
       [3.55438435, 3.88421469, 5.81496343, 2.49511798],
       [1.94418067, 2.34259822, 4.81386788, 3.00438179],
       [3.84927487, 1.86568904, 1.45999249, 3.53921001],
       [1.6844009 , 1.99553843, 4.43892351, 3.10674291],
       [3.62219458, 3.67354291, 5.21566832, 1.95494218],
       [1.30749224, 3.10899885, 5.75570257, 2.31462874],
       [1.27004034, 3.42670878, 6.0778929 , 3.48823872],
       [1.6863795 , 0.91652296, 3.40506625, 2.46101685],
       [1.91209642, 2.0633706 , 4.5999936 , 2.31625751],
       [1.3135918 , 1.82430951, 4.3165167 , 3.12307202],
       [3.63063814, 1.95957693, 2.82842023, 3.80781312],
       [2.0014794 , 0.92186697, 3.07494181, 2.40231794],
       [1.1402732 , 3.23063247, 5.9049549 , 3.26336196],
       [3.48302201, 2.75737866, 3.91358239, 1.94228796],
       [3.98185504, 2.95095095, 4.40974771, 2.57738253],
       [1.93667153, 0.77291701, 3.06834082, 2.495929  ],
       [1.87424585, 0.57624753, 3.22517637, 2.20411276],
       [3.88989898, 4.42739318, 6.49934842, 2.26240291],
       [1.79108563, 2.98125545, 5.32500489, 1.85622777],
       [2.7296179 , 2.44347661, 4.56929573, 1.95522016],
       [2.12958996, 2.22874027, 3.99444372, 2.43885793],
       [3.61797778, 1.44440059, 2.04406595, 3.22609704],
       [3.22893117, 1.88058909, 3.12960781, 2.75245031],
       [1.92984639, 2.72078902, 5.34134181, 2.68507291],
       [3.39107419, 4.17153512, 6.02319556, 3.45529816],
       [2.55973632, 0.60251384, 2.69533964, 2.08148303],
       [1.76849821, 3.3778043 , 5.78472538, 2.60820977],
       [0.98610299, 2.21134746, 4.8740781 , 3.00038842],
       [2.22328097, 3.8744077 , 6.18230922, 2.81650734],
       [1.99493982, 3.76046541, 6.40907833, 3.46958427],
       [3.36370495, 2.96534683, 4.6542336 , 1.02651157],
       [1.66701529, 3.55362673, 6.20373824, 2.88395952],
       [3.14665538, 1.55682324, 2.89324754, 3.16482352],
       [1.08385587, 1.92415402, 4.33942652, 2.8975586 ],
       [1.67506471, 1.44931971, 3.94150278, 1.73564052],
       [1.1407652 , 2.70283698, 5.0316948 , 3.43942562],
       [5.39490432, 3.61543037, 2.76208411, 5.44495369],
       [1.55375355, 1.31316822, 3.74224079, 2.40626158],
       [3.88017446, 1.79182997, 1.3942146 , 3.50100757],
       [1.42686889, 3.28063201, 5.90438335, 3.09172385],
       [2.82729008, 1.05821929, 2.55003524, 2.9860587 ],
       [2.73007788, 2.24759682, 4.43244397, 1.5756055 ],
       [2.50605856, 1.68796818, 3.51422789, 3.41754967],
       [2.99598627, 1.36473969, 2.50066942, 3.13992529],
       [1.47081719, 1.55652587, 4.26913546, 2.25555811],
       [6.190376  , 4.09346965, 1.69022827, 5.50233109],
       [1.84950651, 2.67812943, 5.03212308, 2.42603285],
       [3.92079471, 2.87482851, 3.45827343, 3.18166716],
       [2.13558708, 1.46007277, 3.77517354, 1.19149612],
       [1.26208424, 1.98343159, 4.67869188, 2.12532522],
       [0.87846253, 2.07385831, 4.77368898, 2.62447   ],
       [4.40248846, 2.52951303, 2.78682449, 3.81449371],
       [1.40440313, 2.82103112, 5.23963312, 3.45032116],
       [1.62830971, 1.54419626, 3.93907093, 1.78691331],
       [3.31216518, 1.79302286, 2.54952976, 3.8891256 ],
       [1.76433876, 2.22353616, 4.48465064, 3.15499511],
       [1.67211354, 1.06678543, 3.71207287, 2.34508414],
       [4.19349492, 2.25652185, 1.71420681, 3.68821273],
       [3.01783348, 1.70086663, 2.74229095, 3.24891968],
       [1.35494269, 1.21226765, 3.83651479, 2.35989753],
       [3.82973598, 2.05957292, 1.93163038, 3.33852369],
       [0.90029761, 2.04534464, 4.67561013, 2.03701792],
       [2.145379  , 1.85597145, 4.2834288 , 2.42967178],
       [1.18199154, 1.64407468, 4.13846056, 2.78213607],
       [4.63072263, 2.6367416 , 1.15146107, 4.34026371],
       [3.13097095, 1.07532803, 2.27622541, 3.19911697],
       [3.39492703, 1.48101349, 2.63546399, 2.32985546],
       [2.41159682, 1.41314949, 3.35608214, 2.18124445],
       [1.77958759, 3.57646421, 5.92891749, 3.96984317],
       [2.37172928, 0.64220666, 2.61896312, 2.63985528],
       [2.60841689, 2.22451609, 3.93721255, 2.61863099],
       [2.58808846, 0.80246426, 2.67581449, 3.1174236 ],
       [2.32421824, 0.93976297, 2.77508092, 2.48136785],
       [4.46628193, 2.45877596, 1.29653502, 3.98220566],
       [3.81749023, 3.80663775, 5.33742191, 1.76575585],
       [2.66865907, 2.31431194, 3.89160875, 1.73626814],
       [2.50649625, 0.52044709, 2.75476441, 2.47182468],
       [2.76581658, 1.38529442, 2.97674449, 3.32630718],
       [2.27104667, 0.31404422, 2.82920397, 2.42793758],
       [2.15452406, 1.0716822 , 3.37706886, 2.51805017],
       [5.16575567, 3.27542673, 1.48480488, 4.36993203],
       [6.11821419, 4.14480923, 2.18828643, 5.70489708],
       [1.70759903, 0.90089965, 3.46994405, 2.55446285],
       [2.66925835, 1.74173047, 3.66471558, 2.1397296 ],
       [4.79422827, 2.75221174, 1.24960093, 3.93735977]])
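The array returned by fit_transform() has one row per observation and one column per cluster; each entry is the distance from that observation to that cluster's centre. For example, the smallest value in the first row above is in the third column (index 2). The sketch below illustrates the link between these distances and the cluster assignment, assuming synthetic data from make_blobs rather than the mobility data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the standardised mobility data (assumption)
X, _ = make_blobs(n_samples=60, centers=4, n_features=6, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
distances = kmeans.fit_transform(X)      # distances to each of the 4 cluster centres

# The closest centre for each observation is the column with the smallest distance
nearest = np.argmin(distances, axis=1)

print(distances.shape)
print(np.array_equal(nearest, kmeans.labels_))   # compare with the stored assignment
```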

Let’s view the cluster to which each observation or sample (in our example, UK county) in our data set was assigned:

kmeans_k4.labels_
array([2, 1, 0, 3, 3, 0, 0, 2, 1, 1, 3, 0, 3, 2, 1, 2, 2, 2, 3, 2, 2, 0,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 3, 0, 1, 0, 3, 0, 0, 0, 1, 0, 3, 0, 0,
       2, 1, 1, 3, 0, 1, 2, 1, 2, 1, 0, 1, 3, 0, 2, 0, 3, 0, 0, 1, 0, 0,
       1, 1, 0, 3, 3, 1, 1, 3, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 3, 0,
       1, 0, 1, 0, 2, 1, 2, 0, 1, 3, 1, 1, 0, 2, 0, 1, 3, 0, 0, 1, 0, 1,
       1, 0, 1, 2, 1, 1, 2, 0, 1, 0, 2, 1, 1, 1, 0, 1, 1, 1, 1, 2, 3, 3,
       1, 1, 1, 1, 2, 2, 1, 1, 2], dtype=int32)

Here are the centres of the four detected clusters, with one row per cluster and one column per mobility category (in standardised units):

kmeans_k4.cluster_centers_
array([[ 0.69109462,  0.74435434,  0.40560323,  0.94545818,  0.67549184,
        -0.79707789],
       [-0.1880133 , -0.39069686,  0.20618377, -0.29358717, -0.12027395,
         0.1497298 ],
       [-1.55607267, -1.11419698,  0.19192262, -1.18963518, -1.699088  ,
         1.57980498],
       [ 0.53248676,  0.43544327, -1.71033125, -0.21185365,  0.55451495,
        -0.20660955]])

K-means is an iterative algorithm: a set of operations is repeated until the cluster assignments no longer change, at which point the sum of squared distances from each observation to its cluster centroid has reached a (local) minimum. How many iterations were needed for the algorithm to converge in our case?

kmeans_k4.n_iter_
12
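To make the idea of iterating until the assignment stops changing concrete, here is a minimal NumPy sketch of the assign-then-update cycle. It is not scikit-learn's implementation: the function name kmeans_sketch, the toy two-dimensional data, and the hand-picked starting centroids are all assumptions for illustration, and the sketch assumes that no cluster ever becomes empty.

```python
import numpy as np

def kmeans_sketch(X, initial_centroids, max_iter=300):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = np.asarray(initial_centroids, dtype=float)
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop when the assignment is the same as in the previous iteration
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, centroids, iteration

# Two well-separated toy groups; both starting centroids sit in the first group
X_toy = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids, n_iter = kmeans_sketch(X_toy, initial_centroids=X_toy[:2])

print(labels)     # cluster assignment of each point
print(centroids)  # final centroids
print(n_iter)     # number of passes, including the final check
```

Even with a poor starting position, the centroids move towards the two groups within a few passes on this toy data.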

As we did earlier, we add the cluster assignment as a column to our DataFrame. We name the column ‘clusters_k4’.

# Add the 4-cluster assignment to your DataFrame
mobility_trends_UK_mean_NaNdrop['clusters_k4'] = kmeans_k4.labels_
mobility_trends_UK_mean_NaNdrop
<ipython-input-68-fbc6feb133eb>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mobility_trends_UK_mean_NaNdrop['clusters_k4'] = kmeans_k4.labels_
Retail_Recreation Grocery_Pharmacy Parks Transit_stations Workplaces Residential clusters clusters_k4 clusters_k4_pca
sub_region_1
Aberdeen City -52.264192 -12.564045 18.693023 -46.984716 -43.305677 15.123043 2 2 1
Aberdeenshire -31.193622 -13.107865 17.706422 -40.653759 -38.587336 12.817768 0 1 2
Angus Council -28.943052 -7.624146 12.408537 -32.275626 -34.491266 11.363636 1 0 0
Antrim and Newtownabbey -32.066059 -8.863326 -29.134328 -55.330296 -34.794760 13.557692 0 3 3
Ards and North Down -29.938497 -1.697039 2.662037 -42.958333 -37.139738 13.314286 1 3 3
... ... ... ... ... ... ... ... ... ...
Windsor and Maidenhead -45.835991 -12.453303 -2.466970 -45.020501 -44.469432 17.438961 2 2 1
Wokingham -41.587699 -17.482916 28.176471 -51.943052 -45.740175 18.969697 2 2 1
Worcestershire -38.803991 -11.399360 23.924497 -36.042028 -34.537983 12.685216 0 1 2
Wrexham Principal Area -44.521640 -11.731207 -4.040146 -40.788155 -32.085153 11.788030 0 1 2
York -44.825991 -14.640625 -4.085648 -49.212581 -45.432314 14.630385 2 2 1

141 rows × 9 columns

Our next step will be to assess the way in which clusters are similar or different with respect to each mobility category. To accomplish this, we plot the clusters against each mobility category using the Seaborn function catplot.

# Create a list 'mobility_categories' with the names of the six mobility categories
mobility_categories = ['Retail_Recreation',
                 'Grocery_Pharmacy',
                 'Parks',
                 'Transit_stations',
                 'Workplaces',
                 'Residential']

# Use a for loop to plot the clusters across the six mobility categories

for mobility_category in mobility_categories:
    sns.catplot(x = 'clusters_k4',
                y = mobility_category,
                kind = 'swarm',
                data = mobility_trends_UK_mean_NaNdrop)
[Figures: six swarm plots, one per mobility category, showing the values by cluster (clusters_k4)]
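A numerical summary can complement the plots. The sketch below shows one possible approach, using the pandas groupby() method to compute the mean of each mobility category within each cluster. It uses a small DataFrame with hypothetical values and only two mobility categories; the names toy and cluster_means are illustrative.

```python
import pandas as pd

# Hypothetical values for illustration only (not the lab's data)
toy = pd.DataFrame({
    'Workplaces': [-45.0, -43.0, -34.0, -32.0],
    'Residential': [17.0, 15.0, 12.0, 11.0],
    'clusters_k4': [2, 2, 0, 0],
})

# Mean of each mobility category within each cluster
cluster_means = toy.groupby('clusters_k4').mean()
print(cluster_means)
```

Applied to the lab's DataFrame, the same call would give one row per cluster, which makes differences between clusters easier to compare across categories.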

Dimensionality Reduction via PCA

The above plots visualise each variable separately. A more informative approach would be to take into account all six dimensions simultaneously. However, it is difficult to visualise and perceive multidimensional data beyond two or three dimensions. One solution is to use the dimensionality reduction technique Principal Component Analysis (PCA).

We can apply PCA to reduce the six mobility trends to just two dimensions, and then use those two-dimensional approximations to visualise our clusters using a scatter plot.

The sklearn library is very consistent, so the workflow we used to run k-means applies to PCA too. We first initialise the PCA estimator using the default arguments except for n_components, which we set to 2 to keep only two components. We would then fit the estimator using the fit() method. Below we instead use the fit_transform() method to simultaneously fit the estimator to the data and apply the dimensionality-reduction transformation to it.
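The relationship between fit(), transform(), and fit_transform() described above can be sketched on synthetic data. The example below assumes data generated with make_blobs and standardised with StandardScaler as a stand-in for the mobility data; all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic six-dimensional data, standardised (stand-in for the mobility data)
X, _ = make_blobs(n_samples=100, centers=3, n_features=6, random_state=0)
X_standardised = StandardScaler().fit_transform(X)

# Two steps: fit the estimator, then apply the transformation
pca_two_steps = PCA(n_components=2)
pca_two_steps.fit(X_standardised)
components_two_steps = pca_two_steps.transform(X_standardised)

# One step: fit and transform at once
pca_one_step = PCA(n_components=2)
components_one_step = pca_one_step.fit_transform(X_standardised)

print(components_one_step.shape)                               # rows: samples, columns: components
print(np.allclose(components_two_steps, components_one_step))  # compare the two routes
print(pca_one_step.explained_variance_ratio_)                  # share of variance kept by each component
```

The explained_variance_ratio_ attribute indicates how much of the variation in the original six dimensions is retained by the two components, which helps judge how faithful the two-dimensional picture is.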

# We reuse our standardised data set
mobility_trends_UK_standardised
array([[-2.49843261e+00, -6.52348780e-01,  3.08966775e-01,
        -7.74189911e-01, -1.66343960e+00,  1.26238427e+00],
       [ 1.38534410e+00, -7.69349838e-01,  2.64734313e-01,
        -1.08346777e-01, -5.94368434e-01,  6.76195967e-02],
       [ 1.80017432e+00,  4.10453600e-01,  2.72133058e-02,
         7.72802976e-01,  3.33710002e-01, -6.86019257e-01],
       [ 1.22453441e+00,  1.43848278e-01, -1.83528509e+00,
        -1.65191607e+00,  2.64945129e-01,  4.51103517e-01],
       [ 1.61669173e+00,  1.68565038e+00, -4.09753181e-01,
        -3.50724849e-01, -2.66374829e-01,  3.24952139e-01],
       [ 1.72099145e+00, -5.12048798e-02,  2.01210895e-01,
         2.24593779e+00,  2.43732268e-01, -1.67760996e+00],
       [ 1.19417468e+00,  6.64638339e-01, -4.37407370e-01,
         6.26503688e-01,  9.87718367e-01, -8.36644024e-01],
       [-2.42805112e+00, -1.97723230e+00,  4.14151998e-01,
        -9.66018370e-01, -2.10769036e+00,  1.37179090e+00],
       [-2.20770304e-01, -9.25995883e-01, -4.07036708e-01,
        -8.30180439e-01,  5.17247902e-01,  5.58668604e-01],
       [-6.56565795e-01, -8.01320694e-01, -7.94619574e-01,
        -1.06479528e+00, -1.14200107e+00,  5.12794654e-01],
       [-3.29399441e-01, -1.56082709e-01, -2.21034468e+00,
         5.25563567e-01,  1.28306102e+00, -6.12253162e-01],
       [ 2.97464445e-01,  6.15308425e-01, -1.86588500e-01,
         1.82476833e+00,  1.16680386e+00, -1.43535480e+00],
       [ 6.10266585e-01,  4.53415326e-02, -6.30872044e-01,
        -9.08579433e-02,  6.95343977e-01, -2.40603308e-01],
       [-9.13856595e-01, -7.47613266e-01,  9.54113258e-02,
        -1.03693200e+00, -1.43686181e+00,  2.11724111e+00],
       [-3.55851166e-01,  5.36895095e-01, -3.85117290e-01,
        -7.61422968e-01,  3.42120094e-01,  6.48095266e-02],
       [-1.21294121e+00, -1.54619023e+00,  7.11595012e-01,
        -1.10944829e+00, -2.02754713e+00,  1.25823898e+00],
       [-2.31626130e+00, -1.55897048e+00,  1.51525103e+00,
        -1.89824262e+00, -1.78950569e+00,  1.84198948e+00],
       [-3.63683145e-01, -1.18998945e+00,  3.19708351e-01,
        -3.70825140e-01, -1.00322735e+00,  1.56778305e+00],
       [ 9.95706001e-01, -3.70249116e-01, -7.57062960e-01,
        -9.97221469e-02,  5.17742613e-01, -1.91848217e-01],
       [-1.04546664e+00, -1.00595752e+00,  2.63832740e-01,
        -4.46138118e-01, -7.68736679e-01,  1.23955122e+00],
       [-2.37227305e+00, -1.89050022e+00, -1.02329156e+00,
        -1.88622938e+00, -2.51376143e+00,  1.38514857e+00],
       [-3.71386306e-01,  1.92530112e+00,  1.92513961e-02,
        -2.56402935e-01,  6.41420443e-01, -6.77102473e-01],
       [ 1.68890913e+00,  2.39088026e+00, -5.53193337e-01,
         9.58102855e-01,  1.08270294e+00, -9.68372902e-01],
       [ 3.96283597e-01,  1.56264428e-01,  7.32317813e-01,
         8.38992098e-01, -1.15487876e-01,  9.15094655e-01],
       [-3.69658546e-01, -3.53639079e-01, -3.76301976e-01,
        -6.11098945e-01, -6.52256419e-01, -1.13494868e+00],
       [-1.42308341e-01, -4.92511172e-01,  5.38929572e-01,
        -2.28160934e-02, -3.79211935e-01,  4.25670959e-01],
       [-5.37789118e-01, -2.84319767e-01,  6.85387095e-02,
        -2.65219848e-01, -5.07491331e-01,  2.97219359e-01],
       [ 1.03065263e-01,  2.23993460e+00,  7.25278653e-01,
         1.05334304e+00,  1.82823049e-01, -7.69498868e-01],
       [ 2.59940804e+00, -5.02292007e-01,  2.27907677e+00,
         1.54082753e+00, -3.38502377e-01, -1.33355748e+00],
       [ 5.25242280e-01,  2.09033584e-01,  6.29923549e-01,
         7.74642648e-01,  6.81364117e-01, -4.88010861e-01],
       [ 7.58613405e-01,  6.92649697e-01,  7.35710019e-01,
         1.21313144e+00,  3.98189517e-02, -7.87599139e-01],
       [ 2.71665659e-03, -4.30529363e-01, -1.40637686e+00,
        -7.89453017e-01,  6.34989196e-01, -1.61201105e-01],
       [ 8.22300251e-01,  1.01031557e+00, -6.35070060e-01,
         4.18953551e-01,  8.00335211e-01, -1.14272683e+00],
       [-4.40463747e-01,  2.62820227e-01,  7.01144550e-01,
        -1.38048239e+00, -1.07077784e-01, -1.69584637e-01],
       [-3.58763202e-01,  6.85869588e-01,  9.92414134e-01,
         1.33221204e+00,  3.33158245e-01, -1.12614076e-01],
       [ 8.21880382e-01,  1.54989730e+00, -1.00948560e+00,
         1.21882943e-01,  7.63119427e-01, -7.22340063e-01],
       [ 6.41921784e-01,  6.74930524e-01,  1.49723149e+00,
         1.74863593e+00,  5.70498252e-01, -6.82729467e-01],
       [ 1.69177080e-01,  5.20470614e-01,  2.21415871e+00,
         1.16033766e+00, -4.73634265e-02, -3.51368143e-01],
       [ 7.05156899e-01,  1.23183323e+00,  2.35433663e-01,
         1.23781431e+00,  1.30186005e+00, -1.61541359e+00],
       [-5.90249011e-01, -9.76972256e-01, -3.39483746e-01,
        -2.72454331e-01, -8.62996680e-01, -3.36788235e-01],
       [ 5.15560085e-01,  9.83360992e-01, -4.11318684e-01,
         1.50206340e+00, -5.11256932e-01,  4.70406332e-01],
       [ 2.56507205e+00,  1.39797147e+00, -2.77698616e+00,
        -2.18299439e-01, -1.42399932e+00,  1.53120784e+00],
       [ 4.99838282e-01,  7.63149869e-01,  2.01849037e+00,
         1.40407802e+00,  9.45358844e-01, -6.64492190e-01],
       [ 3.85694866e-01,  3.97205051e-01,  1.32369833e+00,
         6.71551290e-01,  6.56626909e-02, -2.75123183e-01],
       [-2.99396368e+00, -1.57241714e+00, -4.46555554e-01,
        -2.36834591e+00, -3.49176895e+00,  2.58589179e+00],
       [ 3.47341070e-01, -1.12951007e-01,  5.56196817e-01,
        -7.26491527e-01, -3.14795849e-01,  6.70782852e-01],
       [ 5.98090395e-01, -8.07893514e-01, -1.42414639e-01,
        -7.60943822e-01, -1.55561897e-02,  2.42173904e-01],
       [ 5.22094170e-01,  1.75769262e+00, -1.57646845e+00,
         3.33797358e-01,  1.25882017e+00, -1.16573187e+00],
       [ 7.53163698e-01,  8.09032865e-01,  6.29515046e-01,
         2.18665267e-01,  4.03734726e-01, -3.12646502e-01],
       [ 3.05022081e-01, -8.15244763e-01, -8.88838682e-01,
        -2.06571736e-01,  8.63543422e-02, -1.55346066e-01],
       [-2.12518578e+00, -1.45278823e+00, -6.17953788e-01,
        -2.33071952e+00, -1.74262752e+00,  1.02548409e+00],
       [-2.45784485e-01, -2.84838372e-01,  6.30476850e-01,
         5.79029488e-01,  4.65130290e-04,  4.09360789e-01],
       [-1.66019597e+00, -1.34124903e+00,  8.91555661e-01,
        -1.38223300e+00, -2.54572823e+00,  2.33390308e+00],
       [-1.52857345e-01, -5.66193815e-01,  5.13392550e-01,
        -4.71697881e-01, -2.51814353e-01,  3.10318376e-01],
       [ 5.60302217e-01,  2.38434358e-01, -2.54364769e-01,
         1.20294269e+00,  4.00907560e-01, -1.32111312e+00],
       [-3.24001935e-01, -1.17564311e-01,  8.09809636e-01,
         1.33328164e-01, -1.51161460e-01,  8.41790869e-01],
       [-6.62775145e-01, -7.09738174e-03, -2.62729318e+00,
         1.04855158e+00,  1.56863988e+00, -1.27582918e+00],
       [-1.44237369e-01, -7.63786016e-01,  3.93005769e-01,
         9.07682615e-01,  1.57147772e+00, -8.53604285e-01],
       [-5.93941903e-01, -3.35117372e-01,  4.00390409e-01,
        -8.90563516e-01, -1.01686485e+00,  1.60401873e+00],
       [ 8.52530793e-01, -3.67446106e-01,  9.47008348e-01,
         2.97729577e-01, -1.26866237e-01, -1.26461846e+00],
       [ 1.90235411e-01,  5.62379427e-01, -3.05256783e+00,
         9.11422583e-01, -1.96004116e-01, -4.55452361e-01],
       [ 8.28981878e-01,  9.09505319e-01, -6.08987614e-01,
         1.25256532e+00,  1.13972095e+00, -1.36631435e+00],
       [ 1.18212768e+00,  1.36513589e+00,  9.83704547e-01,
         1.17744189e+00,  1.26723026e+00, -1.29542248e+00],
       [ 3.06276500e-01,  2.02249301e-02,  5.68865448e-01,
         1.83604344e-01,  8.70438543e-02,  3.06588999e-01],
       [-2.08471842e-01, -3.34396987e-01, -1.38502311e-01,
         5.93831416e-02,  1.28547127e+00, -1.27659199e+00],
       [ 3.37161414e-01,  5.49291405e-02,  1.25509087e+00,
         6.53068302e-01,  7.21842957e-01, -2.35225508e-01],
       [-1.56546558e+00, -1.26407130e+00,  8.61239370e-01,
        -9.11996981e-01,  2.52135944e-01, -3.29401058e-01],
       [-1.93640219e-02,  3.97907377e-02,  2.46485307e-01,
         3.62153701e-01, -1.52298951e-01,  6.00762672e-01],
       [ 1.05424854e+00,  1.25151786e+00,  8.54668093e-01,
         1.02522369e+00,  1.46140971e+00, -1.08937685e+00],
       [ 6.92980708e-01, -6.78677124e-02, -2.04653117e+00,
         4.22242896e-01, -5.35992498e-01,  1.13095801e+00],
       [ 3.01663132e-01, -3.57997033e-01, -1.08715575e+00,
        -2.45810527e+00,  1.27910333e+00, -2.39033293e-01],
       [-2.29110681e-01,  2.81350711e-01,  5.17041028e-01,
        -7.97753349e-02, -1.57057578e-01,  1.68782233e-01],
       [ 5.92613830e-02, -3.83973456e-02,  2.87276396e-01,
        -3.10889771e-01, -8.51279709e-02, -2.22682419e-01],
       [ 1.14014081e+00,  5.78552176e-01, -3.17425787e+00,
         4.09630578e-01,  1.99544532e+00, -5.57521500e-01],
       [ 1.29675182e+00,  1.48373605e+00, -9.29224750e-01,
         8.54976539e-01,  5.88981043e-01, -9.20637256e-02],
       [-6.22047887e-01,  7.27861981e-02, -1.00213645e+00,
        -7.00571408e-01,  1.30087063e+00, -1.23665020e+00],
       [ 6.80384649e-01,  1.03776024e+00, -2.69743257e-01,
         8.71745555e-01, -7.06173192e-01,  6.44597166e-01],
       [-2.91893589e-01, -1.40131998e+00,  2.48892653e-01,
        -9.76560624e-01, -2.91605106e-01,  8.95726935e-01],
       [-1.99240161e-01, -1.65671781e+00, -8.12095315e-01,
         5.54312336e-01, -5.35796396e-01,  7.76500483e-02],
       [ 7.77794175e-01, -5.41778275e-01, -1.42855415e-01,
         4.29139291e-01,  1.02772237e+00, -1.96805814e+00],
       [ 7.86611416e-01,  3.47739497e+00, -2.73195848e-01,
        -9.09958271e-01,  5.87496909e-01, -4.65916044e-01],
       [-4.85446486e-01, -3.88872282e-01, -2.60831897e-01,
        -4.80882902e-01,  2.00630253e-02,  1.90660379e-01],
       [ 1.26484180e+00,  2.13995761e+00, -2.69873047e-01,
         8.35562329e-01,  9.85739521e-01, -2.62201557e-01],
       [ 8.94434626e-01,  5.46505355e-02,  1.01002514e+00,
         7.82274452e-01,  8.96511417e-01, -6.76428844e-01],
       [ 1.49954837e+00,  2.15269977e+00, -9.28964168e-01,
         1.59070544e+00,  4.01980164e-01, -6.13444729e-01],
       [ 5.13696797e-01,  3.50173353e-01,  9.24042593e-03,
         1.85136094e+00,  1.91233382e+00, -1.93083044e+00],
       [ 7.36347950e-01, -4.73803288e-02, -2.54659347e+00,
        -2.20970456e-01,  2.85776660e-01, -1.20527586e-01],
       [ 8.00886950e-01,  7.82426834e-01, -4.65740623e-01,
         1.77421841e+00,  1.65310509e+00, -1.40009099e+00],
       [-6.15051153e-02, -8.22065301e-02,  8.94198917e-01,
        -1.61512916e+00,  4.03461894e-02,  4.08362949e-01],
       [ 5.86789425e-01,  1.91762545e-01,  7.40288246e-01,
         1.19873876e+00, -1.18060551e-02, -3.38860947e-01],
       [-9.76734296e-02, -1.59719404e-01, -4.41908685e-01,
         5.52217031e-01,  7.95549935e-01, -1.04973548e-01],
       [ 5.18889015e-01,  1.11861653e+00,  1.18620210e+00,
         1.38395388e+00,  1.11788513e-01, -6.86426305e-01],
       [-3.07381021e+00, -1.41012986e+00,  1.32453741e+00,
        -1.64619110e+00, -8.47603754e-01, -1.58504581e-01],
       [-3.11327625e-01, -1.84680242e-01,  1.76051523e-01,
         8.42576746e-01,  4.11249464e-01, -1.54355658e-01],
       [-5.39368701e-01, -1.07453292e+00,  1.35394013e-01,
        -5.07297040e-01, -1.04735587e+00,  1.45717321e+00],
       [ 9.50780056e-01,  1.62389988e+00,  5.58377041e-01,
         4.40787045e-01,  1.09437138e+00, -1.65813833e+00],
       [-7.00563323e-01, -1.06126659e+00,  2.41948107e-01,
         6.67811371e-02, -5.68148734e-01, -1.25443886e-01],
       [ 4.45223431e-03, -2.16551450e-01, -1.08883610e+00,
        -9.03250225e-01,  1.31373313e+00, -7.87916212e-01],
       [-9.09620586e-01, -9.85557750e-01,  8.08949859e-01,
         5.94250764e-01,  1.44731093e-02, -7.47443600e-01],
       [-8.81992315e-01, -1.08356378e+00,  1.32924848e-01,
         3.51707796e-01, -7.88789982e-01, -2.95934825e-02],
       [ 2.81509437e-01, -4.00144198e-01,  2.19292245e-01,
         2.40550009e-01,  1.02778998e+00, -6.24031652e-01],
       [-2.00007062e+00, -6.81404341e-01,  3.38562793e-01,
        -2.40642564e+00, -1.90238516e+00,  2.54365967e+00],
       [-6.84608315e-01,  4.77595014e-01, -5.88641094e-01,
         1.12593368e+00,  1.23111634e+00, -1.15364970e+00],
       [-5.65920412e-01,  1.74053971e+00, -3.08872935e-01,
        -1.96419271e+00, -6.61649173e-01,  6.29690687e-01],
       [-1.70171697e-02,  3.01357950e-01, -7.23110275e-01,
        -5.58983723e-01,  6.18663722e-01, -2.29191456e-01],
       [ 1.16895742e+00, -1.79606708e-01, -2.69230026e-02,
         4.22307574e-01,  4.81372089e-01, -9.08216462e-01],
       [ 5.63775182e-01,  1.84690491e-01,  6.11529009e-01,
         7.62043821e-01,  1.19309416e+00, -4.83542979e-01],
       [ 6.72407145e-01, -2.28794512e+00,  1.84117501e-02,
        -1.27123446e+00, -6.56702060e-01,  1.03276673e+00],
       [ 2.39621118e-01,  1.32769873e-01,  8.16376482e-01,
         2.02000727e+00,  9.17896123e-01, -6.86019257e-01],
       [-6.40424579e-02,  1.60511111e-01, -4.94346348e-01,
         3.90204782e-01, -1.06169175e-01, -9.02330206e-01],
       [-3.44779140e-02, -3.92820410e-01,  1.57764030e+00,
        -5.98289922e-01, -6.02163363e-01,  1.14240254e+00],
       [ 1.35876745e+00,  6.57410172e-01,  1.25160975e+00,
         1.53073997e-01,  2.61202063e-01,  2.72646009e-01],
       [-1.19568095e-01, -3.69939425e-02,  4.71556949e-01,
        -2.50525319e-01,  5.94639517e-01, -5.02084688e-01],
       [-2.01711838e+00, -6.05667704e-01, -1.68801361e-01,
        -1.03313936e+00, -1.12627968e+00,  1.71040382e-01],
       [-2.76495992e-01,  7.27537503e-01,  9.18542293e-01,
        -1.05416804e+00, -6.20093422e-01,  6.96458798e-01],
       [ 8.07199790e-02, -2.06632216e-01,  3.50041182e-01,
         6.26161277e-01,  5.21808402e-01, -1.41119953e-01],
       [-1.36815447e+00,  2.07559109e-01, -1.04228246e-01,
        -1.18930047e+00, -1.37792468e+00,  2.53912011e-01],
       [ 2.82644883e-01,  9.52048010e-01,  7.68737946e-02,
         3.32946818e-01,  9.62488089e-01, -6.10080854e-01],
       [-1.85088554e-01, -1.15644752e-02,  2.37115889e-01,
        -7.70526744e-01,  1.14293860e+00, -1.06548458e+00],
       [ 4.02548874e-02, -4.60119564e-02,  6.76374959e-01,
         1.03071820e+00,  4.51568133e-01, -3.30061478e-01],
       [-7.64171885e-01, -9.56731742e-01,  5.35619067e-01,
        -7.50653714e-01, -1.52038368e+00,  2.15529889e+00],
       [-5.39634731e-01, -8.77203921e-01,  6.54206456e-01,
        -9.08520833e-01, -5.60728064e-01, -2.37557699e-03],
       [-5.47754972e-01, -1.18390329e+00, -8.38029318e-01,
        -8.14368616e-01,  7.99230950e-02,  3.31766365e-01],
       [-3.10429358e-01, -8.51282519e-01, -6.34055819e-01,
         7.23385634e-01,  5.22168573e-03,  3.19110813e-02],
       [ 4.36860835e-01,  2.05419303e+00,  1.47127587e+00,
         1.16690068e+00,  7.50256933e-01, -1.24017495e+00],
       [-5.78514417e-01, -1.48332935e-01,  3.43385103e-01,
        -1.16474087e-01, -4.95289986e-01,  4.79077838e-02],
       [ 1.51172457e+00,  7.85857417e-01,  1.46377429e-01,
        -4.38718042e-01, -5.25603560e-01,  8.46975243e-01],
       [-2.78061714e-01, -7.02551227e-01,  9.28798884e-01,
        -2.98088649e-01, -2.30260902e-01,  2.15281599e-01],
       [-4.86692947e-01,  2.77971037e-01,  2.41681787e-01,
        -2.92355291e-02, -1.33518433e-01,  6.74621059e-01],
       [-9.75157417e-01, -1.28278424e+00, -5.40121013e-02,
        -3.07301323e-01, -1.16279174e+00,  2.01398036e+00],
       [ 7.40845734e-01,  7.44690419e-01, -3.23029934e+00,
         3.99308559e-01,  6.40923328e-02,  2.68639483e-02],
       [ 1.15346895e-01,  1.52245264e+00, -1.01520856e+00,
        -3.66618467e-01, -3.31676723e-01,  3.98823052e-01],
       [-4.54069436e-01, -4.34319587e-01,  2.57671043e-01,
        -6.60323197e-01,  2.36916642e-02, -5.10287756e-02],
       [ 3.72745246e-02, -6.35042192e-02,  1.31879891e+00,
        -9.72265570e-01, -1.14948999e-01,  4.00101180e-01],
       [-3.41748013e-01, -2.66390886e-01,  3.46414332e-01,
        -4.12779885e-01,  3.25197532e-02,  1.01563370e-01],
       [ 3.72302411e-01, -9.61753268e-01,  1.95991553e-01,
         3.91618263e-01,  7.64010938e-02,  1.62065935e-01],
       [-1.31357154e+00, -6.28523021e-01, -6.39702852e-01,
        -5.67608354e-01, -1.92712073e+00,  2.46266529e+00],
       [-5.30516522e-01, -1.71062697e+00,  7.34139797e-01,
        -1.29567091e+00, -2.21504272e+00,  3.25600636e+00],
       [-1.74169030e-02, -4.01770804e-01,  5.43510351e-01,
         3.76680856e-01,  3.23125151e-01, -1.07868607e-03],
       [-1.07130734e+00, -4.73166612e-01, -7.10233303e-01,
        -1.22481589e-01,  8.78881876e-01, -4.66066987e-01],
       [-1.12740604e+00, -1.09911786e+00, -7.12273309e-01,
        -1.00850024e+00, -2.14528842e+00,  1.00705281e+00]])
pca = PCA(n_components=2) # Initialise the Principal component analysis (PCA) algorithm with 2 components
pca_components = pca.fit_transform(mobility_trends_UK_standardised) # Apply the dimensionality reduction on the six mobility categories
pca_components # Transformed values arranged as observations/samples in rows and number of components in columns
array([[ 3.10066587, -0.31481648],
       [ 0.08817084, -0.2347937 ],
       [-1.76973924, -0.09080896],
       [ 0.11343852,  2.32664172],
       [-0.97864868,  0.81148741],
       [-2.60357739, -0.78192166],
       [-1.95538128,  0.39062258],
       [ 3.96296225, -0.52643396],
       [ 0.8264886 ,  0.54213837],
       [ 1.8230288 ,  0.8862566 ],
       [-1.0476404 ,  1.92709737],
       [-2.42129819, -0.26967682],
       [-0.72626121,  0.68151242],
       [ 2.84364859,  0.13429345],
       [ 0.11458069,  0.62192626],
       [ 3.24543493, -0.64006495],
       [ 4.27965497, -1.22403224],
       [ 2.03882393, -0.29187474],
       [-0.6201656 ,  0.77869446],
       [ 2.02450709, -0.26215724],
       [ 4.42828514,  1.08010449],
       [-1.13318337,  0.22326345],
       [-3.13141989,  0.64819894],
       [-0.07850344, -0.80141191],
       [ 0.33297606,  0.33400634],
       [ 0.68351857, -0.55717831],
       [ 0.85289085, -0.06506165],
       [-1.82466366, -0.7439942 ],
       [-1.93971673, -2.56243298],
       [-1.17420882, -0.75007061],
       [-1.49048393, -0.93985599],
       [ 0.06108238,  1.50537594],
       [-1.9073019 ,  0.61592368],
       [ 0.69916459, -0.32512211],
       [-0.86333871, -1.23089059],
       [-1.80220167,  1.14510504],
       [-1.83067518, -1.78499424],
       [-0.81304503, -2.36443585],
       [-2.71911752, -0.42475157],
       [ 1.01180926,  0.17515995],
       [-0.83753584,  0.17012825],
       [-0.34113965,  3.1110902 ],
       [-1.79698038, -2.17447202],
       [-0.71104924, -1.37865516],
       [ 5.84517766,  0.70575734],
       [ 0.70833849, -0.28855901],
       [ 0.51357156,  0.29866585],
       [-2.3258072 ,  1.61765103],
       [-1.0587594 , -0.51461027],
       [ 0.12567985,  0.81557464],
       [ 3.8200634 ,  0.8823183 ],
       [ 0.19596792, -0.76994137],
       [ 4.23937724, -0.68399429],
       [ 0.79763229, -0.42942418],
       [-1.69286821, -0.09076635],
       [ 0.64230962, -0.78227686],
       [-1.66267895,  2.14086659],
       [-1.15728939, -0.71584431],
       [ 2.04689236, -0.14574328],
       [-0.83213855, -1.06571801],
       [-0.99672759,  2.70374821],
       [-2.5001739 ,  0.35840278],
       [-2.74713061, -1.04745663],
       [-0.08892388, -0.53737382],
       [-1.02288406,  0.01277504],
       [-0.84312242, -1.32348051],
       [ 1.36656543, -0.85925751],
       [ 0.20220353, -0.28447863],
       [-2.58249782, -0.88376758],
       [ 0.21630311,  1.94572705],
       [ 0.302325  ,  1.68577295],
       [ 0.20332713, -0.44703536],
       [ 0.07801594, -0.21351392],
       [-2.30569019,  3.09890658],
       [-1.91931586,  0.946176  ],
       [-0.71194096,  1.05513416],
       [-0.47281934,  0.22373733],
       [ 1.697401  , -0.11108234],
       [ 0.76726255,  0.40197734],
       [-1.73054194, -0.09829325],
       [-1.87420856,  0.95304271],
       [ 0.64530415,  0.30841708],
       [-2.39791343,  0.4020531 ],
       [-1.44655902, -1.11198057],
       [-2.7557605 ,  0.79989079],
       [-2.99917922, -0.48823413],
       [-0.53862251,  2.52917058],
       [-2.91905913,  0.08687499],
       [ 0.9874331 , -0.41053848],
       [-0.97032903, -0.97909605],
       [-0.58973964,  0.27405037],
       [-1.59629107, -1.36862405],
       [ 3.03841669, -1.26382926],
       [-0.41795917, -0.42686102],
       [ 2.08739104, -0.0901682 ],
       [-2.53965286, -0.47189495],
       [ 0.93507641, -0.45599165],
       [-0.57941138,  1.23631726],
       [ 0.22438499, -1.16434199],
       [ 1.04362631, -0.44331397],
       [-0.83961966, -0.3068724 ],
       [ 4.32585235,  0.20519242],
       [-1.55268399,  0.23327268],
       [ 0.99409778,  0.99517549],
       [-0.31511152,  0.87484391],
       [-1.27960584, -0.07458619],
       [-1.42046143, -0.70874405],
       [ 1.99339307,  0.14447728],
       [-1.75946168, -1.28741417],
       [-0.60691971,  0.30825462],
       [ 1.34765171, -1.32868549],
       [-0.85546408, -1.02410614],
       [-0.31854633, -0.41200797],
       [ 2.1892709 ,  0.18283198],
       [ 0.96050483, -0.50259224],
       [-0.52267881, -0.50529065],
       [ 1.807458  ,  0.30140035],
       [-1.40004455, -0.03617891],
       [-0.61340337, -0.07748783],
       [-0.78409639, -0.9225624 ],
       [ 2.82037645, -0.37364667],
       [ 1.29326195, -0.55087716],
       [ 1.14712605,  0.85898696],
       [ 0.14271756,  0.29425895],
       [-2.39667593, -1.49598778],
       [ 0.64270297, -0.37110398],
       [-0.12892157,  0.22150424],
       [ 0.79961738, -0.90159014],
       [ 0.50551972, -0.17637091],
       [ 2.58058449,  0.02122199],
       [-0.99562927,  3.11526049],
       [-0.21820972,  1.26861145],
       [ 0.64198819, -0.16012142],
       [ 0.74976976, -0.97955944],
       [ 0.48868351, -0.26878224],
       [ 0.10743379, -0.36054286],
       [ 3.12144536,  0.70472146],
       [ 4.12388975, -0.43484051],
       [-0.11977148, -0.65146618],
       [ 0.03301601,  0.57357227],
       [ 2.84303782,  0.72319596]])
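
Before clustering on the two components, it can help to check how much of the variation in the six standardised variables they retain. The sketch below uses synthetic data as a stand-in: the array `X_std`, the random seed, and the names `pca_demo` and `components_demo` are assumptions for illustration, not the lab's data. With the lab's own objects, the same information would be read from `pca.explained_variance_ratio_` after the `fit_transform` call above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardised mobility data (assumption: 141 rows, 6 columns)
rng = np.random.default_rng(0)
latent = rng.normal(size=(141, 2))    # two hidden factors driving the data
mixing = rng.normal(size=(2, 6))      # how the factors map onto six variables
X = latent @ mixing + 0.3 * rng.normal(size=(141, 6))
X_std = StandardScaler().fit_transform(X)

# Fit PCA with two components, as in the lab
pca_demo = PCA(n_components=2)
components_demo = pca_demo.fit_transform(X_std)

# Share of the total variance captured by each retained component
print(pca_demo.explained_variance_ratio_)

# Share of the total variance retained by the two components together
print(pca_demo.explained_variance_ratio_.sum())
```

If the combined share is low, clusters found on the two components may miss structure that is present in the original six variables.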

Now we can run the k-means algorithm on the two principal components:

k = 4
kmeans_k4_pca = KMeans(n_clusters=k,
       init = 'k-means++',
       n_init=10,
       max_iter=300,
       random_state=0)
kmeans_k4_pca.fit(pca_components)
KMeans(n_clusters=4, random_state=0)
# Cluster label assigned to each observation
kmeans_k4_pca.labels_
array([1, 2, 0, 3, 3, 0, 0, 1, 2, 1, 3, 0, 3, 1, 2, 1, 1, 1, 3, 1, 1, 0,
       0, 2, 2, 2, 2, 0, 0, 0, 0, 3, 0, 2, 0, 3, 0, 0, 0, 2, 0, 3, 0, 0,
       1, 2, 2, 3, 0, 2, 1, 2, 1, 2, 0, 2, 3, 0, 1, 0, 3, 0, 0, 2, 0, 0,
       2, 2, 0, 3, 3, 2, 2, 3, 3, 3, 2, 2, 2, 0, 3, 2, 0, 0, 0, 0, 3, 0,
       2, 0, 2, 0, 1, 2, 1, 0, 2, 3, 2, 2, 0, 1, 0, 2, 3, 0, 0, 1, 0, 2,
       2, 0, 2, 1, 2, 2, 1, 0, 2, 0, 1, 2, 2, 2, 0, 2, 2, 2, 2, 1, 3, 3,
       2, 2, 2, 2, 1, 1, 2, 2, 1], dtype=int32)
# Add the 4-cluster assignment on the PCA components to your DataFrame
mobility_trends_UK_mean_NaNdrop['clusters_k4_pca'] = kmeans_k4_pca.labels_
mobility_trends_UK_mean_NaNdrop
<ipython-input-74-4a5504e65acb>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  mobility_trends_UK_mean_NaNdrop['clusters_k4_pca'] = kmeans_k4_pca.labels_
Retail_Recreation Grocery_Pharmacy Parks Transit_stations Workplaces Residential clusters clusters_k4 clusters_k4_pca
sub_region_1
Aberdeen City -52.264192 -12.564045 18.693023 -46.984716 -43.305677 15.123043 2 2 1
Aberdeenshire -31.193622 -13.107865 17.706422 -40.653759 -38.587336 12.817768 0 1 2
Angus Council -28.943052 -7.624146 12.408537 -32.275626 -34.491266 11.363636 1 0 0
Antrim and Newtownabbey -32.066059 -8.863326 -29.134328 -55.330296 -34.794760 13.557692 0 3 3
Ards and North Down -29.938497 -1.697039 2.662037 -42.958333 -37.139738 13.314286 1 3 3
... ... ... ... ... ... ... ... ... ...
Windsor and Maidenhead -45.835991 -12.453303 -2.466970 -45.020501 -44.469432 17.438961 2 2 1
Wokingham -41.587699 -17.482916 28.176471 -51.943052 -45.740175 18.969697 2 2 1
Worcestershire -38.803991 -11.399360 23.924497 -36.042028 -34.537983 12.685216 0 1 2
Wrexham Principal Area -44.521640 -11.731207 -4.040146 -40.788155 -32.085153 11.788030 0 1 2
York -44.825991 -14.640625 -4.085648 -49.212581 -45.432314 14.630385 2 2 1

141 rows × 9 columns
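
Cluster numbers are arbitrary: in the rows shown above, cluster 2 in `clusters_k4` corresponds to cluster 1 in `clusters_k4_pca`, so the two columns cannot be compared value by value. A minimal sketch of one way to compare two cluster assignments is a cross-tabulation together with the adjusted Rand index from scikit-learn. The two lists below contain only the ten rows displayed in the table, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Cluster labels for the ten counties displayed in the table above
labels_six_vars = [2, 1, 0, 3, 3, 2, 2, 1, 1, 2]   # column clusters_k4
labels_pca = [1, 2, 0, 3, 3, 1, 1, 2, 2, 1]        # column clusters_k4_pca

# Cross-tabulation: non-zero cells show which label in one column
# corresponds to which label in the other
agreement_table = pd.crosstab(pd.Series(labels_six_vars, name='clusters_k4'),
                              pd.Series(labels_pca, name='clusters_k4_pca'))
print(agreement_table)

# Adjusted Rand index: 1.0 means the two partitions are identical up to relabelling
ari = adjusted_rand_score(labels_six_vars, labels_pca)
print(ari)
```

For these ten rows the two assignments agree perfectly once the labels are matched. Running the same comparison on the full columns of the DataFrame would show whether that holds for all 141 counties.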

Visualising mobility clusters

Let’s plot the resulting clusters along the two principal components using a scatter plot.

# Set figure size
plt.figure(figsize=(11.7,8.27))

# Scatter plot with the 1st principal component on the horizontal (x) axis and the 2nd principal component on the vertical (y) axis
grid = sns.scatterplot(x = pca_components[:, 0], y = pca_components[:, 1], hue=kmeans_k4_pca.labels_, alpha=0.8, s=120)

# Add labels to the horizontal x axis and vertical y axis
labels = grid.set(xlabel='1st principal component', ylabel='2nd principal component')

# Plot the cluster centroids
sns.scatterplot(x = kmeans_k4_pca.cluster_centers_[:, 0], y = kmeans_k4_pca.cluster_centers_[:, 1],
                     hue=range(k), s=220, alpha=0.8, ec='black', legend=False)

# Add title 'Cluster' to the legend and locate it in the upper right of the plot
plt.legend(title='Cluster', loc='upper right');
_images/pattern_discovery_unsupervised_learning_54_1.png

In the figure above, we plot the 1st principal component against the 2nd principal component derived from the six mobility categories. Each data point is a UK county. The larger dots mark the cluster centroids (a centroid is typically not itself a data point). Colour indicates cluster assignment.

The figure above lacks county labels, which we need in order to interpret the results of k-means clustering.

Let’s add labels to data points so that we can associate each county with its name.

# Enlarge figure size
plt.figure(figsize=(32,24))

# Scatter plot with the 1st principal component on the horizontal (x) axis and the 2nd principal component on the vertical (y) axis
grid = sns.scatterplot(x = pca_components[:,0], y = pca_components[:,1], hue = kmeans_k4_pca.labels_, alpha = 0.9, s = 120)

# Add labels to the horizontal x axis and vertical y axis
labels = grid.set(xlabel = '1st principal component', ylabel = '2nd principal component')

# Plot the cluster centroids
sns.scatterplot(x = kmeans_k4_pca.cluster_centers_[:, 0], y = kmeans_k4_pca.cluster_centers_[:, 1],
                     hue = range(k), s = 240, alpha = 0.8, ec = 'black', legend = False)

# This for loop adds the county name next to each data point
for line in range(0, mobility_trends_UK_mean_NaNdrop.shape[0]):
     grid.text(pca_components[line,0]+0.1, pca_components[line,1], # where the labels should be positioned
     mobility_trends_UK_mean_NaNdrop.index[line], # add labels to each data point 
     horizontalalignment = 'left', size = 'small', color = 'black', weight = None)

# Add title 'Cluster' to the legend and locate it in the upper right of the plot
plt.legend(title='Cluster', loc='upper right');
_images/pattern_discovery_unsupervised_learning_56_0.png

Because PCA transforms our six variables into a two-dimensional space, we can no longer see how a particular cluster or county is positioned with respect to any particular mobility category.
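
Some of that information can be partly recovered from the fitted PCA object: its `components_` attribute stores the weight of each original variable in each principal component. The sketch below runs on synthetic data (the array `X_demo`, the seed, and the names `pca_demo` and `weights` are assumptions for illustration). With the lab's objects, `pca.components_` would play the same role.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

variables = ['Retail_Recreation', 'Grocery_Pharmacy', 'Parks',
             'Transit_stations', 'Workplaces', 'Residential']

# Synthetic stand-in for the standardised data (assumption, not the lab's data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(141, 6))
# Make the fifth column correlate with the first so the components have some structure
X_demo[:, 4] = 0.8 * X_demo[:, 0] + 0.2 * X_demo[:, 4]

pca_demo = PCA(n_components=2).fit(X_demo)

# Rows of components_ are the principal components; transpose so that
# each row is an original variable and each column a component
weights = pd.DataFrame(pca_demo.components_.T, index=variables, columns=['PC1', 'PC2'])
print(weights.round(2))
```

Variables with large absolute weights on a component contribute most to it, and the sign shows the direction of the contribution. This is only a partial view, since each component mixes all six variables.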

If you need to cluster counties with regard to any pair of variables, you could run k-means on particular pairs of variables and plot the cluster assignment for those variables. For example, below we run the k-means algorithm on two variables: retail and recreation mobility and workplaces mobility.

# We first fit k-means to two variables retail_recreation and workplaces using the standardised data
# We specify the number of clusters as k = 4, but keep in mind that we did not perform the Elbow method on these two variables in particular

k = 4
kmeans_k4_2vars = KMeans(n_clusters=k,
       init = 'k-means++',
       n_init=10,
       max_iter=300,
       random_state=0)

kmeans_k4_2vars.fit(mobility_trends_UK_standardised[:, [0,4]]) # 0 indicates the retail_recreation mobility variable and 4 indicates workplaces mobility variable   
KMeans(n_clusters=4, random_state=0)
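
Since k = 4 was not chosen with the Elbow method for this pair of variables, one could repeat that check on the two columns. The following is a minimal sketch on synthetic two-dimensional data (the array `X_two_vars`, the cluster centres, and the range of candidate values are assumptions for illustration). With the lab's data, `mobility_trends_UK_standardised[:, [0,4]]` would take the place of `X_two_vars`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in: four groups of points in two dimensions (assumption)
rng = np.random.default_rng(0)
centres = np.array([[-2, -2], [-2, 2], [2, -2], [2, 2]])
X_two_vars = np.vstack([c + 0.5 * rng.normal(size=(35, 2)) for c in centres])

# Fit k-means for a range of candidate k values and record the inertia
# (within-cluster sum of squared distances) of each fit
k_values = range(1, 9)
inertias = []
for k_candidate in k_values:
    model = KMeans(n_clusters=k_candidate, init='k-means++',
                   n_init=10, max_iter=300, random_state=0)
    model.fit(X_two_vars)
    inertias.append(model.inertia_)

# Print inertia per k; the "elbow" is where the decrease levels off
for k_candidate, inertia in zip(k_values, inertias):
    print(k_candidate, round(inertia, 1))
```

The loop variable is named `k_candidate` so that it does not overwrite the `k` used in the plotting code.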

Plot the resulting clusters along the two mobility variables — retail and recreation mobility and workplaces mobility — using a scatter plot.

# Plot the clusters 
plt.figure(figsize=(11.7,8.27))

grid = sns.scatterplot(x = mobility_trends_UK_standardised[:, 0], y = mobility_trends_UK_standardised[:, 4], hue = kmeans_k4_2vars.labels_, alpha = 0.8, s = 120)

# Plot the centers
sns.scatterplot(x = kmeans_k4_2vars.cluster_centers_[:, 0], y = kmeans_k4_2vars.cluster_centers_[:, 1],
                     hue = range(k), s = 220, alpha = 0.8, ec = 'black', legend = False)
grid.set(xlabel = 'Retail and Recreation Mean Change Mobility', ylabel = 'Workplaces Mean Change Mobility');

# Add title 'Cluster' to the legend and locate it in the upper right of the plot
plt.legend(title = 'Cluster', loc = 'upper right');
_images/pattern_discovery_unsupervised_learning_60_1.png

Let’s add UK county labels to the figure, as we did before.

# Enlarge figure size
plt.figure(figsize=(28,22))

# Scatter plot with retail and recreation mobility on the horizontal (x) axis and workplaces mobility on the vertical (y) axis
grid = sns.scatterplot(x = mobility_trends_UK_standardised[:,0], y = mobility_trends_UK_standardised[:,4], hue=kmeans_k4_2vars.labels_, alpha=0.9, s=120)
grid.set(xlabel='Retail and Recreation Mean Change Mobility', ylabel='Workplaces Mean Change Mobility')

# Plot the cluster centroids
sns.scatterplot(x = kmeans_k4_2vars.cluster_centers_[:, 0], y = kmeans_k4_2vars.cluster_centers_[:, 1],
                     hue=range(k), s=240, alpha=0.8, ec='black', legend=False)

# This for loop adds the county name next to each data point
for line in range(0,mobility_trends_UK_mean_NaNdrop.shape[0]):
     grid.text(mobility_trends_UK_standardised[line,0]+0.1, mobility_trends_UK_standardised[line,4], # where the labels should be positioned
     mobility_trends_UK_mean_NaNdrop.index[line], # add labels to each data point iteratively 
     horizontalalignment='left', size='small', color='black', weight=None)

# Add title 'Cluster' to the legend and locate it in the upper right of the plot
plt.legend(title='Cluster', loc='upper right');
_images/pattern_discovery_unsupervised_learning_62_0.png

Hands-on exercise

You would like to know whether mobility trends in the UK over the last year of the pandemic were similar to the mobility trends in some other countries, and to which countries in particular.

To find out, use k-means clustering to group the countries in the COVID-19 Community Mobility Reports data set according to their mobility trends across the six mobility categories.

Write your Python code and Markdown below.

Below is a solution to the hands-on exercise.

# Compute mean mobility trends by country and remove NaN (Not a Number) values 
mobility_trends_countries = mobility_trends.groupby('country_region')[['Retail_Recreation',
                                          'Grocery_Pharmacy',
                                          'Parks',
                                          'Transit_stations',
                                          'Workplaces',
                                          'Residential']].mean().dropna()
mobility_trends_countries.head()
Retail_Recreation Grocery_Pharmacy Parks Transit_stations Workplaces Residential
country_region
Afghanistan 11.746193 33.555838 4.195332 -5.002395 -8.276850 4.758950
Angola -14.925453 -3.229062 3.151481 -28.208609 -12.878265 8.627155
Antigua and Barbuda -20.449886 -9.677677 29.333713 -44.746019 -34.656949 5.708029
Argentina -42.767622 -9.831666 -60.221476 -47.383363 -13.202182 10.534182
Aruba -22.455581 -7.694761 8.833713 -46.858770 -22.462882 6.382353
# Data standardisation
scaler = StandardScaler()
StandardisedData = scaler.fit_transform(mobility_trends_countries)
StandardisedData
array([[ 2.58348699e+00,  2.73623230e+00,  3.65447602e-01,
         1.45767636e+00,  1.38114460e+00, -7.27286346e-01],
       [ 4.58024480e-01, -7.73121219e-02,  3.23107916e-01,
        -1.43228225e-01,  8.55192811e-01,  1.46324674e-01],
       [ 1.77825654e-02, -5.70543451e-01,  1.38508654e+00,
        -1.28407848e+00, -1.63415835e+00, -5.12942461e-01],
       [-1.76071706e+00, -5.82321548e-01, -2.24736523e+00,
        -1.46601838e+00,  8.18168328e-01,  5.77015282e-01],
       [-1.42051181e-01, -4.18877371e-01,  5.53585180e-01,
        -1.42982877e+00, -2.40349874e-01, -3.60650478e-01],
       [ 8.05551261e-01,  2.60725314e-01, -2.53305884e-01,
        -2.89220784e-01,  9.96224166e-01, -2.54929027e-01],
       [-1.00958319e+00, -4.83914715e-01,  1.32606199e-01,
        -4.59918978e-02, -8.30918839e-01, -6.35783062e-03],
       [-2.86250599e-01, -4.92500636e-01, -9.64992505e-01,
         4.19036867e-01,  1.79113835e-01,  5.59560817e-01],
       [ 1.97900291e-01,  5.40395414e-01,  9.37905564e-02,
         1.07395812e+00,  1.38840137e+00,  5.38144574e-01],
       [-8.16645587e-01, -1.54619333e+00, -1.08125352e-02,
        -1.60784307e+00, -1.83718943e+00,  3.46375919e-01],
       [ 3.99873134e-01,  2.46153561e-01,  6.94513574e-01,
         1.11707439e+00,  4.17818177e-01, -1.88529801e+00],
       [-7.50083615e-01, -1.62641673e-01,  1.62845457e+00,
        -2.54256591e-01, -9.84176768e-01,  7.76191677e-01],
       [-1.30871735e-01,  2.96726807e-02,  2.46593378e-02,
        -1.12220419e+00, -1.23969657e+00,  5.48218828e-02],
       [ 9.21453886e-01,  2.44346542e+00,  5.75038904e-01,
         1.15549735e+00,  1.32720301e+00, -9.38722072e-01],
       [-1.22124091e+00, -1.29144589e+00, -1.09056803e+00,
        -1.09432633e+00, -1.71233385e-01,  1.42501319e+00],
       [ 2.31390686e-01,  2.93792718e-01,  4.52283731e-01,
         5.09135001e-01,  4.05993814e-01, -1.80888096e+00],
       [ 1.22352579e+00,  1.64306280e+00,  1.02733006e+00,
         1.09120995e+00,  6.59546937e-01,  4.19423680e-02],
       [-4.20294357e-01,  8.08872176e-01, -1.09844863e+00,
        -1.06170135e-01,  1.65745177e+00,  7.06534994e-02],
       [ 4.03272812e-02,  2.74887053e-01,  5.72287417e-01,
         4.98930020e-01, -1.33105527e-01, -1.20850454e+00],
       [ 2.17018738e+00,  2.77183086e+00,  8.28257824e-01,
         2.00263930e+00,  1.72057271e+00, -1.39634372e+00],
       [ 1.80558665e-01, -9.13762249e-01, -6.35533795e-01,
        -9.75180158e-01, -2.09961302e-01,  5.92651644e-01],
       [ 1.13589264e+00,  6.17959346e-01, -4.75502653e-01,
         2.18634216e+00,  9.16834570e-01, -1.15157338e+00],
       [ 3.82143987e-01,  7.67407785e-02,  1.52227355e+00,
        -7.74562073e-01, -4.05074386e-01,  2.92593178e-01],
       [-1.46371720e+00, -1.09830212e+00, -1.14775551e+00,
        -2.47544909e+00, -1.72505036e+00,  2.53221909e-01],
       [-1.88017640e+00, -1.80154651e+00, -1.73176694e+00,
        -1.21206877e+00, -3.32503809e-01,  1.99010077e+00],
       [-1.28836480e+00, -9.53942477e-01, -1.13409694e+00,
        -8.14138566e-01,  3.82158765e-01,  1.07514004e+00],
       [-1.19736109e+00, -1.13768977e+00, -1.60846197e+00,
        -1.08795106e+00, -5.79616170e-01,  8.56954555e-01],
       [ 3.03589291e-01,  3.42904473e-01,  1.75697246e+00,
         3.56216952e-01, -1.22846299e-01, -8.79173252e-01],
       [-2.58766782e-01,  5.43885222e-01,  1.06588042e+00,
         8.45549006e-01,  1.61014157e-01, -1.09606840e-01],
       [ 1.49197726e+00,  1.47982677e+00, -1.11425398e-01,
         1.70840040e+00,  1.22072908e+00, -8.05068641e-01],
       [ 1.24272181e+00, -7.18619505e-02,  2.63048092e+00,
         2.44829181e-02, -6.18509857e-01, -8.97080013e-02],
       [-8.85531651e-01, -7.35722034e-01, -1.16363581e+00,
        -7.93292443e-01, -1.45473557e+00,  1.73668665e-01],
       [-3.57166103e-01, -7.05742626e-01, -9.70912643e-01,
        -5.47841679e-01, -1.20968354e+00,  1.49194346e+00],
       [-6.49376201e-02,  2.74713006e+00, -9.00769951e-02,
         1.85514480e-01,  8.66795787e-01, -9.37553579e-01],
       [-6.18067847e-01, -7.36275838e-01, -1.10634518e+00,
        -5.95230334e-01, -1.09582331e+00,  2.47245825e-01],
       [ 6.09219922e-01,  4.11143937e-01,  1.30750349e+00,
         8.77611307e-01, -3.20937069e-01, -5.63972921e-01],
       [ 1.05504136e+00,  5.22180511e-02, -3.73371061e-01,
         2.42490365e-02,  8.65824658e-01,  2.46104897e-01],
       [ 7.14774331e-01,  2.96188577e-01,  2.11332758e+00,
        -4.73793547e-01, -8.40067684e-02, -5.01973687e-01],
       [-7.15089514e-01, -7.51263647e-02,  1.25500655e+00,
         3.37692048e-01, -8.09030973e-01,  2.77918150e-01],
       [ 2.35292820e-01, -7.86900528e-01,  6.52207551e-01,
         1.10316854e+00,  3.68578968e-01, -2.12882806e-03],
       [-8.48536317e-01, -2.98926112e-01, -1.02538147e-01,
         4.05598828e-01, -9.06672238e-01, -3.36506126e-01],
       [-5.39777436e-01,  1.40176368e-02,  2.30630275e+00,
        -7.31583304e-02, -1.55083837e-01, -9.85591841e-02],
       [ 1.30783495e+00,  1.01964424e+00, -3.56284316e-01,
         1.41368459e+00,  8.92927680e-01,  7.14666334e-01],
       [-4.01549063e-01,  1.04895680e+00,  2.12159649e+00,
         3.10792580e-01, -3.49749308e-01, -4.45299038e-01],
       [-5.07604982e-01, -1.15883466e+00, -1.10724548e+00,
        -8.29118079e-01, -5.99723809e-01,  5.09544300e-01],
       [ 2.75457534e-01, -1.09230240e+00, -6.34803408e-01,
         3.69920205e-01, -6.05804862e-01, -6.37904715e-01],
       [-1.45028537e+00, -1.15961542e+00, -1.13589801e+00,
        -1.03725507e+00, -1.16226024e+00, -2.74646424e-01],
       [-4.27153167e-02,  1.32207321e+00, -5.93911420e-01,
         2.77942595e-01,  5.94198269e-01,  6.62748169e-01],
       [ 8.30596183e-01,  2.55471906e-01,  1.52505456e+00,
         9.71523653e-01,  7.49953478e-04, -4.00187877e-01],
       [-9.87972354e-01,  8.58965991e-01, -6.56194549e-01,
         2.56985624e-01,  3.84353031e-01,  1.03283955e+00],
       [ 2.99314347e-01,  1.55967754e-01, -3.51797861e-01,
        -3.80908002e-01, -1.16361530e-01, -2.87423779e-01],
       [ 9.52722270e-01,  1.82368551e+00, -4.49386971e-02,
         1.28091621e+00,  5.13398451e-01, -4.45055671e-01],
       [-1.22166610e+00,  1.67663809e-01,  2.07379005e-01,
        -7.67089998e-01, -1.39915880e+00,  8.95061807e-01],
       [-7.63643853e-01,  4.10543735e-02,  3.19826836e-01,
         4.85544843e-02, -7.96136031e-01,  5.60359199e-01],
       [-7.81741936e-01, -6.21036811e-01,  6.01325464e-01,
        -4.31917973e-01, -7.80421261e-01,  4.30808318e-01],
       [-8.04930775e-01, -1.16588518e+00, -1.04764375e+00,
        -3.76792898e-01, -9.72319926e-01,  2.88272390e-01],
       [ 9.01169962e-01,  2.02941082e-01,  1.84680595e-01,
         2.53308994e-01,  9.38618787e-01, -4.77433542e-01],
       [ 7.13456878e-03,  1.36172224e-01, -8.23514971e-01,
        -1.69933023e+00, -7.94092736e-01,  4.08221521e-03],
       [-1.86809888e-01, -2.98647180e-01, -3.14766271e-02,
         2.83548072e+00, -6.95642739e-01, -1.07878177e+00],
       [ 1.25434936e+00,  3.53511339e-03, -3.24798554e-01,
         1.36702906e+00,  1.54730991e+00,  6.85456099e-01],
       [-1.15880591e+00, -1.57483260e+00, -1.14917975e+00,
         2.57138778e-01, -1.23750407e+00,  1.47975612e+00],
       [-1.11412800e-01, -1.02304525e+00, -3.95523533e-01,
         5.29068511e-01, -8.72150624e-01, -1.76811249e+00],
       [ 3.18809204e-01, -2.77712325e-01, -6.12356204e-01,
        -3.44121090e-01,  9.08775612e-01, -5.19525789e-01],
       [ 2.96286583e-01,  7.42665095e-02,  1.08351311e+00,
         3.75920169e-02, -1.02450600e+00, -2.79608262e-01],
       [ 1.36247739e-01,  1.07129656e+00,  2.35637147e-01,
        -1.19655422e+00, -1.02270501e+00, -1.25229737e+00],
       [ 2.80395597e+00,  2.93104002e+00,  2.06874969e+00,
         2.35914077e-01,  1.47277025e+00, -1.67271145e+00],
       [-7.80439702e-01,  1.44179413e+00,  1.48276036e+00,
         4.79363804e-01, -3.22662856e-01, -6.52667402e-01],
       [-1.09362674e+00, -6.04136780e-01,  1.09122687e+00,
        -1.84144865e-01, -1.32872766e+00,  8.75798067e-01],
       [-7.98393323e-01, -1.85285326e-01, -7.93737089e-01,
        -1.10961262e+00, -2.91825138e-01,  1.22707052e+00],
       [ 1.16497191e+00,  3.63026095e-01,  1.03639093e+00,
         3.83362689e-01,  1.38409040e+00, -1.27104122e+00],
       [-1.92477493e-01, -2.62711760e-01,  3.40956284e-01,
         3.45887887e-01, -4.98077329e-01,  4.15974643e-01],
       [-2.51397953e-01, -1.05633529e+00, -1.59033496e+00,
        -1.02774769e+00, -5.25017088e-01, -7.83110421e-01],
       [-8.04495596e-01, -2.98841992e-01, -1.15703230e+00,
        -3.49206978e-01, -1.93361087e-01,  5.76611214e-01],
       [ 1.12128066e-02, -4.23761783e-01, -9.94863053e-02,
        -4.39426532e-02, -4.54721328e-01, -1.60980725e+00],
       [ 1.17542349e+00,  2.59170738e+00,  5.97329765e-01,
         2.15119580e+00,  8.75149827e-01, -2.88084345e-01],
       [-1.21964508e+00, -2.03949405e-02, -1.32230556e+00,
        -4.03182944e-01, -6.20077665e-01,  1.08876105e+00],
       [ 4.76034206e-01, -1.18280175e-01, -5.96007976e-01,
        -4.52834820e-02,  1.47296925e+00, -1.02805681e+00],
       [-1.80493983e+00, -1.91766421e+00, -1.19406121e+00,
        -1.07377599e+00, -1.70358763e+00,  2.28681326e+00],
       [ 3.30174370e-01,  2.62384439e-02, -6.00037611e-01,
        -8.46197526e-01,  6.19645087e-01, -9.46419614e-01],
       [-9.01099903e-01, -5.51349006e-01, -3.77032228e-01,
         5.25340771e-01, -4.52996942e-01,  5.35710910e-01],
       [-7.49279589e-02,  1.06675629e-01,  1.41135463e+00,
        -8.44203342e-01, -7.52685120e-01,  3.91944101e-01],
       [ 7.59554782e-01, -3.30102038e-01, -6.75492659e-01,
        -3.83736044e-01,  1.50750911e+00, -6.12761050e-01],
       [ 3.00540764e-01,  1.31928251e-01, -5.03272899e-01,
         8.40438193e-01,  9.44707092e-01, -1.02547010e+00],
       [ 2.29287965e+00,  2.60174168e-01,  3.15334328e-01,
         2.24055562e+00,  4.39318795e-01, -1.51947714e+00],
       [ 6.47846256e-01, -8.60774630e-02, -4.96214557e-01,
         1.36666666e+00,  7.21594263e-01,  4.04528981e-01],
       [-2.27857090e-01, -6.20998081e-02,  8.89888421e-01,
        -4.21433020e-01, -5.24435806e-01, -8.16919526e-01],
       [ 8.65858487e-01,  1.72164532e-01,  1.69361667e+00,
        -3.65758332e-01, -6.03076550e-01, -1.52131528e-01],
       [-5.36428726e-01, -1.22258152e+00, -1.22097737e+00,
        -1.00923024e+00, -5.14466803e-01, -5.87551461e-01],
       [ 9.03231980e-01,  6.94938130e-01,  1.86178146e-01,
         1.39324205e+00,  9.68942372e-01, -7.64941587e-01],
       [-2.44440600e+00, -2.11388039e+00, -2.05231690e+00,
        -1.66437098e+00, -2.59727140e+00,  3.22685653e+00],
       [ 2.41728555e+00,  2.30030989e+00, -4.84648142e-01,
         1.57994215e+00,  3.15577262e+00, -1.11930678e+00],
       [-5.14971803e-01, -8.81464070e-01, -1.28647852e+00,
        -1.17121661e+00,  6.44726755e-01,  7.79766914e-01],
       [-1.88582084e+00, -1.97190552e+00, -1.30435854e+00,
        -1.74743129e+00, -7.06236325e-01,  2.29046635e+00],
       [-1.61541113e+00, -1.23655999e+00, -8.77622111e-01,
        -1.54067934e+00, -1.29874531e+00,  2.54064246e+00],
       [ 6.10826029e-01,  1.55326522e-01,  8.31920859e-01,
         2.75871009e-01,  3.17425379e-01, -4.35108672e-01],
       [-8.20590431e-01, -1.62166106e-01, -5.56063699e-02,
        -5.71533085e-01, -6.60473353e-01,  1.02259664e+00],
       [-3.01859581e-01, -1.11915228e+00, -1.32451391e+00,
        -1.68162012e+00, -5.59240859e-01,  4.91490598e-01],
       [ 2.21171821e-01,  8.37274728e-01, -1.32922940e-01,
         3.36298234e-01,  8.95965885e-01,  1.67497551e-01],
       [-9.23081815e-02, -6.08341272e-02, -1.31189880e-01,
         2.53243978e-01, -2.84842381e-01, -8.72946582e-01],
       [ 3.74666940e-01, -8.59145384e-02,  6.47605022e-01,
         1.36775220e+00, -2.11742247e-01, -1.13988105e+00],
       [ 7.37988037e-01, -1.52723670e+00,  1.10993993e-01,
        -1.15822447e-01, -2.46040105e-01,  2.44066587e+00],
       [ 2.76828477e-01,  3.13866029e-02, -5.74894343e-01,
        -8.37016071e-01, -2.70908699e-01,  6.12282496e-02],
       [-1.33681900e-01,  7.56238110e-02, -5.65383408e-01,
        -6.14078770e-02,  5.46239320e-01, -8.22735727e-01],
       [-1.46446799e-01, -4.94782306e-02,  2.82259327e-01,
         2.88701259e-01, -7.70000655e-01, -6.27839518e-01],
       [-3.23519398e-01, -1.43818247e-03, -7.50123881e-01,
        -4.11174322e-01, -2.81543668e-01,  2.30109076e+00],
       [-7.32349880e-01, -3.00985177e-01,  6.41118453e-01,
        -2.63413293e-01, -4.80085539e-01,  8.76047508e-02],
       [-1.01489316e+00, -1.17222837e+00, -1.59166037e-01,
        -5.17662857e-01, -8.59605760e-01,  2.92003066e-01],
       [-4.37399527e-02, -2.61506964e-01, -7.17133606e-01,
        -2.99219548e-01, -8.85208756e-02,  1.13760468e+00],
       [ 6.19364031e-01,  8.58209583e-01,  1.37823812e+00,
         1.11447255e+00,  1.31696250e+00, -6.57027803e-01],
       [-1.13117052e+00, -3.88743688e-01,  2.69166933e-01,
        -2.78314555e-01, -5.12072645e-01,  1.40380773e-01],
       [-8.38241007e-01, -9.39055918e-01, -6.41378329e-01,
        -7.33806574e-02, -9.03308127e-01,  1.79148152e+00],
       [ 9.03807395e-01, -4.90086843e-02,  1.68513745e+00,
        -1.69989196e-01, -4.51278554e-01, -4.28051121e-02],
       [-6.18749600e-01,  1.18549547e-01,  1.19498526e+00,
         2.83995868e-01, -2.16196829e-01,  1.08350283e-01],
       [ 8.93813524e-01,  3.48520055e-01, -4.04804791e-02,
         8.25667885e-01,  2.03061212e+00, -1.38639690e+00],
       [ 7.68696843e-01, -3.74721474e-01, -5.65998653e-01,
        -1.21254583e-01,  1.12346172e+00, -1.23623983e+00],
       [ 9.49566324e-01, -6.75953785e-02, -6.00097714e-02,
         1.42760858e+00,  3.08278138e-01, -7.00248646e-01],
       [ 5.65779399e-01,  3.84949998e-01, -7.81069159e-01,
        -5.86464256e-01,  6.75737111e-01, -6.13951951e-01],
       [-1.14043117e+00, -1.30144729e+00, -2.59038547e-02,
        -2.01630426e+00, -1.49040506e+00, -3.06595803e-01],
       [ 8.90597781e-01,  1.20849750e+00,  1.62599559e+00,
         1.17828317e+00,  1.12253709e+00,  3.02322535e-01],
       [-2.63580178e-01,  2.91484939e-01, -6.43039234e-01,
        -9.74659352e-01, -7.19784148e-01,  1.19554371e-01],
       [-8.15766477e-01,  5.92221993e-01,  7.23795611e-02,
        -8.63694327e-02, -1.27145444e-01,  7.94737106e-02],
       [-4.57552445e-01, -1.08479778e+00, -3.42025153e-01,
        -3.23721744e-01,  1.52142319e+00,  7.34730560e-01],
       [ 8.47199300e-02,  7.66142081e-02,  7.13133191e-01,
         5.41695218e-01, -4.76522499e-01, -1.08194553e+00],
       [ 2.81170369e-01,  4.14886792e-04, -9.97927358e-01,
        -7.65128058e-01,  2.60118001e-01,  6.36808351e-01],
       [-1.49814824e+00, -6.16805378e-01,  1.17677650e+00,
        -9.22114942e-01, -1.87528059e+00,  1.20791303e+00],
       [ 9.53037917e-01,  2.09910220e-01,  1.11888904e+00,
         1.00681388e+00, -1.89381932e-01, -1.57451542e-01],
       [-5.20253294e-01, -2.81975778e-01, -1.49538139e+00,
        -8.30919034e-01,  1.37431237e+00, -2.74707584e-01],
       [-9.09171947e-01, -4.65631492e-01, -9.26616864e-01,
        -7.14772325e-01, -1.46302508e-01,  1.17576454e+00],
       [ 2.67096110e-01,  2.70679796e-02, -6.12715173e-01,
         7.61909786e-01,  2.02498859e+00, -2.61164555e+00],
       [ 3.07169779e+00,  2.46079742e+00,  9.85065968e-01,
         2.36613146e+00,  2.18654592e+00, -1.18780992e+00],
       [ 1.85842277e+00,  9.20189940e-01,  6.51523004e-01,
         1.31407114e+00,  1.28801744e+00,  1.34156354e-01],
       [ 5.64302432e-01,  5.37779941e-01, -2.87012507e-01,
         1.07989408e-01,  1.16863594e+00,  1.07696738e+00]])
# Run PCA with two components
pca_model = PCA(n_components=2)
# Project the standardised data onto the first two principal components
pca_countries = pca_model.fit_transform(StandardisedData)
pca_countries
array([[-4.02538133, -0.6174681 ],
       [-0.49737301, -0.26349688],
       [ 0.86989898,  2.15270562],
       [ 2.26224458, -2.17793824],
       [ 0.67771231,  0.68974427],
       [-0.7975852 , -0.78321737],
       [ 1.00966412,  0.64593033],
       [ 0.59147106, -0.94916286],
       [-1.17873926, -0.82027452],
       [ 2.64943172,  1.09997072],
       [-1.86207273,  0.46481139],
       [ 0.76561012,  1.85678308],
       [ 1.04176943,  0.76986901],
       [-3.09072418, -0.26853696],
       [ 2.55478065, -0.82928347],
       [-1.43854716,  0.30278396],
       [-2.33422296,  0.34383059],
       [-0.44354731, -1.79791041],
       [-0.92632206,  0.65286561],
       [-4.60065143, -0.3526588 ],
       [ 1.23761254, -0.44446317],
       [-2.43196647, -0.94053599],
       [-0.0396591 ,  1.44075467],
       [ 3.37081588,  0.20593048],
       [ 3.60946234, -1.27579945],
       [ 1.97189645, -1.14292138],
       [ 2.56498225, -0.95778855],
       [-1.234834  ,  1.56269779],
       [-0.89610842,  0.77653433],
       [-2.87729919, -0.85188668],
       [-1.10003758,  2.4115453 ],
       [ 2.07137863, -0.05601131],
       [ 2.04832559, -0.22876103],
       [-1.96817794, -0.43899588],
       [ 1.72760907, -0.24700733],
       [-1.31072762,  1.2328729 ],
       [-0.68136237, -0.89395176],
       [-1.02342042,  1.7938902 ],
       [ 0.30038644,  1.48305134],
       [-0.56475851,  0.25430994],
       [ 0.62573214,  0.50595649],
       [-0.34634697,  2.01494926],
       [-1.67805167, -1.01188128],
       [-1.04807719,  1.99699759],
       [ 1.86952295, -0.55904891],
       [ 0.38278889, -0.14758368],
       [ 2.34260965, -0.1044547 ],
       [-0.51580031, -0.8883035 ],
       [-1.51366877,  1.18940481],
       [ 0.39341819, -0.79872181],
       [-0.01114925, -0.1927295 ],
       [-2.19241252, -0.37956503],
       [ 1.67429827,  0.97889476],
       [ 0.76238022,  0.70019298],
       [ 1.14337993,  0.94926541],
       [ 1.86610563, -0.27334771],
       [-1.23672511, -0.39906198],
       [ 1.22067736, -0.15171965],
       [-1.12490964,  0.3816932 ],
       [-1.45331585, -1.35663792],
       [ 2.53132539, -0.34152059],
       [ 0.07635394,  0.34267495],
       [-0.26282846, -0.96566037],
       [-0.19486024,  1.47001733],
       [-0.16614612,  0.95603256],
       [-4.56684638,  0.85584962],
       [-1.02748371,  1.5077458 ],
       [ 1.42536124,  1.62073054],
       [ 1.74890236, -0.51669169],
       [-2.21006668,  0.1067787 ],
       [ 0.32052248,  0.51034537],
       [ 1.40413598, -0.86118489],
       [ 1.29136442, -0.82787035],
       [-0.1954776 ,  0.35009491],
       [-3.28844573, -0.12271942],
       [ 1.79683488, -0.74375072],
       [-0.96113447, -1.23718232],
       [ 4.07325275, -0.10745501],
       [-0.23883241, -0.72303115],
       [ 0.94071791, -0.07306015],
       [ 0.40157811,  1.57096199],
       [-0.68742445, -1.37478027],
       [-1.18871687, -0.8847392 ],
       [-3.01593642, -0.07239287],
       [-0.85347087, -0.9570515 ],
       [-0.03392099,  1.14234796],
       [-0.62753363,  1.69170428],
       [ 1.57220806, -0.57066775],
       [-2.0747674 , -0.42531514],
       [ 5.67649105, -0.33687563],
       [-4.41732558, -2.2668526 ],
       [ 1.55261452, -1.42141931],
       [ 4.06261882, -0.72921788],
       [ 3.72219111, -0.09500059],
       [-1.006526  ,  0.49283381],
       [ 1.3769067 ,  0.29459127],
       [ 2.16091242, -0.73608418],
       [-0.88537447, -0.65462671],
       [-0.21843873,  0.14576936],
       [-1.26142102,  0.68951695],
       [ 1.38276472, -0.09851113],
       [ 0.51000828, -0.30615458],
       [-0.31383109, -0.6660897 ],
       [-0.04389407,  0.73195881],
       [ 1.52889794, -0.66440569],
       [ 0.6411404 ,  0.84207179],
       [ 1.73475785,  0.40537912],
       [ 0.93795758, -0.64529472],
       [-2.32831808,  0.36806134],
       [ 1.01472264,  0.57926836],
       [ 2.07553809, -0.15481523],
       [-0.64673949,  1.57443292],
       [-0.09337739,  1.11480861],
       [-2.26266843, -1.12136283],
       [-0.89936158, -1.01176713],
       [-1.4090372 , -0.2704694 ],
       [-0.470497  , -0.96210887],
       [ 2.4875684 ,  1.00438601],
       [-2.27219104,  0.55602964],
       [ 0.92875377, -0.06879961],
       [ 0.21709137,  0.18736401],
       [ 0.61731623, -1.18414684],
       [-0.73116416,  0.94117159],
       [ 0.61715546, -1.01720835],
       [ 2.26182563,  2.02073622],
       [-1.28376206,  0.92959826],
       [ 0.50522587, -1.89626584],
       [ 1.71784796, -0.71225838],
       [-2.09230959, -1.40606926],
       [-5.1962351 , -0.59145889],
       [-2.51413282, -0.38685826],
       [-0.53435688, -1.0511773 ]])
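Two components will not usually capture all of the variation in six mobility categories, so it can help to check how much variance they retain before clustering on them. The sketch below is a minimal illustration on synthetic data: the array `X`, its random seed and the use of `StandardScaler` are assumptions made for the example, not the lab's `StandardisedData`. It fits the same `PCA(n_components=2)` model and reads the fitted model's `explained_variance_ratio_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardised mobility data (assumption):
# 132 rows and 6 correlated columns built from 2 underlying factors plus noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(132, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4))])
X += 0.3 * rng.normal(size=X.shape)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Share of the total variance captured by each retained component
variance_ratio = pca.explained_variance_ratio_
print(variance_ratio, variance_ratio.sum())
```

If the two ratios sum to a small fraction of the total, clusters found in the reduced space may miss structure that is present in the original six variables.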
# Select optimal number of clusters, k

# Inertia (within-cluster sum of squared distances) for each candidate k
Sum_of_squared_differences_countries = []

K = range(1, 31)
for k in K:
    kmeans_countries = KMeans(n_clusters=k)
    kmeans_countries.fit(pca_countries)
    Sum_of_squared_differences_countries.append(kmeans_countries.inertia_)
Sum_of_squared_differences_countries
[598.6492960967477,
 300.5368638807781,
 209.5233130281315,
 152.4288049171476,
 124.46755752268106,
 99.65003977629307,
 81.61769565723917,
 70.98658102067833,
 61.841001597457954,
 56.36763744841132,
 50.4524816636799,
 47.20788466618494,
 44.143520206896966,
 38.8064957059904,
 37.02293693432438,
 35.03486858583522,
 30.466760140243668,
 28.0902069358879,
 26.90592024311007,
 24.261911497852388,
 23.051504550872703,
 21.408901487619993,
 19.978304038447277,
 18.44200538527873,
 17.715525487223328,
 17.02732339300026,
 15.814699101184294,
 14.999695092238197,
 13.72947606719006,
 13.635915245698124]
# Plot the number of clusters against the within-cluster sum of squared distances (inertia)

# Set the figure size and font scale
plt.figure(figsize=(11.7,8.27))
sns.set(font_scale=1.5)

# Generate the plot
grid = sns.lineplot(x = K, y = Sum_of_squared_differences_countries)

# Add x and y labels
labels = grid.set(xlabel='Number of clusters, k', ylabel='Total squared distances')
[Figure: line plot of total squared distances against the number of clusters, k]
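Reading an elbow off a plot is a judgment call, so it can be useful to look at the numbers as well. The sketch below is one possible way to do that, not a rule from the lab: it takes the first six inertia values printed above (rounded to three decimals) and computes the relative fall in inertia each time one more cluster is added. The names `inertia` and `relative_drops` are introduced here for illustration.

```python
# First six inertia values copied from the output above (rounded to 3 decimals)
inertia = [598.649, 300.537, 209.523, 152.429, 124.468, 99.650]

# Relative decrease in inertia when moving from k to k + 1 clusters
relative_drops = []
for k in range(1, len(inertia)):
    drop = (inertia[k - 1] - inertia[k]) / inertia[k - 1]
    relative_drops.append(drop)
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.1%}")
```

On these values the fall is roughly half when moving from one to two clusters and stays below 20% from k = 4 onwards, which is broadly in line with choosing a small k such as 4.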
# k = 4 appears optimal, so we specify n_clusters=4 and run the KMeans algorithm
kmeans_countries_k4 = KMeans(n_clusters=4)
kmeans_countries_k4.fit(pca_countries)
KMeans(n_clusters=4)
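The call above does not fix a random seed, so the centroid initialisation, and therefore the numbering of the clusters, can change from one run to the next. The sketch below shows one way to make a fit repeatable by passing `random_state`. It runs on synthetic points (the array `points` and the seed values are assumptions for the example, not the lab's PCA scores), and `n_init=10` is set explicitly only to keep the example independent of library defaults.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic two-dimensional points standing in for the PCA scores (assumption)
rng = np.random.default_rng(42)
points = rng.normal(size=(132, 2))

# Fixing random_state makes the centroid initialisation repeatable
first = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)
second = KMeans(n_clusters=4, n_init=10, random_state=0).fit(points)

# Compare the cluster assignments from the two fits
same_labels = np.array_equal(first.labels_, second.labels_)
print(same_labels)
```

Without a fixed seed, the cluster that holds the United Kingdom may carry a different label number on another run, even if its membership is much the same.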
# Label of the cluster each country belongs to
# (label numbering can differ between runs because no random_state is set)
kmeans_countries_k4.labels_
array([1, 0, 3, 2, 3, 0, 3, 0, 0, 2, 1, 3, 3, 1, 2, 3, 1, 0, 3, 1, 2, 1,
       3, 2, 2, 2, 2, 3, 3, 1, 3, 2, 2, 1, 2, 3, 0, 3, 3, 3, 3, 3, 0, 3,
       2, 0, 2, 0, 3, 0, 0, 1, 2, 3, 3, 2, 0, 2, 3, 0, 2, 3, 0, 3, 3, 1,
       3, 3, 2, 1, 3, 2, 2, 3, 1, 2, 0, 2, 0, 2, 3, 0, 0, 1, 0, 3, 3, 2,
       1, 2, 1, 2, 2, 2, 3, 2, 2, 0, 3, 3, 2, 0, 0, 3, 2, 3, 2, 2, 1, 3,
       2, 3, 3, 1, 0, 0, 0, 2, 1, 2, 3, 0, 3, 0, 2, 3, 0, 2, 1, 1, 1, 0],
      dtype=int32)
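A quick check after fitting is how many observations fall in each cluster, since a very small cluster may simply be a handful of outliers. The sketch below is a minimal illustration using only the first 22 labels from the array above. In the lab you would pass `kmeans_countries_k4.labels_` instead of the hard-coded list; the names `labels` and `cluster_sizes` are introduced here.

```python
from collections import Counter

# First 22 labels copied from the array above (illustrative subset only)
labels = [1, 0, 3, 2, 3, 0, 3, 0, 0, 2, 1, 3, 3, 1, 2, 3, 1, 0, 3, 1, 2, 1]

# Count how many observations were assigned to each cluster
cluster_sizes = Counter(labels)
for cluster, size in sorted(cluster_sizes.items()):
    print(f"cluster {cluster}: {size} countries")
```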
# Plot the clusters along the two principal components

sns.set(font_scale=1.3)
plt.figure(figsize=(20.7,16.27))

grid = sns.scatterplot(x = pca_countries[:,0], y = pca_countries[:,1], hue=kmeans_countries_k4.labels_)

# Annotate each point with its country name
for i in range(mobility_trends_countries.shape[0]):
    grid.text(pca_countries[i, 0], pca_countries[i, 1],
              mobility_trends_countries.index[i])
[Figure: scatter plot of countries along the two principal components, coloured by k-means cluster and labelled with country names]
# Add the cluster membership as a new column
mobility_trends_countries['clusters_countries_k4'] = kmeans_countries_k4.labels_
mobility_trends_countries
                       Retail_Recreation  Grocery_Pharmacy       Parks  Transit_stations  Workplaces  Residential  clusters_countries_k4
country_region
Afghanistan                    11.746193         33.555838    4.195332         -5.002395   -8.276850     4.758950                      1
Angola                        -14.925453         -3.229062    3.151481        -28.208609  -12.878265     8.627155                      0
Antigua and Barbuda           -20.449886         -9.677677   29.333713        -44.746019  -34.656949     5.708029                      3
Argentina                     -42.767622         -9.831666  -60.221476        -47.383363  -13.202182    10.534182                      2
Aruba                         -22.455581         -7.694761    8.833713        -46.858770  -22.462882     6.382353                      3
...                                  ...               ...         ...               ...         ...          ...                    ...
Venezuela                     -32.081897         -8.306034  -27.659483        -36.493534  -21.640086    13.185345                      2
Vietnam                       -17.321342         -1.864373  -19.920488        -15.088011   -2.644026    -3.584681                      1
Yemen                          17.872570         29.954741   19.471526          8.166287   -1.230603     2.719828                      1
Zambia                          2.647629          9.812500   11.248292         -7.084052   -9.091595     8.573276                      1
Zimbabwe                      -13.591810          4.812787  -11.890550        -24.567037  -10.136032    12.747887                      0

132 rows × 7 columns

# Check which cluster the United Kingdom was assigned to
UK_cluster = mobility_trends_countries[mobility_trends_countries.index == 'United Kingdom']
UK_cluster
                       Retail_Recreation  Grocery_Pharmacy       Parks  Transit_stations  Workplaces  Residential  clusters_countries_k4
country_region
United Kingdom                -39.472743        -10.282515   24.197996        -39.499108  -36.766465    13.327693                      2
# Access the UK cluster label (.iloc gives positional access on a string index)
UK_cluster.clusters_countries_k4.iloc[0]
2
# Identify which other countries were assigned to the same cluster
# These countries were found to be similar to the United Kingdom in terms of mobility trends since mid-February 2020
mobility_trends_countries[mobility_trends_countries.clusters_countries_k4 == UK_cluster.clusters_countries_k4.iloc[0]]
                       Retail_Recreation  Grocery_Pharmacy       Parks  Transit_stations  Workplaces  Residential  clusters_countries_k4
country_region
Argentina                     -42.767622         -9.831666  -60.221476        -47.383363  -13.202182    10.534182                      2
Barbados                      -30.920817        -22.433538   -5.081061        -49.439210  -36.433215     9.512949                      2
Bolivia                       -35.997935        -19.102914  -31.701567        -41.995431  -21.858200    14.288977                      2
Cambodia                      -18.407269        -14.164995  -20.483063        -40.268325  -22.197020    10.603417                      2
Cape Verde                    -39.040680        -16.577710  -33.111479        -62.015756  -35.452140     9.100478                      2
Chile                         -44.266674        -25.772082  -47.509813        -43.702189  -23.269113    16.791091                      2
Colombia                      -36.840248        -14.690320  -32.774738        -37.933917  -17.016716    12.739796                      2
Costa Rica                    -35.698276        -17.092672  -44.469828        -41.903017  -25.431034    11.773707                      2
Dominican Republic            -31.785243        -11.837258  -33.502995        -37.631738  -33.087226     8.748230                      2
Ecuador                       -25.154979        -11.445300  -28.751560        -34.073760  -30.943330    14.585333                      2
El Salvador                   -28.428939        -11.844498  -32.090541        -34.760691  -29.947196     9.074017                      2
Guatemala                     -27.042781        -17.369125  -32.112737        -38.151055  -25.606951    10.235432                      2
Honduras                      -38.872129        -17.379333  -32.819142        -41.168144  -30.528436     6.763165                      2
Ireland                       -36.003271         -0.026192    0.298277        -37.251915  -32.600999    11.942440                      2
Jamaica                       -30.773813        -17.461305  -30.643304        -31.594290  -28.866697     9.255676                      2
Jordan                        -20.583504         -0.437920  -25.117588        -50.765379  -27.307434     7.997329                      2
Kuwait                        -35.214461        -22.807974  -33.146592        -22.405014  -31.186724    14.531369                      2
Malaysia                      -30.691776         -4.640727  -24.383438        -42.217016  -22.913226    13.412519                      2
Mauritius                     -23.827733        -16.029026  -44.022921        -41.030328  -24.953361     4.511770                      2
Mexico                        -30.768352         -6.125392  -33.340191        -31.194413  -22.051789    10.532393                      2
Morocco                       -35.977909         -2.484914  -37.414871        -31.976832  -25.785022    12.800108                      2
Myanmar (Burma)               -43.322557        -27.290230  -34.253109        -41.697540  -35.264368    18.104885                      2
Nepal                         -31.980603         -9.426724  -14.109914        -18.517241  -24.323276    10.351293                      2
Oman                          -27.404479        -18.202566  -34.916705        -40.761905  -24.861060     5.377673                      2
Panama                        -51.346983        -29.855603  -55.412716        -50.258621  -43.082974    22.267241                      2
Paraguay                      -27.135224        -13.742721  -36.531584        -43.110009  -14.719577    11.431933                      2
Peru                          -44.337503        -27.999393  -36.972401        -51.462637  -26.538801    18.121060                      2
Philippines                   -40.944232        -18.385323  -26.451556        -48.465625  -31.722508    19.228799                      2
Portugal                      -30.970320         -4.338461   -6.185417        -34.417183  -26.138433    12.507143                      2
Puerto Rico                   -24.460958        -16.850309  -37.469316        -50.508659  -25.252776    10.155493                      2
Rwanda                        -11.412293        -22.185695   -2.078018        -27.811344  -22.512664    18.786119                      2
Singapore                     -24.732759         -2.237069  -23.308190        -32.092672  -22.823276    18.168103                      2
Slovenia                      -33.408553        -17.544238   -8.738597        -33.636297  -27.880590     9.272195                      2
South Africa                  -21.221910         -5.637265  -22.494841        -30.469811  -21.134570    13.016379                      2
Sri Lanka                     -31.191810        -14.495690  -20.627155        -27.196121  -28.262931    15.911638                      2
The Bahamas                   -34.983883        -19.233674   -5.453125        -55.360136  -33.399289     6.621698                      2
Trinidad and Tobago           -23.980603          1.592672  -20.668103        -40.260776  -26.657328     8.508621                      2
United Kingdom                -39.472743        -10.282515   24.197996        -39.499108  -36.766465    13.327693                      2
Venezuela                     -32.081897         -8.306034  -27.659483        -36.493534  -21.640086    13.185345                      2
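Because the data frame is indexed by country name, the cluster label can also be read with a label-based lookup, which avoids relying on positional access into a string-indexed Series (recent pandas releases warn about that pattern). The sketch below is a minimal illustration on a three-row toy frame: `toy`, `uk_label` and `peers` are names introduced for the example, and the values are rounded from the tables above.

```python
import pandas as pd

# Small illustrative frame; values are rounded from the tables above
toy = pd.DataFrame(
    {
        "Workplaces": [-36.77, -32.60, -8.28],
        "clusters_countries_k4": [2, 2, 1],
    },
    index=pd.Index(
        ["United Kingdom", "Ireland", "Afghanistan"], name="country_region"
    ),
)

# Label-based scalar lookup: row "United Kingdom", column with the cluster label
uk_label = toy.loc["United Kingdom", "clusters_countries_k4"]

# Countries in the same cluster, excluding the United Kingdom itself
peers = toy[toy["clusters_countries_k4"] == uk_label].drop(index="United Kingdom")
print(peers.index.tolist())
```

Dropping the reference country from the result is optional; it simply makes the list read as "countries similar to the United Kingdom".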