Learning from Data to Predict¶

Key themes¶

• Supervised learning

• Machine learning tasks — e.g., regression (continuous) and classification (binary)

• Building and evaluation of simple prediction models

• The problem of model overfitting and strategies to avoid it:

• Splitting the data into training set and testing set

• Cross-validation

• Introduction to supervised machine learning algorithms, including k-Nearest Neighbors and Logistic Regression

Learning resources¶

Predictability of life trajectories by Matthew Salganik

Introduction to Machine Learning Methods by Susan Athey

Machine Learning with Scikit Learn by Jake VanderPlas

M Molina & F Garip. 2019. Machine learning for sociology. Annual Review of Sociology. Link to an open-access version of the article available at the Open Science Framework.

Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane. 2021. Chapter 7: Machine Learning. In Big Data and Social Science (2nd edition).

Aurélien Géron. 2019. Chapter 2: End-to-end Machine Learning project. In Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Edition). O’Reilly.

Prediction is a data science task among other data science tasks, including description and causal inference. Prediction is the use of data to map some input (X) to an output (Y). The prediction task is called classification when the output variable is categorical (or discrete), and regression when it is continuous. Our focus in this session will be on classification.

There are many prediction problems in social sciences (summarised in Kleinberg et al. 2015) that can benefit from (supervised) machine learning, for example:

• In child protection, predicting when kids are in danger;

• In the criminal justice system, predicting whether to detain or release arrestees as they await adjudication of their case (e.g., Kleinberg et al. 2015);

• In population health, predicting suicides;

• In education, predicting which teacher will have the greatest value add (e.g., Rockoff et al., 2011);

• In higher education, predicting earlier university dropouts;

• In labor market policy, predicting unemployment spell length to help workers decide on savings rates and job search strategies;

• In social policy, predicting highest risk youth for targeting interventions (e.g., Chandler et al., 2011);

• In sociology, predicting life outcomes (Salganik et al. 2020).

Predictions gone wrong¶

Prediction and machine learning models have gone wrong on a number of occasions in different domains, including public health, education, the criminal justice system, and healthcare.

Regardless of whether or not you use machine learning in your research, knowledge of prediction and machine learning techniques can help you evaluate how those techniques are used across domains and identify ethical challenges and potential biases in those applications. Importantly, such data ethics challenges are found to reside not only in the machine learning algorithms themselves but in the entire data science ‘pipeline’ or ecosystem.

Supervised learning¶

In supervised learning, we learn a model from labelled training data (data for which the outcome variable is observed) that enables us to make predictions about unseen or future data. The learning is called supervised because the labels of the outcome variable (Y) that guide the learning process are already known, e.g., an email is labelled Spam or Ham, where ‘Ham’ is email that is not Spam.

Research problem: vaccine hesitancy¶

We will aim to predict people who are unlikely to take a coronavirus vaccine (Y) from socio-demographic and health input features (X). An unbiased prediction of individuals who are unlikely to vaccinate can inform targeted public health interventions, including information campaigns disseminating evidence-based information about Covid-19 vaccines.

Data: Understanding Society COVID-19¶

We will use data from The Understanding Society: Covid-19 Study. The survey asks participants across the UK about their experiences during the COVID-19 outbreak. We will use Wave 6 (November 2020) of the survey.

The data are safeguarded and can be downloaded from the UK Data Service.

# Import the Drive helper (available in Google Colab)
from google.colab import drive

# This will prompt for authorisation. Enter your authorisation code and rerun the cell.
drive.mount('/content/drive')


import pandas as pd
import numpy as np

# Display all columns in the Understanding Society: COVID-19 Study
pd.options.display.max_columns = None

# Load Wave 6 of the study. The file name below is a placeholder:
# point it to the tab-delimited file you downloaded from the UK Data Service.
USocietyCovid = pd.read_csv('covid19_wave6.tab', sep='\t')


USocietyCovid.shape

(12035, 916)

USocietyCovid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12035 entries, 0 to 12034
Columns: 916 entries, pidp to cf_betaindin_lw_t2
dtypes: float64(10), int64(855), object(51)
memory usage: 84.1+ MB


Defining Output and Input variables¶

Here are the Output and Input data features we will use in this session.

Outcome: Output (Y)¶

| Description | Variable | Values |
| --- | --- | --- |
| Likelihood of taking up a coronavirus vaccination | cf_vaxxer | 1 = Very likely, 2 = Likely, 3 = Unlikely, 4 = Very unlikely |

Predictors: Input features (X)¶

We select four (demographic and health-related) variables as examples only; no prior literature or expert knowledge is considered. We will discuss the role of prior literature and expert knowledge in the process of variable selection when we learn about causal inference approaches.

| Description | Variable | Values |
| --- | --- | --- |
| Age | cf_age | Integer values (whole numbers) |
| Respondent sex | cf_sex_cv | 1 = Male, 2 = Female, 3 = Prefer not to say |
| General health | cf_scsf1 | 1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor |
| At risk of serious illness from Covid-19 | cf_clinvuln_dv | 0 = no risk (not clinically vulnerable), 1 = moderate risk (clinically vulnerable), 2 = high risk (clinically extremely vulnerable) |

Data wrangling¶

# Select output y and input X variables
USocietyCovid = USocietyCovid[['cf_vaxxer', 'cf_age', 'cf_sex_cv', 'cf_scsf1', 'cf_clinvuln_dv']]
USocietyCovid.head()

cf_vaxxer cf_age cf_sex_cv cf_scsf1 cf_clinvuln_dv
0 2 37 2 2 0
1 3 35 1 4 0
2 3 55 2 2 0
3 1 38 1 3 1
4 1 67 2 2 0
import seaborn as sns
sns.set_context("notebook", font_scale=1.5)
%matplotlib inline

fig = sns.catplot(x="cf_vaxxer",
                  kind="count",
                  height=6, aspect=1.5, palette="ch:.25",
                  data=USocietyCovid)

# Tweak the plot
(fig.set_axis_labels("Likelihood of taking up a coronavirus vaccination", "Frequency")
    .set_xticklabels(["missing", "inapplicable", "refusal", "don't know", "Very likely", "Likely", "Unlikely", "Very unlikely"])
    .set_xticklabels(rotation=45))

<seaborn.axisgrid.FacetGrid at 0x7f9209efc670>


Missing observations in Understanding Society are indicated by negative values. Let’s convert negative values to NaN using the pandas method mask(). An alternative approach would be to reload the data with the pandas read_csv() function, passing the negative values to the na_values parameter so that pandas recognises them as NaN on load.
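To make the na_values alternative concrete, here is a minimal sketch using a tiny in-memory file. The column names are taken from this session, but the missing-value codes shown are illustrative only; check the study codebook for the actual codes.

```python
# Sketch: let read_csv treat negative missing-value codes as NaN at load time.
# The codes [-1, -9] here are illustrative; consult the codebook for the real ones.
import io
import pandas as pd

raw = "cf_vaxxer\tcf_age\n2\t37\n-9\t35\n1\t-1\n"
df = pd.read_csv(io.StringIO(raw), sep='\t', na_values=[-1, -9])
print(df)  # the -9 and -1 codes are read in as NaN
```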

# The pandas method mask() replaces values where a condition is met
# Alternatively, you could replace negative values with another value, e.g., 0, using USocietyCovid.mask(USocietyCovid < 0, 0)
USocietyCovid = USocietyCovid.mask(USocietyCovid < 0)
USocietyCovid

cf_vaxxer cf_age cf_sex_cv cf_scsf1 cf_clinvuln_dv
0 2.0 37 2 2.0 0.0
1 3.0 35 1 4.0 0.0
2 3.0 55 2 2.0 0.0
3 1.0 38 1 3.0 1.0
4 1.0 67 2 2.0 0.0
... ... ... ... ... ...
12030 1.0 57 1 2.0 0.0
12031 2.0 70 2 3.0 1.0
12032 2.0 64 1 2.0 0.0
12033 4.0 31 1 1.0 0.0
12034 3.0 41 2 3.0 0.0

12035 rows × 5 columns

# Remove NaN
USocietyCovid = USocietyCovid[['cf_vaxxer', 'cf_age', 'cf_sex_cv', 'cf_scsf1', 'cf_clinvuln_dv']].dropna()

USocietyCovid

cf_vaxxer cf_age cf_sex_cv cf_scsf1 cf_clinvuln_dv
0 2.0 37 2 2.0 0.0
1 3.0 35 1 4.0 0.0
2 3.0 55 2 2.0 0.0
3 1.0 38 1 3.0 1.0
4 1.0 67 2 2.0 0.0
... ... ... ... ... ...
12030 1.0 57 1 2.0 0.0
12031 2.0 70 2 3.0 1.0
12032 2.0 64 1 2.0 0.0
12033 4.0 31 1 1.0 0.0
12034 3.0 41 2 3.0 0.0

11930 rows × 5 columns

# Plot the new cf_vaxxer (vaccination likelihood) variable
fig = sns.catplot(x="cf_vaxxer",
                  kind="count",
                  height=6, aspect=1.5, palette="ch:.25",
                  data=USocietyCovid)

# Tweak the plot
(fig.set_axis_labels("Likelihood of taking up a coronavirus vaccination", "Frequency")
    .set_xticklabels(["Very likely", "Likely", "Unlikely", "Very unlikely"])
    .set_xticklabels(rotation=45))

<seaborn.axisgrid.FacetGrid at 0x7f92284700d0>


To simplify the problem, we will recode the cf_vaxxer (vaccination likelihood) variable into a binary variable where 1 refers to ‘Likely to take up a Covid-19 vaccine’ and 0 refers to ‘Unlikely to take up a Covid-19 vaccine’. To achieve this, we use the replace() method, which replaces a set of values we specify (in our case, [1,2,3,4]) with another set of values we specify (in our case, [1,1,0,0]).

# Recode cf_vaxxer into a binary variable
USocietyCovid['cf_vaxxer'] = USocietyCovid['cf_vaxxer'].replace([1,2,3,4],[1,1,0,0])
USocietyCovid.head()

cf_vaxxer cf_age cf_sex_cv cf_scsf1 cf_clinvuln_dv
0 1.0 37 2 2.0 0.0
1 0.0 35 1 4.0 0.0
2 0.0 55 2 2.0 0.0
3 1.0 38 1 3.0 1.0
4 1.0 67 2 2.0 0.0
# Plot the binary cf_vaxxer (vaccination likelihood) variable
fig = sns.catplot(x="cf_vaxxer",
                  kind="count",
                  height=6, aspect=1.5, palette="ch:.25",
                  data=USocietyCovid)

# Tweak the plot
(fig.set_axis_labels("Likelihood of taking up a coronavirus vaccination", "Frequency")
    .set_xticklabels(["Unlikely", "Likely"])
    .set_xticklabels(rotation=45))

<seaborn.axisgrid.FacetGrid at 0x7f92586612b0>

USocietyCovid.groupby('cf_vaxxer').count()

cf_age cf_sex_cv cf_scsf1 cf_clinvuln_dv
cf_vaxxer
0.0 1837 1837 1837 1837
1.0 10093 10093 10093 10093
USocietyCovid.shape[0]

11930

# 84.6% of respondents very likely or likely to take up a Covid vaccine and 15.4% very unlikely or unlikely
USocietyCovid.groupby('cf_vaxxer').count()/USocietyCovid.shape[0]

cf_age cf_sex_cv cf_scsf1 cf_clinvuln_dv
cf_vaxxer
0.0 0.153982 0.153982 0.153982 0.153982
1.0 0.846018 0.846018 0.846018 0.846018

So far, we have described our outcome variable to make sense of the task but we have neither looked at the predictor variables nor examined any relationships between predictor variables and outcomes. It is a good practice to first split the data into training set and test set and only then explore predictors and relationships in the training set.

Overfitting and data splitting¶

The problem of model overfitting¶

Overfitting occurs when a model captures ‘noise’ in a specific sample while failing to recognise general patterns across samples. As a result of overfitting, the model produces accurate predictions for examples from the sample at hand but predicts poorly on new examples it has never seen.
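The effect is easy to demonstrate. The sketch below, which uses synthetic data rather than the lab data, fits a 1-nearest-neighbour classifier that memorises noisy training labels: it scores perfectly on the training sample but noticeably worse on a held-out sample from the same distribution.

```python
# Illustrative sketch (synthetic data, not the lab data): a 1-nearest-neighbour
# model memorises the training sample, scoring perfectly on it while doing
# worse on held-out data drawn from the same distribution.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)  # noisy labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print('Train accuracy:', model.score(X_tr, y_tr))  # 1.0: the noise is memorised
print('Test accuracy:', model.score(X_te, y_te))   # noticeably lower
```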

Training set, Validation set, and Test set¶

To avoid overfitting, data is typically split into three groups:

• Training set — used to train models

• Validation set — used to tune the model and estimate model performance/accuracy for best model selection

• Test set — used to evaluate the generalisability of the model to new observations it has never seen

If your data set is not large enough, a possible strategy, which we will use here, is to split the data into training set and test set, and use cross-validation on the training set to evaluate our models’ performance/accuracy. We will use 2/3 of the data to train the predictive model and the remaining 1/3 to create the test set.

# Split train and test data

from sklearn.model_selection import train_test_split

# Outcome variable
y = USocietyCovid[['cf_vaxxer']]

# Predictor variables
X = USocietyCovid[['cf_age', 'cf_sex_cv', 'cf_scsf1', 'cf_clinvuln_dv']]

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)

print('Train data', X_train.shape, '\n''Test data', X_test.shape)

Train data (7993, 4)
Test data (3937, 4)


Preprocessing the training data set¶

Categorical predictors — dummy variables¶

Categorical variables are often encoded using numeric values. For example, Respondent sex is recorded as 1 = Male, 2 = Female, 3 = Prefer not to say. Such numeric values can be ‘misinterpreted’ by algorithms: the value 1 is less than the value 3, but this ordering does not correspond to any real-world quantitative difference between the categories.

A solution is to convert categorical predictors into dummy variables. Each category value is converted into a new column that is assigned a value of 1 or 0 (True/False) using the pandas function get_dummies(), which creates dummy/indicator variables.

The Respondent sex variable is converted below into three columns of 1s and 0s, one column per category value.

# Use get_dummies to convert the Respondent sex categorical variable into 3 dummy/indicator variables
X_train_predictors = pd.get_dummies(X_train, columns=["cf_sex_cv"])
X_train_predictors.head()

cf_age cf_scsf1 cf_clinvuln_dv cf_sex_cv_1 cf_sex_cv_2 cf_sex_cv_3
994 28 2.0 0.0 0 1 0
11376 41 5.0 0.0 0 1 0
9730 77 3.0 1.0 0 1 0
8235 41 3.0 2.0 0 1 0
1406 60 2.0 0.0 0 1 0
# Create two DataFrames, one for numerical variables and one for categorical variables
X_train_predictors_cat = X_train_predictors[['cf_sex_cv_1', 'cf_sex_cv_2', 'cf_sex_cv_3']]
X_train_predictors_cont = X_train_predictors[['cf_age', 'cf_scsf1', 'cf_clinvuln_dv']]


Continuous predictors — standardisation¶

We standardise the continuous input variables.

# Standardise the predictors using the StandardScaler function in sklearn
from sklearn.preprocessing import StandardScaler  # For standartising data
scaler = StandardScaler() # Initialising the scaler using the default arguments
X_train_predictors_cont_scale = scaler.fit_transform(X_train_predictors_cont) # Fit to continuous input variables and return the standardised dataset
X_train_predictors_cont_scale

array([[-1.65608796, -0.56081614, -0.80754698],
[-0.84497135,  2.64868513, -0.80754698],
[ 1.40119771,  0.50901761,  0.81873748],
...,
[-0.96975852, -1.6306499 , -0.80754698],
[ 0.46529394,  0.50901761,  2.44502193],
[ 0.77726186,  0.50901761,  0.81873748]])
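As a quick sanity check, StandardScaler simply subtracts each column’s mean and divides by its standard deviation. The illustrative sketch below verifies this on a small array.

```python
# What StandardScaler computes: subtract each column's mean and divide by its
# standard deviation (an illustrative check on a small array).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[20.0, 1.0], [40.0, 3.0], [60.0, 5.0]])
scaled = StandardScaler().fit_transform(X)

manual = (X - X.mean(axis=0)) / X.std(axis=0)  # population (ddof=0) std, as sklearn uses
print(np.allclose(scaled, manual))  # True
```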


Combine categorical and continuous predictors into one data array¶

# Use the concatenate function in Numpy to combine all variables — both categorical and continuous predictors — in one array
X_train_preprocessed = np.concatenate([X_train_predictors_cont_scale,X_train_predictors_cat], axis = 1)
X_train_preprocessed

array([[-1.65608796, -0.56081614, -0.80754698,  0.        ,  1.        ,
0.        ],
[-0.84497135,  2.64868513, -0.80754698,  0.        ,  1.        ,
0.        ],
[ 1.40119771,  0.50901761,  0.81873748,  0.        ,  1.        ,
0.        ],
...,
[-0.96975852, -1.6306499 , -0.80754698,  0.        ,  1.        ,
0.        ],
[ 0.46529394,  0.50901761,  2.44502193,  0.        ,  1.        ,
0.        ],
[ 0.77726186,  0.50901761,  0.81873748,  1.        ,  0.        ,
0.        ]])

X_train_preprocessed.shape

(7993, 6)


Unbalanced class problem¶

In the case of the vaccination likelihood question, one of the classes (likely to vaccinate) has a significantly greater proportion of cases (84.6%) than the other class (unlikely to vaccinate, 15.4%). We therefore face an unbalanced class problem.

Different methods to mitigate the problem exist. We will use a method called ADASYN: Adaptive Synthetic Sampling Method for Imbalanced Data. The method oversamples the minority class in the training data set until the two classes have approximately equal numbers of observations. Hence, the data set we use to train our models contains two (nearly) balanced classes.

from imblearn.over_sampling import ADASYN

# Initialise the ADASYN resampling method; set random_state for reproducibility
ada = ADASYN(random_state=0)

# Fit the ADASYN resampling method to the training data
X_train_balance, y_train_balance = ada.fit_resample(X_train_preprocessed, y_train)

The resulting X_train_balance and y_train_balance now include both the original data and the resampled data. The y_train_balance now includes an almost equal number of labels for each class.

# Now that the two classes are balanced, the train data is ~14K observations, greater than the original ~8K.
X_train_balance.shape

(13611, 6)


Hands-on mini-exercise¶

Verify that after the oversampling the y_train_balance data object contains indeed approximately equal number of observations for both classes, those likely to vaccinate (1) and those unlikely to vaccinate (0).

Note that the exact type of y_train_balance (NumPy array or pandas object) depends on your imblearn version and on the type of the input data. You can check it on your own using the function type(), for example: type(y_train_balance).

(y_train_balance == 0).sum()

cf_vaxxer    6849
dtype: int64

(y_train_balance == 1).sum()

cf_vaxxer    6762
dtype: int64


Train models on training data¶

We fit two widely used classifiers, k-Nearest Neighbours (k-NN) and Logistic Regression, on the training data. Our focus is on the end-to-end workflow, so we do not discuss the workings of the two classifiers in detail. To learn more about them, see the Python Data Science Handbook by Jake VanderPlas (on k-NN) and the DataCamp course Supervised Learning with scikit-learn.

In the models below, we use the default hyperparameters (hyperparameters are parameters that are not learned from data but are set by the researcher to guide the learning process) for both classifiers.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Create an instance of k-nearest neighbors (k-NN) classifier.
# We set the hyperparameter n_neighbors=5 meaning that the label of an unknown respondent (0 or 1) is a function of the labels of its five closest training respondents.
kNN_Classifier =  KNeighborsClassifier(n_neighbors = 5)

# Create an instance of Logistic Regression Classifier
LogReg_Classifier =  LogisticRegression()

# Fit both models to the training data
kNN_Classifier.fit(X_train_balance, y_train_balance.cf_vaxxer)
LogReg_Classifier.fit(X_train_balance, y_train_balance.cf_vaxxer)

LogisticRegression()


Model evaluation using Cross-validation¶

Now that your two models are fitted, you can evaluate the accuracy of their predictions. In older approaches, prediction accuracy was often calculated on the same training data used to fit the model. The problem with such an approach is that the model can ‘memorise’ the training data and show high prediction accuracy on that data set while failing to perform well on new data. For this reason, approaches in data science, and machine learning in particular, prefer to evaluate the prediction accuracy of a model on new data that was not used to train it.

The cross-validation technique¶

Cross-validation is a methodology for assessing the accuracy of model predictions without relying on in-sample prediction. We split our training set into k equal folds or parts. The number of folds can differ, but for simplicity we will consider 5-fold cross-validation. How does 5-fold cross-validation work? We keep one fold aside, fit the model on the remaining four folds, use the fitted model to predict the outcomes of the observations in the held-out fold, and on this basis compute the model’s prediction accuracy. We repeat the procedure for each of the 5 folds and compute the average prediction accuracy.
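Under the hood, this is what cross_val_score does for us. The sketch below reproduces the procedure manually with scikit-learn’s KFold splitter on synthetic data (illustrative only; in the lab we let cross_val_score handle the looping).

```python
# A manual sketch of 5-fold cross-validation using scikit-learn's KFold
# splitter on synthetic data (illustrative only).
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])  # fit on 4 folds
    scores.append(model.score(X[val_idx], y[val_idx]))            # score on the held-out fold
print('Fold accuracies:', np.round(scores, 3))
print('Mean accuracy:', np.mean(scores))
```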

Metrics to evaluate model performance¶

Many metrics to evaluate model performance exist. We evaluate model performance using the accuracy score. The accuracy score is the simplest metric for evaluating classification models. Simply put, accuracy is the proportion of predictions our model got right. Keep in mind, however, that because of the unbalanced class problem, accuracy may not be the best metric in our case. Because one of our classes accounts for 84.6% of the cases, even a model that uniformly predicts that all respondents are likely to take up the vaccine will obtain very high accuracy of 0.846 while being useless for identifying respondents who are unlikely to take up the vaccine. We will return to this problem shortly.
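This majority-class baseline is easy to verify. The sketch below uses scikit-learn’s DummyClassifier on illustrative labels with the same 84.6%/15.4% split as our sample; the features are irrelevant to this baseline.

```python
# Sketch: a majority-class 'model' attains high accuracy on imbalanced labels
# while never identifying a single minority case (illustrative, not the lab data).
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

y = np.array([1] * 846 + [0] * 154)  # ~84.6% majority class, as in our sample
X = np.zeros((len(y), 1))            # features are irrelevant to this baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)
print(accuracy_score(y, y_pred))  # 0.846: high, yet every minority case is missed
```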

# Import the function cross_val_score() which performs cross-validation and evaluates the model using a score
# Many scores are available to evaluate a classification model; we select the simplest one, called accuracy
from sklearn.model_selection import cross_val_score

# Evaluate the kNN_Classifier model via 5-fold cross-validation
kNN_score = cross_val_score(kNN_Classifier, X_train_balance, y_train_balance.cf_vaxxer, cv=5, scoring='accuracy')
kNN_score

array([0.6232097 , 0.65686995, 0.64070536, 0.65980896, 0.64731815])

# Take the mean across the five accuracy scores
kNN_score.mean()*100

64.55824239753719

# Repeat for our logistic regression model
LogReg_score = cross_val_score(LogReg_Classifier, X_train_balance, y_train_balance.cf_vaxxer, cv=5, scoring='accuracy')
LogReg_score.mean()*100

62.21431553077533


The output from the cross-validation technique shows that the performance of our two models is comparable as measured by the accuracy score.

At this stage, we could fine-tune model hyperparameters (parameters that the model does not learn from data, e.g., the number of neighbours k in the k-NN algorithm) and re-evaluate model performance. During the process of model validation, we do not use the test data. Once we are happy with how our model(s) perform, we test them on unseen data.
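As a sketch of what such fine-tuning could look like, the example below tunes the n_neighbors hyperparameter of k-NN with scikit-learn’s GridSearchCV on synthetic data; in the lab you would pass X_train_balance and y_train_balance instead.

```python
# Sketch: tuning the k-NN n_neighbors hyperparameter with GridSearchCV,
# shown on synthetic data (in the lab, substitute the balanced training data).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [3, 5, 7, 9, 11]},  # candidate values of k
    cv=5,
    scoring='accuracy',
)
search.fit(X, y)
print('Best n_neighbors:', search.best_params_['n_neighbors'])
print('Best CV accuracy:', round(search.best_score_, 3))
```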

Testing model accuracy on new data¶

Before we test the accuracy of our model on the test data set, we preprocess the test data set using the same procedure we used to preprocess the training data.

Preprocessing the test data set¶

# Use get_dummies to convert the Respondent sex categorical variable into 3 dummy/indicator variables
X_test_predictors = pd.get_dummies(X_test, columns=["cf_sex_cv"])

# Create two DataFrames, one for quantitative variables and one for qualitative variables
X_test_predictors_cat = X_test_predictors[['cf_sex_cv_1', 'cf_sex_cv_2', 'cf_sex_cv_3']]
X_test_predictors_cont = X_test_predictors[['cf_age', 'cf_scsf1', 'cf_clinvuln_dv']]

# Standardise the predictors using the StandardScaler function in sklearn
# Note: strictly speaking, to avoid information from the test set leaking into preprocessing,
# best practice is to reuse the scaler fitted on the training data, i.e., scaler.transform(X_test_predictors_cont)
scaler = StandardScaler() # Initialising the scaler using the default arguments
X_test_predictors_cont_scale = scaler.fit_transform(X_test_predictors_cont) # Fit to continuous input variables and return the standardised dataset

# Use the concatenate function in Numpy to combine all variables — both categorical and continuous predictors — in one array
X_test_preprocessed = np.concatenate([X_test_predictors_cont_scale,X_test_predictors_cat], axis = 1)
X_test_preprocessed

array([[ 0.44686144, -1.63480238, -0.82552056,  1.        ,  0.        ,
0.        ],
[-0.41738323, -0.58244513, -0.82552056,  0.        ,  1.        ,
0.        ],
[-0.35565147, -0.58244513, -0.82552056,  1.        ,  0.        ,
0.        ],
...,
[ 0.01473911, -0.58244513, -0.82552056,  1.        ,  0.        ,
0.        ],
[ 1.37283788,  0.46991213,  0.7914319 ,  0.        ,  1.        ,
0.        ],
[-0.47911499, -1.63480238, -0.82552056,  1.        ,  0.        ,
0.        ]])


Predicting vaccine hesitancy¶

Use the predict function to predict who is likely to take up the COVID-19 vaccine or not using the test data.

y_pred_kNN = kNN_Classifier.predict(X_test_preprocessed)
y_pred_LogReg = LogReg_Classifier.predict(X_test_preprocessed)

y_pred_LogReg

array([1., 0., 1., ..., 1., 1., 1.])


Model evaluation on test data¶

Let’s evaluate the performance of our models in predicting vaccination willingness using the accuracy metric.

# Evaluate performance using the accuracy score for the logistic regresson model
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_LogReg)

0.6245872491744984

# Evaluate performance using the accuracy score for the  k-nearest neighbors model
accuracy_score(y_test, y_pred_kNN)

0.5044450088900178


The accuracy scores on the test set are slightly lower than the cross-validated scores on the training set. This is expected: evaluating on held-out data gives an honest estimate of how well the models generalise to observations they have never seen.

Accuracy is a good metric when the positive class and the negative class are balanced. When one class forms a large majority, as in our case, a model can achieve high accuracy simply by predicting the majority class for every observation. This is not what we want: in order to inform an information campaign about vaccination, we are more interested in predicting the minority class, people who are unlikely to take up the vaccine.

We can use a confusion matrix to further evaluate the performance of our classification models. The confusion matrix shows the number of respondents known to be in group 0 (unlikely to vaccinate) or 1 (likely to vaccinate) and predicted to be in group 0 or 1, respectively.

The confusion matrix below shows that the logistic regression model correctly predicts 393 of the 606 respondents who are unlikely to vaccinate and 2066 of the 3331 respondents who are likely to vaccinate in the test data set.

# Confusion matrix for the logistic regression model plotted via Pandas crosstab() function

pd.crosstab(y_test.cf_vaxxer,y_pred_LogReg, rownames=['Actual'], colnames=['Predicted'], margins=True)

Predicted 0.0 1.0 All
Actual
0.0 393 213 606
1.0 1265 2066 3331
All 1658 2279 3937

What do the numbers in the confusion matrix mean?

• True positive - our model correctly predicts the positive class (likely to vaccinate)

• True negative - our model correctly predicts the negative class (unlikely to vaccinate)

• False positive - our model incorrectly predicts the positive class (the actual class is negative)

• False negative - our model incorrectly predicts the negative class (the actual class is positive)

# Here is another representation of the confusion matrix using the scikit-learn confusion_matrix function
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred_LogReg)

array([[ 393,  213],
[1265, 2066]])

# The function ravel() flattens the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test,y_pred_LogReg).ravel()
print('True negative = ', tn, '\nFalse positive = ', fp, '\nFalse negative = ', fn, '\nTrue positive = ', tp)

True negative =  393
False positive =  213
False negative =  1265
True positive =  2066
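To connect these counts to precision and recall, we can compute both metrics by hand from the confusion-matrix entries above; the rounded values match the classification_report output shown later in this section.

```python
# Precision and recall computed by hand from the confusion-matrix counts above
tn, fp, fn, tp = 393, 213, 1265, 2066

precision_likely = tp / (tp + fp)    # of those predicted 'likely', the share that truly are
recall_likely = tp / (tp + fn)       # of the actual 'likely', the share the model finds
precision_unlikely = tn / (tn + fn)  # the same two quantities for the minority class
recall_unlikely = tn / (tn + fp)

print(round(precision_likely, 2), round(recall_likely, 2))      # 0.91 0.62
print(round(precision_unlikely, 2), round(recall_unlikely, 2))  # 0.24 0.65
```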


For the k-nearest neighbours model, the confusion matrix below shows that it correctly predicts 316 of the 606 respondents who are unlikely to vaccinate (the logistic regression model predicts those unlikely to vaccinate more accurately). The k-NN model also does not predict well the respondents who are likely to vaccinate, correctly identifying only 1670 of the 3331 in the test data set.

Recall that we are less interested in predicting the majority class (likely to vaccinate). Instead, we are interested in predicting the minority class (unlikely to vaccinate) so that the results can inform an information campaign among people that are unlikely to vaccinate.

# Confusion matrix for the k-nearest neighbors model plotted via pandas function crosstab
pd.crosstab(y_test.cf_vaxxer,y_pred_kNN, rownames=['Actual'], colnames=['Predicted'], margins=True)

Predicted 0.0 1.0 All
Actual
0.0 316 290 606
1.0 1661 1670 3331
All 1977 1960 3937

Instead of relying on a single metric, it is often helpful to compare several metrics. You can use the scikit-learn function classification_report to calculate various classification metrics, including precision and recall.

from sklearn.metrics import classification_report

# Various metrics for the logistic regression model
print(classification_report(y_test,y_pred_LogReg))

              precision    recall  f1-score   support

0.0       0.24      0.65      0.35       606
1.0       0.91      0.62      0.74      3331

accuracy                           0.62      3937
macro avg       0.57      0.63      0.54      3937
weighted avg       0.80      0.62      0.68      3937

# Various metrics for the k-nearest neighbors model
print(classification_report(y_test,y_pred_kNN))

              precision    recall  f1-score   support

0.0       0.16      0.52      0.24       606
1.0       0.85      0.50      0.63      3331

accuracy                           0.50      3937
macro avg       0.51      0.51      0.44      3937
weighted avg       0.75      0.50      0.57      3937


Overall, prediction accuracies of approximately 62% and 50% for the two models, together with the low predictive accuracy for the minority class (unlikely to vaccinate), indicate that the performance of our models is far from optimal. However, the purpose of this lab is not to build a well-performing model but to introduce you to an end-to-end machine learning workflow.

Keep in mind that it is not good research practice to go back and fine-tune the training models now, after you have tested them on the test data, as this will introduce overfitting. Good research practice is to fine-tune and improve your model(s) at the stage of training and cross-validation (not after you have tested your model on unseen data). Once you select your best-performing model(s) at the cross-validation stage, you test them using the test data and report the performance scores.

As part of your data analysis exercises, you will have another opportunity to build a new machine learning model and evaluate model performance.