Learning from Data to Predict¶
Key themes¶
The prediction task
Supervised learning
Machine learning tasks — e.g., regression (continuous) and classification (binary)
Building and evaluation of simple prediction models
The problem of model overfitting and strategies to avoid it:
Splitting the data into training set and testing set
Cross-validation
Introduction to supervised machine learning algorithms, including k-Nearest Neighbors and Logistic Regression
Learning resources¶
Predictability of life trajectories by Matthew Salganik
Introduction to Machine Learning Methods by Susan Athey
Machine Learning with Scikit Learn by Jake VanderPlas
M Molina & F Garip. 2019. Machine learning for sociology. Annual Review of Sociology. An open-access version of the article is available at the Open Science Framework.
Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane. 2021. Chapter 7: Machine Learning. In Big Data and Social Science (2nd edition).
Aurélien Géron. 2019. Chapter 2: End-to-end Machine Learning project. In Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Edition). O’Reilly.
The prediction task¶
Prediction is one of several data science tasks, alongside description and causal inference. Prediction is the use of data to map some input (X) to an output (Y). The prediction task is called classification when the output variable is categorical (or discrete), and regression when it is continuous. Our focus in this session will be on classification.
Prediction tasks in social sciences¶
There are many prediction problems in social sciences (summarised in Kleinberg et al. 2015) that can benefit from (supervised) machine learning, for example:
In child protection, predicting when kids are in danger;
In the criminal justice system, predicting whether to detain or release arrestees as they await adjudication of their case (e.g., Kleinberg et al. 2015);
In population health, predicting suicides;
In education, predicting which teacher will have the greatest value add (e.g., Rockoff et al., 2011);
In higher education, predicting earlier university dropouts;
In labor market policy, predicting unemployment spell length to help workers decide on savings rates and job search strategies;
In social policy, predicting highest risk youth for targeting interventions (e.g., Chandler et al., 2011);
In sociology, predicting life outcomes (Salganik et al. 2020).
Predictions gone wrong¶
Prediction and machine learning models have gone wrong on a number of occasions in different domains, including public health, education, the criminal justice system, and healthcare:
David Lazer et al. The Parable of Google Flu: Traps in Big Data Analysis. Science.
What went wrong with the A-level algorithm? Financial Times.
Blame the politicians, not the technology, for A-level fiasco. Financial Times.
Machine Bias. ProPublica.
Ziad Obermeyer et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science.
Regardless of whether or not you use machine learning in your research, knowledge about prediction and machine learning techniques can help you evaluate how those techniques are used across domains and possibly identify ethical challenges and potential biases in those applications. Importantly, such data ethics challenges are found to reside not only in the machine learning algorithms themselves but in the entire data science ‘pipeline’ or ecosystem.
Supervised learning¶
In supervised learning, we learn a model from labelled training data — data where the outcome variable is known — that enables us to make predictions about unseen or future data. The learning is called supervised because the labels of the outcome variable (Y) that guide the learning process (e.g., email Spam or Ham, where ‘Ham’ is email that is not Spam) are already known.
Research problem: vaccine hesitancy¶
We will aim to predict people who are unlikely to take a coronavirus vaccine (Y) from socio-demographic and health input features (X). An unbiased prediction of individuals who are unlikely to vaccinate can inform targeted public health interventions, including information campaigns disseminating evidence-based information about Covid-19 vaccines.
Data: Understanding Society COVID-19¶
We will use data from The Understanding Society: Covid-19 Study. The survey asks participants across the UK about their experiences during the COVID-19 outbreak. We use the Wave 6 (November 2020) web-collected survey data. More information about the survey data and questionnaire is available in the study documentation on the UK Data Service website.
The data are safeguarded and are available to users registered with the UK Data Service.
Once access to the data is obtained, the data need to be stored securely in your Google Drive and loaded in your private Colab notebook. The data are provided in various file formats; we use the .tab file format (tab files store data values separated by tabs), which can be easily loaded using pandas. The web-collected data of the survey from Wave 6 (November 2020) is stored in the file cf_indresp_w.tab.
Note
The workflow in this session assumes that learners, first, have registered with the UK Data Service and obtained access to the Understanding Society: Covid-19 Study (Wave 6, November 2020, web-collected data) and, second, have safely and securely stored the data in their Google Drive as a tab-separated values (TAB) file named cf_indresp_w.tab. If you have not registered with the UK Data Service and obtained access to the data, you can still read the textbook chapter and follow the analytical steps but will not be able to work interactively with the notebook.
Accessing data from your Google Drive¶
After you obtain access to the Understanding Society: Covid-19 Study, 2020, you can upload the Wave 6 (November 2020) data set into your Google Drive. Then you will need to connect your Google Drive to your Google Colab using the code below:
# Import the Drive helper
from google.colab import drive
# This will prompt for authorization.
# Enter your authorisation code and rerun the cell.
drive.mount("/content/drive")
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-71de872a6421> in <cell line: 2>()
1 # Import the Drive helper
----> 2 from google.colab import drive
3
4 # This will prompt for authorization.
5 # Enter your authorisation code and rerun the cell.
ModuleNotFoundError: No module named 'google.colab'
Note
The above code will execute in Colab but will give an error (ModuleNotFoundError: No module named 'google.colab') when the notebook is run outside Colab.
Loading the Understanding Society Covid-19 Study (Wave 6, November 2020, Web collected)¶
import pandas as pd
import numpy as np
# Load the Understanding Society COVID-19 Study web-collected data, Wave 6
# Set the delimiter parameter sep to "\t", which indicates tabs
USocietyCovid = pd.read_csv(
"/content/drive/My Drive/cf_indresp_w.tab",
sep="\t",
)
# Display all columns in the Understanding Society: COVID-19 Study
pd.options.display.max_columns = None
USocietyCovid.head(0) # display headings only as the data is safeguarded
[Output truncated: a single header row listing all 916 column names in the data set, from pidp, psu, and strata through the substantive survey variables (e.g., cf_vaxxer, cf_age, cf_sex_cv, cf_scsf1, cf_clinvuln_dv) to cf_betaindin_lw_t2]
USocietyCovid.shape
(12035, 916)
USocietyCovid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12035 entries, 0 to 12034
Columns: 916 entries, pidp to cf_betaindin_lw_t2
dtypes: float64(10), int64(855), object(51)
memory usage: 84.1+ MB
Defining Output and Input variables¶
Here are the Output and Input data features we will use in this session.
Outcome: Output (Y)¶
Description | Variable | Values
---|---|---
Likelihood of taking up a coronavirus vaccination | cf_vaxxer | 1 = Very likely, 2 = Likely, 3 = Unlikely, 4 = Very unlikely
Predictors: Input features (X)¶
We select four demographic and health-related variables as examples only; no prior literature or expert knowledge is considered. We will discuss the role of prior literature and expert knowledge in the process of variable selection when we learn causal inference approaches.
Description | Variable | Values
---|---|---
Age | cf_age | Integer values (whole numbers)
Respondent sex | cf_sex_cv | 1 = Male, 2 = Female, 3 = Prefer not to say
General health | cf_scsf1 | 1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor
At risk of serious illness from Covid-19 | cf_clinvuln_dv | 0 = no risk (not clinically vulnerable), 1 = moderate risk (clinically vulnerable), 2 = high risk (clinically extremely vulnerable)
Data wrangling¶
# Select output y and input X variables
USocietyCovid = USocietyCovid[
["cf_vaxxer", "cf_age", "cf_sex_cv", "cf_scsf1", "cf_clinvuln_dv"]
]
USocietyCovid.head()
  | cf_vaxxer | cf_age | cf_sex_cv | cf_scsf1 | cf_clinvuln_dv
---|---|---|---|---|---
0 | 2 | 37 | 2 | 2 | 0
1 | 3 | 35 | 1 | 4 | 0
2 | 3 | 55 | 2 | 2 | 0
3 | 1 | 38 | 1 | 3 | 1
4 | 1 | 67 | 2 | 2 | 0
import seaborn as sns
sns.set_context("notebook", font_scale=1.5)
%matplotlib inline
fig = sns.catplot(
x="cf_vaxxer",
kind="count",
height=6,
aspect=1.5,
palette="ch:.25",
data=USocietyCovid,
)
# Tweak the plot
(
fig.set_axis_labels(
"Likelihood of taking up a coronavirus vaccination", "Frequency"
)
.set_xticklabels(
[
"missing",
"inapplicable",
"refusal",
"don't know",
"Very likely",
"Likely",
"Unlikely",
"Very unlikely",
]
)
.set_xticklabels(rotation=45)
)
[Figure: bar chart of response counts for cf_vaxxer, with bars for the negative missing-value codes (missing, inapplicable, refusal, don't know) alongside Very likely, Likely, Unlikely, and Very unlikely]
Missing observations in Understanding Society are indicated by negative values. Let’s convert negative values to NaN using the mask function in pandas. An alternative approach would be to reload the data using the pandas read_csv() function and provide the negative values as an argument to the parameter na_values, as a result of which pandas will recognise these values as NaN (a sketch of this alternative follows the output below).
# The function 'mask' in pandas replaces values where a condition is met.
USocietyCovid = USocietyCovid.mask(USocietyCovid < 0)
# Alternatively, you could replace negative values with another value, e.g., 0,
# using the code USocietyCovid.mask(USocietyCovid < 0, 0).
USocietyCovid
  | cf_vaxxer | cf_age | cf_sex_cv | cf_scsf1 | cf_clinvuln_dv
---|---|---|---|---|---
0 | 2.0 | 37 | 2 | 2.0 | 0.0
1 | 3.0 | 35 | 1 | 4.0 | 0.0
2 | 3.0 | 55 | 2 | 2.0 | 0.0
3 | 1.0 | 38 | 1 | 3.0 | 1.0
4 | 1.0 | 67 | 2 | 2.0 | 0.0
... | ... | ... | ... | ... | ...
12030 | 1.0 | 57 | 1 | 2.0 | 0.0
12031 | 2.0 | 70 | 2 | 3.0 | 1.0
12032 | 2.0 | 64 | 1 | 2.0 | 0.0
12033 | 4.0 | 31 | 1 | 1.0 | 0.0
12034 | 3.0 | 41 | 2 | 3.0 | 0.0
12035 rows × 5 columns
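For illustration, here is a sketch of the alternative approach mentioned above, declaring the missing-value codes at load time. The specific negative codes listed here are an assumption for illustration only; check the study documentation for the full list of codes used in Understanding Society.
# A sketch of the alternative: declare the negative missing-value codes when
# loading the file so that pandas records them as NaN from the start.
# The codes below (-1, -2, -7, -8, -9) are illustrative assumptions;
# consult the study documentation for the actual list.
USocietyCovid_alt = pd.read_csv(
    "/content/drive/My Drive/cf_indresp_w.tab",
    sep="\t",
    na_values=[-1, -2, -7, -8, -9],
)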
# Remove NaN
USocietyCovid = USocietyCovid[
["cf_vaxxer", "cf_age", "cf_sex_cv", "cf_scsf1", "cf_clinvuln_dv"]
].dropna()
USocietyCovid
  | cf_vaxxer | cf_age | cf_sex_cv | cf_scsf1 | cf_clinvuln_dv
---|---|---|---|---|---
0 | 2.0 | 37 | 2 | 2.0 | 0.0
1 | 3.0 | 35 | 1 | 4.0 | 0.0
2 | 3.0 | 55 | 2 | 2.0 | 0.0
3 | 1.0 | 38 | 1 | 3.0 | 1.0
4 | 1.0 | 67 | 2 | 2.0 | 0.0
... | ... | ... | ... | ... | ...
12030 | 1.0 | 57 | 1 | 2.0 | 0.0
12031 | 2.0 | 70 | 2 | 3.0 | 1.0
12032 | 2.0 | 64 | 1 | 2.0 | 0.0
12033 | 4.0 | 31 | 1 | 1.0 | 0.0
12034 | 3.0 | 41 | 2 | 3.0 | 0.0
11930 rows × 5 columns
# Plot the new cf_vaxxer (vaccination likelihood) variable
fig = sns.catplot(
x="cf_vaxxer",
kind="count",
height=6,
aspect=1.5,
palette="ch:.25",
data=USocietyCovid,
)
# Tweak the plot
(
fig.set_axis_labels(
"Likelihood of taking up a coronavirus vaccination", "Frequency"
)
.set_xticklabels(["Very likely", "Likely", "Unlikely", "Very unlikely"])
.set_xticklabels(rotation=45)
)
[Figure: bar chart of response counts for cf_vaxxer after removing missing values, with bars for Very likely, Likely, Unlikely, and Very unlikely]
To simplify the problem, we will recode the cf_vaxxer (vaccination likelihood) variable into a binary variable where 1 refers to ‘Likely to take up a Covid-19 vaccine’ and 0 refers to ‘Unlikely to take up a Covid-19 vaccine’. To achieve this, we use the replace() method, which replaces a set of values we specify (in our case, [1, 2, 3, 4]) with another set of values we specify (in our case, [1, 1, 0, 0]).
# Recode cf_vaxxer into a binary variable
USocietyCovid["cf_vaxxer"] = USocietyCovid["cf_vaxxer"].replace(
[1, 2, 3, 4], [1, 1, 0, 0]
)
USocietyCovid.head()
  | cf_vaxxer | cf_age | cf_sex_cv | cf_scsf1 | cf_clinvuln_dv
---|---|---|---|---|---
0 | 1.0 | 37 | 2 | 2.0 | 0.0
1 | 0.0 | 35 | 1 | 4.0 | 0.0
2 | 0.0 | 55 | 2 | 2.0 | 0.0
3 | 1.0 | 38 | 1 | 3.0 | 1.0
4 | 1.0 | 67 | 2 | 2.0 | 0.0
# Plot the binary cf_vaxxer (vaccination likelihood) variable
fig = sns.catplot(
x="cf_vaxxer",
kind="count",
height=6,
aspect=1.5,
palette="ch:.25",
data=USocietyCovid,
)
# Tweak the plot
(
fig.set_axis_labels(
"Likelihood of taking up a coronavirus vaccination", "Frequency"
)
.set_xticklabels(["Unlikely", "Likely"])
.set_xticklabels(rotation=45)
)
[Figure: bar chart of the binary cf_vaxxer variable, with bars for Unlikely (0) and Likely (1)]
USocietyCovid.groupby("cf_vaxxer").count()
cf_vaxxer | cf_age | cf_sex_cv | cf_scsf1 | cf_clinvuln_dv
---|---|---|---|---
0.0 | 1837 | 1837 | 1837 | 1837
1.0 | 10093 | 10093 | 10093 | 10093
USocietyCovid.shape[0]
11930
# 84.6% of respondents very likely or likely to take up a Covid vaccine
# and 15.4% very unlikely or unlikely
USocietyCovid.groupby("cf_vaxxer").count() / USocietyCovid.shape[0]
cf_vaxxer | cf_age | cf_sex_cv | cf_scsf1 | cf_clinvuln_dv
---|---|---|---|---
0.0 | 0.153982 | 0.153982 | 0.153982 | 0.153982
1.0 | 0.846018 | 0.846018 | 0.846018 | 0.846018
So far, we have described our outcome variable to make sense of the task, but we have neither looked at the predictor variables nor examined any relationships between predictors and outcome. It is good practice to first split the data into a training set and a test set, and only then explore predictors and relationships in the training set.
Overfitting and data splitting¶
The problem of model overfitting¶
Overfitting occurs when a model captures ‘noise’ in a specific sample while failing to recognise general patterns across samples. As a result of overfitting, the model produces accurate predictions for examples from the sample at hand but predicts poorly on new examples it has never seen.
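To see overfitting concretely, here is a small toy sketch using synthetic data (not the survey data): a 1-nearest-neighbour classifier memorises its training sample perfectly but performs noticeably worse on data it has never seen.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: two noisy features and a noisy binary label
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] + rng.normal(size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

print("Training accuracy:", model.score(X_tr, y_tr))  # 1.0 — the sample is memorised
print("Test accuracy:", model.score(X_te, y_te))      # lower on unseen data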
Training set, Validation set, and Test set¶
To avoid overfitting, data is typically split into three groups:
Training set — used to train models
Validation set — used to tune the model and estimate model performance/accuracy for best model selection
Test set — used to evaluate the generalisability of the model to new observations the model has never seen
If your data set is not large enough, a possible strategy, which we will use here, is to split the data into a training set and a test set, and use cross-validation on the training set to evaluate our models’ performance/accuracy. We will use two-thirds of the data to train the predictive models and the remaining one-third as the test set.
# Split train and test data
from sklearn.model_selection import train_test_split
# Outcome variable
y = USocietyCovid[["cf_vaxxer"]]
# Predictor variables
X = USocietyCovid[["cf_age", "cf_sex_cv", "cf_scsf1", "cf_clinvuln_dv"]]
# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, stratify=y, random_state=0
)
print("Train data", X_train.shape, "\n" "Test data", X_test.shape)
Train data (7993, 4)
Test data (3937, 4)
Preprocessing the training data set¶
Categorical predictors — dummy variables¶
Categorical variables are often encoded using numeric values. For example, Respondent sex is recorded as 1 = Male, 2 = Female, 3 = Prefer not to say. The numeric values can be ‘misinterpreted’ by the algorithms — the value of 1 is less than the value of 3, but that ordering does not correspond to any real-world numerical difference between the categories.
A solution is to convert categorical predictors into dummy variables. Each category value is converted into a new column that is assigned a value of 1 or 0 (True/False), using the get_dummies function in pandas.
The Respondent sex variable is converted below into three columns of 1s and 0s corresponding to the respective values.
# Use get_dummies to convert the Respondent sex categorical variable into
# 3 dummy/indicator variables
X_train_predictors = pd.get_dummies(X_train, columns=["cf_sex_cv"])
X_train_predictors.head()
  | cf_age | cf_scsf1 | cf_clinvuln_dv | cf_sex_cv_1 | cf_sex_cv_2 | cf_sex_cv_3
---|---|---|---|---|---|---
994 | 28 | 2.0 | 0.0 | 0 | 1 | 0
11376 | 41 | 5.0 | 0.0 | 0 | 1 | 0
9730 | 77 | 3.0 | 1.0 | 0 | 1 | 0
8235 | 41 | 3.0 | 2.0 | 0 | 1 | 0
1406 | 60 | 2.0 | 0.0 | 0 | 1 | 0
# Create two DataFrames, one for numerical variables
# and one for categorical variables
X_train_predictors_cat = X_train_predictors[
["cf_sex_cv_1", "cf_sex_cv_2", "cf_sex_cv_3"]
]
X_train_predictors_cont = X_train_predictors[["cf_age", "cf_scsf1", "cf_clinvuln_dv"]]
Continuous predictors — standardisation¶
We standardise the continuous input variables so that each has a mean of 0 and a standard deviation of 1 (z = (x − mean) / std). This puts predictors measured on different scales, such as age and general health, on a comparable footing, which matters for distance-based classifiers such as k-NN.
# Standardise the predictors using the StandardScaler function in sklearn
from sklearn.preprocessing import StandardScaler  # For standardising data
scaler = StandardScaler() # Initialising the scaler using the default arguments
X_train_predictors_cont_scale = scaler.fit_transform(
X_train_predictors_cont
) # Fit to continuous input variables and return the standardised dataset
X_train_predictors_cont_scale
array([[-1.65608796, -0.56081614, -0.80754698],
[-0.84497135, 2.64868513, -0.80754698],
[ 1.40119771, 0.50901761, 0.81873748],
...,
[-0.96975852, -1.6306499 , -0.80754698],
[ 0.46529394, 0.50901761, 2.44502193],
[ 0.77726186, 0.50901761, 0.81873748]])
Combine categorical and continuous predictors into one data array¶
# Use the concatenate function in Numpy to combine all variables
# (both categorical and continuous predictors) in one array
X_train_preprocessed = np.concatenate(
[X_train_predictors_cont_scale, X_train_predictors_cat], axis=1
)
X_train_preprocessed
array([[-1.65608796, -0.56081614, -0.80754698, 0. , 1. ,
0. ],
[-0.84497135, 2.64868513, -0.80754698, 0. , 1. ,
0. ],
[ 1.40119771, 0.50901761, 0.81873748, 0. , 1. ,
0. ],
...,
[-0.96975852, -1.6306499 , -0.80754698, 0. , 1. ,
0. ],
[ 0.46529394, 0.50901761, 2.44502193, 0. , 1. ,
0. ],
[ 0.77726186, 0.50901761, 0.81873748, 1. , 0. ,
0. ]])
X_train_preprocessed.shape
(7993, 6)
Unbalanced class problem¶
In the case of the vaccination likelihood question, one of the classes (likely to vaccinate) has a significantly greater proportion of cases (84.6%) than the other class (unlikely to vaccinate, 15.4%). We therefore face an unbalanced class problem.
Different methods to mitigate the problem exist. We will use a method called ADASYN: Adaptive Synthetic Sampling Method for Imbalanced Data. The method oversamples the minority class in the training data set until both classes have an equal number of observations. Hence, the data set we use to train our models contains two balanced classes.
from imblearn.over_sampling import ADASYN
# Initialization of the ADASYN resampling method; set random_state for reproducibility
adasyn = ADASYN(random_state=0)
# Fit the ADASYN resampling method to the train data
X_train_balance, y_train_balance = adasyn.fit_resample(X_train_preprocessed, y_train)
The resulting X_train_balance and y_train_balance now include both the original data and the resampled data. y_train_balance now includes an almost equal number of labels for each class.
# Now that the two classes are balanced, the train data
# is ~14K observations, greater than the original ~8K.
X_train_balance.shape
(13611, 6)
Hands-on mini-exercise¶
Verify that after the oversampling the y_train_balance data object indeed contains an approximately equal number of observations for both classes: those likely to vaccinate (1) and those unlikely to vaccinate (0).
Note that y_train_balance is a pandas DataFrame (fit_resample returns labels in the same format as the y_train input, as the labelled output below confirms). You can check that on your own using the function type(), for example: type(y_train_balance).
(y_train_balance == 0).sum()
cf_vaxxer 6849
dtype: int64
(y_train_balance == 1).sum()
cf_vaxxer 6762
dtype: int64
Train models on training data¶
We fit two widely used classifiers — k-Nearest Neighbours (k-NN) and Logistic Regression — on the training data. Our focus is on the end-to-end workflow, so we do not discuss the workings of the two classifiers in detail. To learn more about the two classifiers, see the Python Data Science Handbook by Jake VanderPlas (on k-NN) and the DataCamp course Supervised Learning with scikit-learn.
In the models below, we use the default hyperparameters (hyperparameters are parameters that are not learned from data but are set by the researcher to guide the learning process) for both classifiers.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
# Create an instance of k-nearest neighbors (k-NN) classifier.
# We set the hyperparameter n_neighbors=5 meaning that
# the label of an unknown respondent (0 or 1) is a function of
# the labels of its five closest training respondents.
kNN_Classifier = KNeighborsClassifier(n_neighbors=5)
# Create an instance of Logistic Regression Classifier
LogReg_Classifier = LogisticRegression()
# Fit both models to the training data
kNN_Classifier.fit(X_train_balance, y_train_balance.cf_vaxxer)
LogReg_Classifier.fit(X_train_balance, y_train_balance.cf_vaxxer)
LogisticRegression()
Model evaluation using Cross-validation¶
Now that your two models are fitted, you can evaluate the accuracy of their predictions. In older approaches, prediction accuracy was often calculated on the same training data used to fit the model. The problem with such an approach is that the model can ‘memorise’ the training data and show high prediction accuracy on that data set while failing to perform well on new data. For this reason, approaches in data science, and machine learning in particular, prefer to evaluate the prediction accuracy of a model on new data that has not been used in training the model.
The cross-validation technique¶
Cross-validation is a technique for assessing the accuracy of model prediction without relying on in-sample prediction. We split our training set into k equal folds or parts. The number of folds can differ, but for simplicity we will consider 5-fold cross-validation. How does 5-fold cross-validation work? We set aside one fold, fit the model on the remaining four folds, use the fitted model to predict the outcomes of observations in the held-out fold, and compute the prediction accuracy on that fold. We repeat the procedure for each of the 5 folds and compute the average prediction accuracy.
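For intuition, here is a minimal sketch of what 5-fold cross-validation does under the hood, using scikit-learn’s KFold splitter on the balanced training data created above (the cross_val_score function used below wraps exactly this kind of loop):
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X_train_balance):
    # Fit on four folds, evaluate accuracy on the held-out fold
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train_balance[train_idx], y_train_balance.cf_vaxxer.iloc[train_idx])
    fold_scores.append(
        model.score(X_train_balance[val_idx], y_train_balance.cf_vaxxer.iloc[val_idx])
    )
print(np.mean(fold_scores))  # average accuracy across the five folds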
Metrics to evaluate model performance¶
Many metrics to evaluate model performance exist. We evaluate model performance using the accuracy score. The accuracy score is the simplest metric for evaluating classification models. Simply put, accuracy is the proportion of predictions our model got right. Keep in mind, however, that because of the unbalanced class problem, accuracy may not be the best metric in our case. Because one of our classes accounts for 84.6% of the cases, even a model that uniformly predicts that all respondents are likely to take up the vaccine will obtain very high accuracy of 0.846 while being useless for identifying respondents who are unlikely to take up the vaccine. We will return to this problem shortly.
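To make this point concrete, here is a quick baseline check — a sketch using scikit-learn’s DummyClassifier on the original, unbalanced training objects created above. A ‘model’ that always predicts the majority class reaches roughly 0.846 accuracy while never identifying a single respondent who is unlikely to vaccinate.
from sklearn.dummy import DummyClassifier

# A baseline that always predicts the majority class (likely to vaccinate)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_preprocessed, y_train.cf_vaxxer)
print(baseline.score(X_train_preprocessed, y_train.cf_vaxxer))  # ≈ 0.846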
# Import the function cross_val_score() which performs
# cross-validation and evaluates the model using a score.
# Many scores are available to evaluate a classification model,
# as a starting point, we select the simplest one called accuracy.
from sklearn.model_selection import cross_val_score
# Evaluate the kNN_Classifier model via 5-fold cross-validation
kNN_score = cross_val_score(
kNN_Classifier, X_train_balance, y_train_balance.cf_vaxxer, cv=5, scoring="accuracy"
)
kNN_score
array([0.6232097 , 0.65686995, 0.64070536, 0.65980896, 0.64731815])
# Take the mean across the five accuracy scores
kNN_score.mean() * 100
64.55824239753719
# Repeat for our logistic regression model
LogReg_score = cross_val_score(
LogReg_Classifier,
X_train_balance,
y_train_balance.cf_vaxxer,
cv=5,
scoring="accuracy",
)
LogReg_score.mean() * 100
62.21431553077533
The output from the cross-validation technique shows that the performance of our two models is comparable as measured by the accuracy score.
At this stage, we could fine-tune model hyperparameters — i.e., parameters that the model does not learn from data, e.g., the number of neighbours k in the k-NN algorithm — and re-evaluate model performance (a sketch of such tuning follows below). During the process of model validation, we do not use the test data. Once we are happy with how our model(s) perform, we test the model on unseen data.
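Here is a sketch of what such fine-tuning could look like, using scikit-learn’s GridSearchCV to search over the number of neighbours k via 5-fold cross-validation on the balanced training data, still without touching the test set. The candidate values for k are chosen for illustration only.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": [3, 5, 7, 9, 11]}  # illustrative candidate values
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train_balance, y_train_balance.cf_vaxxer)
print(grid.best_params_, grid.best_score_)  # best k and its cross-validated accuracy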
Testing model accuracy on new data¶
Before we test the accuracy of our model on the test data set, we preprocess the test data set using the same procedure we used to preprocess the training data.
Preprocessing the test data set¶
# Use get_dummies to convert the respondent sex categorical variable
# into 3 dummy/indicator variables.
X_test_predictors = pd.get_dummies(X_test, columns=["cf_sex_cv"])
# Create two DataFrames, one for quantitative variables and one for qualitative variables
X_test_predictors_cat = X_test_predictors[["cf_sex_cv_1", "cf_sex_cv_2", "cf_sex_cv_3"]]
X_test_predictors_cont = X_test_predictors[["cf_age", "cf_scsf1", "cf_clinvuln_dv"]]
# Standardise the predictors using the StandardScaler function in sklearn.
# Note: strictly speaking, the scaler should be fitted on the training data
# only and then applied to the test data with scaler.transform(), so that no
# information from the test set leaks into preprocessing; we re-fit here to
# keep the example simple.
scaler = StandardScaler()  # Initialising the scaler using the default arguments
X_test_predictors_cont_scale = scaler.fit_transform(
    X_test_predictors_cont
)  # Fit to continuous input variables and return the standardised dataset
# Use the concatenate function in Numpy to combine all variables
# (both categorical and continuous predictors) in one array
X_test_preprocessed = np.concatenate(
[X_test_predictors_cont_scale, X_test_predictors_cat], axis=1
)
X_test_preprocessed
array([[ 0.44686144, -1.63480238, -0.82552056, 1. , 0. ,
0. ],
[-0.41738323, -0.58244513, -0.82552056, 0. , 1. ,
0. ],
[-0.35565147, -0.58244513, -0.82552056, 1. , 0. ,
0. ],
...,
[ 0.01473911, -0.58244513, -0.82552056, 1. , 0. ,
0. ],
[ 1.37283788, 0.46991213, 0.7914319 , 0. , 1. ,
0. ],
[-0.47911499, -1.63480238, -0.82552056, 1. , 0. ,
0. ]])
Predicting vaccine hesitancy¶
Use the predict function to predict who is likely to take up the COVID-19 vaccine or not using the test data.
y_pred_kNN = kNN_Classifier.predict(X_test_preprocessed)
y_pred_LogReg = LogReg_Classifier.predict(X_test_preprocessed)
y_pred_LogReg
array([1., 0., 1., ..., 1., 1., 1.])
Model evaluation on test data¶
Let’s evaluate the performance of our models predicting vaccination willingness using the accuracy metric.
# Evaluate performance using the accuracy score for the logistic regresson model
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred_LogReg)
0.6245872491744984
# Evaluate performance using the accuracy score for the k-nearest neighbors model
accuracy_score(y_test, y_pred_kNN)
0.5044450088900178
Compared with the cross-validation scores on the training data, the accuracy on the test set is similar for the logistic regression model but noticeably lower for the k-NN model. Evaluating on data the models have never seen gives a more honest estimate of how well they generalise and guards against overfitting.
Accuracy is a good metric when the positive class and the negative class are balanced. However, when one of the classes is a majority, as in our case, a model can achieve high accuracy by simply predicting that all observations belong to the majority class. This is not what we want: in order to inform an information campaign about vaccination, we are more interested in predicting the minority class, people who are unlikely to take up the vaccine.
We can use a confusion matrix to further evaluate the performance of our classification models. The confusion matrix shows the number of respondents known to be in group 0 (unlikely to vaccinate) or 1 (likely to vaccinate) and predicted to be in group 0 or 1, respectively.
The confusion matrix below shows that the logistic regression model correctly predicts 393 of the 606 respondents who are unlikely to vaccinate (about 65%) and 2066 of the 3331 respondents who are likely to vaccinate (about 62%).
# Confusion matrix for the logistic regression model
# plotted via the Pandas crosstab() function
pd.crosstab(
y_test.cf_vaxxer,
y_pred_LogReg,
rownames=["Actual"],
colnames=["Predicted"],
margins=True,
)
Actual \ Predicted | 0.0 | 1.0 | All
---|---|---|---
0.0 | 393 | 213 | 606
1.0 | 1265 | 2066 | 3331
All | 1658 | 2279 | 3937
What do the numbers in the confusion matrix mean?
True positive - our model correctly predicts the positive class (likely to vaccinate)
True negative - our model correctly predicts the negative class (unlikely to vaccinate)
False positive - our model incorrectly predicts the positive class
False negative - our model incorrectly predicts the negative class
# Here is another representation of the confusion matrix
# using the scikit-learn `confusion_matrix` function
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred_LogReg)
array([[ 393, 213],
[1265, 2066]])
# The function ravel() flattens the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_LogReg).ravel()
print(
"True negative = ",
tn,
"\nFalse positive = ",
fp,
"\nFalse negative = ",
fn,
"\nTrue positive = ",
tp,
)
True negative = 393
False positive = 213
False negative = 1265
True positive = 2066
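From these four numbers we can compute by hand the precision and recall for the minority class (0 = unlikely to vaccinate); the results match the classification report shown later in this section.
# Precision for class 0: of those predicted unlikely, how many actually are?
precision_0 = tn / (tn + fn)  # 393 / (393 + 1265) ≈ 0.24
# Recall for class 0: of those actually unlikely, how many did we find?
recall_0 = tn / (tn + fp)     # 393 / (393 + 213) ≈ 0.65
print(f"Minority-class precision = {precision_0:.2f}, recall = {recall_0:.2f}")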
For the k-nearest neighbours model, the confusion matrix below shows that the model correctly predicts 316 of the 606 respondents who are unlikely to vaccinate (the logistic regression model predicts those unlikely to vaccinate more accurately). The k-NN model also does not predict well the respondents who are likely to vaccinate, correctly identifying only 1670 of the 3331 such respondents in the test data set.
Recall that we are less interested in predicting the majority class (likely to vaccinate). Instead, we are interested in predicting the minority class (unlikely to vaccinate) so that the results can inform an information campaign among people that are unlikely to vaccinate.
# Confusion matrix for the k-nearest neighbors model
# plotted via pandas function crosstab
pd.crosstab(
y_test.cf_vaxxer,
y_pred_kNN,
rownames=["Actual"],
colnames=["Predicted"],
margins=True,
)
Actual \ Predicted | 0.0 | 1.0 | All
---|---|---|---
0.0 | 316 | 290 | 606
1.0 | 1661 | 1670 | 3331
All | 1977 | 1960 | 3937
Instead of relying on a single metric, it is often helpful (if at times confusing) to compare various metrics. You can use the scikit-learn function classification_report to calculate various classification metrics, including precision and recall.
from sklearn.metrics import classification_report
# Various metrics for the logistic regression model
print(classification_report(y_test, y_pred_LogReg))
precision recall f1-score support
0.0 0.24 0.65 0.35 606
1.0 0.91 0.62 0.74 3331
accuracy 0.62 3937
macro avg 0.57 0.63 0.54 3937
weighted avg 0.80 0.62 0.68 3937
# Various metrics for the k-nearest neighbors model
print(classification_report(y_test, y_pred_kNN))
precision recall f1-score support
0.0 0.16 0.52 0.24 606
1.0 0.85 0.50 0.63 3331
accuracy 0.50 3937
macro avg 0.51 0.51 0.44 3937
weighted avg 0.75 0.50 0.57 3937
Overall, prediction accuracy of approximately 62% and 50% for the two models, together with the low predictive accuracy on the minority class (unlikely to vaccinate), indicates that the performance of our models is far from optimal. However, the purpose of this lab is not to build a well-performing model but to introduce you to an end-to-end machine learning workflow.
Keep in mind that it is not good research practice to now — after you have tested the models on the test data — go back and fine-tune the training models, as this would introduce overfitting: the test data would no longer be unseen. Good research practice is to fine-tune and improve your model(s) at the stage of training and cross-validation (not after you have tested your model on unseen data). Once you select your best-performing model(s) at the cross-validation stage, you test the model using the test data and report the performance scores.
As part of your data analysis exercises, you will have another opportunity to build a new machine learning model and evaluate model performance.