Data Ethics, Bias, and Fairness

Data ethics, privacy, and fairness belong at the beginning of a data science textbook, not at the end, and future drafts should integrate ethical challenges earlier. Data ethics is the last topic in this version for two reasons. First, an informed discussion of data ethics requires some intuition about the data science lifecycle and what can go wrong in it, including the different sources of bias. Learning how to evaluate models in general is a precondition for evaluating models for bias and fairness. Second, moving beyond the issues of data ethics to possible mitigation strategies requires some exposure to approaches covered earlier in the course, for example causal inference. Note also that, although our focused discussion of data ethics comes last, we discuss data design and biases (e.g., algorithmic confounding), transparency and reproducibility, and other topics of ethical data science throughout the textbook.

Key themes

  • Ethical challenges, principles, and frameworks.

  • Privacy and consent.

  • Detecting and dealing with bias and fairness in data science models.

  • Implications of data science biases: discrimination, profiling, social inequalities.

Learning resources

Jeremy Howard, Sylvain Gugger, and Rachel Thomas. Chapter 3: Data Ethics. In Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD. Online version freely available.
    Associated video lecture Getting Specific About Algorithmic Bias by Rachel Thomas. PyBay 2019. Slides are available here.

Matthew Salganik. Chapter 6: Ethics. In Bit by Bit: Social Research in the Digital Age. Online version freely available.
    Associated video lectures Ethics and Computational Social Science by Matthew Salganik, Part 1 and Part 2.

Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, Julia Lane. Big Data and Social Science (2nd edition). Chapter 11: Bias and Fairness. Online version freely available.

Sendhil Mullainathan. 2019. Biased Algorithms Are Easier to Fix Than Biased People. New York Times.

John D. Kelleher and Brendan Tierney. Chapter 6: Privacy and Ethics. In Data Science. MIT Press.

Mason A. Porter. Data Ethics. Video lecture and slides available here.

Pedro Saleiro, Kit T. Rodolfa, Rayid Ghani. Dealing with Bias and Fairness in Data Science Systems: A Practical Hands-on Tutorial.
    Corresponding video tutorial from KDD 2020.

Discussion: A-level results in England in 2020

This week we will discuss a data ethics topic instead of doing hands-on coding.

Our discussion topic: What could have gone wrong with the A-level results in England in 2020?

  • Algorithm or algorithms?

  • Equations?

  • Data science pipeline?

  • Historical data?

  • Design (e.g., inclusion and exclusion criteria)?

  • Conflicting notions of fairness (e.g., individual fairness, group fairness)?

  • Individual decision-making?

  • Politics?

Learning resources about the A-level controversy

Sean Coughlan. 2020. Why did the A-level algorithm say no? BBC.

Chris Giles. 2020. What went wrong with the A-level algorithm? Financial Times.

FT’s Editorial Board. 2020. Blame the politicians, not the technology, for A-level fiasco. Financial Times.

Bias and Fairness in Data Science Systems

Once we have come up with discussion points and hypotheses about what could have gone wrong with the A-level results, we will review the toolbox from ‘Dealing with Bias and Fairness in Data Science Systems’ and determine which of those discussion points and hypotheses could (or could not) be addressed by such a tool for debiasing data science systems.

Pedro Saleiro, Kit T. Rodolfa, Rayid Ghani. Dealing with Bias and Fairness in Data Science Systems: A Practical Hands-on Tutorial. See also the corresponding video tutorial from KDD 2020.

While exploring the toolbox, focus in particular on:

  • The attributes of individuals that you would use to evaluate the A-level data science system for fairness, and the reference group against which you would compare.

  • Fairness metrics that you would use to evaluate the A-level model results.
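
To make the two focus points above concrete, here is a minimal sketch of a group fairness audit in plain Python. It does not use the tutorial's toolbox or its API. The attribute name `school_type`, the group labels, the choice of "independent" as the reference group, and every record below are made up for illustration and are not real A-level data. Here `pred = 1` stands for a favourable outcome assigned by the model and `label = 1` for a student who in fact merited that outcome, which is itself an assumption, because in the A-level case the true outcome was never observed.

```python
from collections import defaultdict

def group_rates(records, attribute):
    """Compute selection rate and false negative rate for each group."""
    counts = defaultdict(lambda: {"n": 0, "selected": 0, "pos": 0, "fn": 0})
    for r in records:
        c = counts[r[attribute]]
        c["n"] += 1
        c["selected"] += r["pred"]                           # favourable predictions
        c["pos"] += r["label"]                               # truly merited the outcome
        c["fn"] += int(r["label"] == 1 and r["pred"] == 0)   # merited but denied
    rates = {}
    for group, c in counts.items():
        rates[group] = {
            "selection_rate": c["selected"] / c["n"],
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
        }
    return rates

def disparities(rates, reference_group):
    """Express each group's metrics as a ratio to the reference group's."""
    ref = rates[reference_group]
    result = {}
    for group, metrics in rates.items():
        result[group] = {}
        for name, value in metrics.items():
            if value is None or not ref[name]:
                result[group][name] = None   # ratio undefined
            else:
                result[group][name] = value / ref[name]
    return result

# Hypothetical toy records: four students per (assumed) school type.
records = [
    {"school_type": "independent", "label": 1, "pred": 1},
    {"school_type": "independent", "label": 1, "pred": 0},
    {"school_type": "independent", "label": 0, "pred": 1},
    {"school_type": "independent", "label": 0, "pred": 0},
    {"school_type": "state", "label": 1, "pred": 1},
    {"school_type": "state", "label": 1, "pred": 0},
    {"school_type": "state", "label": 1, "pred": 0},
    {"school_type": "state", "label": 0, "pred": 0},
]

rates = group_rates(records, "school_type")
disparity = disparities(rates, reference_group="independent")

for group in sorted(disparity):
    print(group, rates[group], disparity[group])
```

A disparity of 1.0 means parity with the reference group on that metric. Which metric matters depends on the notion of fairness you adopt: a selection rate ratio speaks to group-level parity in outcomes, while a false negative rate ratio asks whether deserving students in one group are denied more often than deserving students in another. The two can disagree on the same data, which is one way the conflicting notions of fairness in the discussion list show up in practice.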