The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables

Himabindu Lakkaraju et al. KDD 2017, pp. 275-284.
doi: 10.1145/3097983.3098066

Abstract

Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, there are many domains where the data is selectively labeled, in the sense that the observed outcomes are themselves a consequence of the existing choices of the human decision-makers. For instance, in the context of judicial bail decisions, we observe the outcome of whether a defendant fails to return for their court appearance only if the human judge decides to release the defendant on bail. This selective labeling makes it harder to evaluate predictive models because the instances for which outcomes are observed do not represent a random sample of the population. Here we propose a novel framework for evaluating the performance of predictive models on selectively labeled data. We develop an approach called contraction which allows us to compare the performance of predictive models and human decision-makers without resorting to counterfactual inference. Our methodology harnesses the heterogeneity of human decision-makers and facilitates effective evaluation of predictive models even in the presence of unmeasured confounders (unobservables) that influence both human decisions and the resulting outcomes. Experimental results on real-world datasets spanning diverse domains such as health care, insurance, and criminal justice demonstrate the utility of our evaluation metric in comparing human decisions and machine predictions.
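The contraction idea described above can be made concrete with a short sketch. The Python function below is an illustrative reconstruction under stated assumptions, not the authors' reference implementation: it assumes NumPy arrays covering the cases assigned to the most lenient decision-maker, with the model's risk score for every case, the human's release decision, and failure outcomes observed only for released cases. The name contraction_failure_rate and the target_acceptance parameter are invented for this sketch.

    import numpy as np

    def contraction_failure_rate(risk_scores, released, failed, target_acceptance):
        # risk_scores: model risk for each case handled by the MOST LENIENT
        #              decision-maker (higher = riskier).
        # released:    True where the human released the case; outcomes are
        #              observed only for these (the selective labels problem).
        # failed:      True where a released case went on to fail; entries
        #              for detained cases are never read.
        # target_acceptance: fraction of all cases the model may release.
        n_total = len(risk_scores)
        n_keep = int(round(target_acceptance * n_total))
        released_idx = np.flatnonzero(released)
        if n_keep > released_idx.size:
            # Contraction only works at acceptance rates at or below the
            # lenient human's own acceptance rate.
            raise ValueError("target acceptance exceeds the human's leniency")
        # "Contract" the human's released set: the model additionally detains
        # the riskiest released cases until only n_keep remain released.
        order = np.argsort(risk_scores[released_idx])  # least risky first
        kept = released_idx[order[:n_keep]]
        # All kept cases were released by the human, so their outcomes are
        # observed; the failure rate is reported per case seen.
        return failed[kept].sum() / n_total

Because the model is only ever asked to additionally detain cases drawn from the human's released pool, every outcome it needs is already observed, which is how contraction sidesteps counterfactual inference.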


Figures

Figure 1. Selective labels problem.

Figure 2. Pictorial representation of the contraction technique.

Figure 3. Effect of selective labels on the estimated failure rate of the predictive model (error bars denote standard errors). The green curve is the model's true failure rate, which the machine evaluation using contraction (blue curve) follows very closely. The various imputation techniques, however, heavily underestimate the failure rate: based on their estimates, one would conclude that the predictive model outperforms the human judges (red curve), when in fact its true performance is worse.

Figure 4. Effect of unobservables (error bars denote standard errors). As the influence of the unobservable Z on the outcome Y increases, imputation techniques yield erroneous estimates of model performance, whereas contraction continues to produce reliable estimates.

Figure 5. Effect on the contraction failure-rate estimates of the acceptance rate of the most lenient decision-makers (left), the agreement rate between the black-box model and the most lenient decision-makers (center), and the number of subjects judged by the most lenient decision-makers (right). Error bars denote standard errors.

Figure 6. Comparison of the performance of human decision-makers and predictive models on the bail (left), medical treatment (center), and insurance (right) datasets (error bars denote standard errors). The labeled-outcomes-only curve yields over-optimistic estimates; contraction produces more accurate estimates of model performance. A toy simulation of this over-optimism follows the figure list.
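To see why the labeled-outcomes-only evaluation (Figure 6) and the imputation baselines (Figures 3 and 4) can be over-optimistic, consider the toy simulation below. The data-generating process, coefficients, and leniency threshold are all invented assumptions chosen only to exhibit the mechanism; they are not the paper's experimental setup.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.normal(size=n)  # feature seen by both the human and the model
    z = rng.normal(size=n)  # unobservable: seen by the human, hidden from the model

    # True failure probability depends on both x and z.
    y_fail = rng.random(n) < 1.0 / (1.0 + np.exp(-(x + z)))

    # A human decision-maker releases cases that look safe given x AND z,
    # so outcomes are observed (labeled) only for this selected subset.
    released = x + z + rng.normal(scale=0.5, size=n) < 0.5

    # The model ranks risk by x alone and releases the same number of
    # least-risky cases as the human did.
    model_released = np.zeros(n, dtype=bool)
    model_released[np.argsort(x)[: released.sum()]] = True

    true_rate = y_fail[model_released].mean()                # needs an oracle
    labeled_only = y_fail[model_released & released].mean()  # all we can observe

    print(f"true failure rate among model's releases : {true_rate:.3f}")
    print(f"labeled-outcomes-only estimate           : {labeled_only:.3f}")

Because the human saw the unobservable Z when deciding whom to release, the labeled subset of the model's releases is systematically safer than its full release set, so the naive estimate comes out too low and the model looks better than it is.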
