The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables

Himabindu Lakkaraju et al. KDD 2017, pp. 275-284.
doi: 10.1145/3097983.3098066

Abstract

Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, there are many domains where the data is selectively labeled, in the sense that the observed outcomes are themselves a consequence of the existing choices of the human decision-makers. For instance, in the context of judicial bail decisions, we observe the outcome of whether a defendant fails to return for their court appearance only if the human judge decides to release the defendant on bail. This selective labeling makes it harder to evaluate predictive models because the instances for which outcomes are observed do not represent a random sample of the population. Here we propose a novel framework for evaluating the performance of predictive models on selectively labeled data. We develop an approach called contraction which allows us to compare the performance of predictive models and human decision-makers without resorting to counterfactual inference. Our methodology harnesses the heterogeneity of human decision-makers and facilitates effective evaluation of predictive models even in the presence of unmeasured confounders (unobservables) that influence both human decisions and the resulting outcomes. Experimental results on real-world datasets spanning diverse domains such as health care, insurance, and criminal justice demonstrate the utility of our evaluation metric in comparing human decisions and machine predictions.
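The contraction idea described above can be made concrete with a short sketch. The Python function below is an illustrative reconstruction under stated assumptions, not the authors' reference implementation: it assumes NumPy arrays covering the cases assigned to the most lenient decision-maker, with the model's risk score for every case, the human's release decision, and failure outcomes observed only for released cases. The name contraction_failure_rate and the target_acceptance parameter are invented for this sketch.

    import numpy as np

    def contraction_failure_rate(risk_scores, released, failed, target_acceptance):
        # risk_scores: model risk for each case handled by the MOST LENIENT
        #              decision-maker (higher = riskier).
        # released:    True where the human released the case; outcomes are
        #              observed only for these (the selective labels problem).
        # failed:      True where a released case went on to fail; entries
        #              for detained cases are never read.
        # target_acceptance: fraction of all cases the model may release.
        n_total = len(risk_scores)
        n_keep = int(round(target_acceptance * n_total))
        released_idx = np.flatnonzero(released)
        if n_keep > released_idx.size:
            # Contraction only works at acceptance rates at or below the
            # lenient human's own acceptance rate.
            raise ValueError("target acceptance exceeds the human's leniency")
        # "Contract" the human's released set: the model additionally detains
        # the riskiest released cases until only n_keep remain released.
        order = np.argsort(risk_scores[released_idx])  # least risky first
        kept = released_idx[order[:n_keep]]
        # All kept cases were released by the human, so their outcomes are
        # observed; the failure rate is reported per case seen.
        return failed[kept].sum() / n_total

Because the model is only ever asked to additionally detain cases drawn from the human's released pool, every outcome it needs is already observed, which is how contraction sidesteps counterfactual inference.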


Figures

Figure 1. Selective labels problem.

Figure 2. Pictorial representation of the contraction technique.

Figure 3. Effect of selective labels on the estimated failure rate of the predictive model (error bars denote standard errors). The green curve is the model's true failure rate, which the machine evaluation using contraction (blue curve) follows very closely. The various imputation techniques, however, heavily underestimate the failure rate: based on their estimates, one would conclude that the predictive model outperforms the human judges (red curve), when in fact its true performance is worse.

Figure 4. Effect of unobservables (error bars denote standard errors). As the influence of the unobservable Z on the outcome Y increases, imputation techniques yield erroneous estimates of model performance, whereas contraction continues to produce reliable estimates.

Figure 5. Effect on the contraction failure-rate estimates of the acceptance rate of the most lenient decision-makers (left), the agreement rate between the black-box model and the most lenient decision-makers (center), and the number of subjects judged by the most lenient decision-makers (right). Error bars denote standard errors.

Figure 6. Comparison of the performance of human decision-makers and predictive models on the bail (left), medical treatment (center), and insurance (right) datasets (error bars denote standard errors). The labeled-outcomes-only curve yields over-optimistic estimates; contraction produces more accurate estimates of model performance. A toy simulation of this over-optimism follows the figure list.
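To see why the labeled-outcomes-only evaluation (Figure 6) and the imputation baselines (Figures 3 and 4) can be over-optimistic, consider the toy simulation below. The data-generating process, coefficients, and leniency threshold are all invented assumptions chosen only to exhibit the mechanism; they are not the paper's experimental setup.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    x = rng.normal(size=n)  # feature seen by both the human and the model
    z = rng.normal(size=n)  # unobservable: seen by the human, hidden from the model

    # True failure probability depends on both x and z.
    y_fail = rng.random(n) < 1.0 / (1.0 + np.exp(-(x + z)))

    # A human decision-maker releases cases that look safe given x AND z,
    # so outcomes are observed (labeled) only for this selected subset.
    released = x + z + rng.normal(scale=0.5, size=n) < 0.5

    # The model ranks risk by x alone and releases the same number of
    # least-risky cases as the human did.
    model_released = np.zeros(n, dtype=bool)
    model_released[np.argsort(x)[: released.sum()]] = True

    true_rate = y_fail[model_released].mean()                # needs an oracle
    labeled_only = y_fail[model_released & released].mean()  # all we can observe

    print(f"true failure rate among model's releases : {true_rate:.3f}")
    print(f"labeled-outcomes-only estimate           : {labeled_only:.3f}")

Because the human saw the unobservable Z when deciding whom to release, the labeled subset of the model's releases is systematically safer than its full release set, so the naive estimate comes out too low and the model looks better than it is.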
