J Am Med Inform Assoc. 2023 Dec 22;31(1):274-280.
doi: 10.1093/jamia/ocad178.

A framework for understanding label leakage in machine learning for health care

Sharon E Davis et al. J Am Med Inform Assoc. 2023.

Abstract

Introduction: The pitfalls of label leakage, the contamination of model input features with outcome information, are well established. Unfortunately, avoiding label leakage in clinical prediction models requires more nuance than the common advice of applying the "no time machine" rule.
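
As a rough illustration of the "no time machine" rule mentioned above, the sketch below keeps only feature observations recorded at or before the moment a prediction is made. It is not taken from the article: the `Observation` structure, the `available_at_prediction` helper, and the example values are all hypothetical. As the abstract notes, a purely temporal filter like this one is a starting point and does not settle whether a feature poses leakage concerns.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """One timestamped feature value (hypothetical structure, not from the article)."""
    name: str
    value: float
    recorded_at: datetime


def available_at_prediction(observations, prediction_time):
    """Keep only observations recorded at or before the prediction time."""
    return [obs for obs in observations if obs.recorded_at <= prediction_time]


# Made-up example data for a single patient encounter.
prediction_time = datetime(2023, 1, 1, 8, 0)
observations = [
    Observation("heart_rate", 112.0, datetime(2023, 1, 1, 7, 30)),
    Observation("lactate", 4.1, datetime(2023, 1, 1, 7, 55)),
    # Recorded after the prediction time, so the filter excludes it.
    Observation("antibiotic_order", 1.0, datetime(2023, 1, 1, 9, 15)),
]

kept = available_at_prediction(observations, prediction_time)
kept_names = [obs.name for obs in kept]
print(kept_names)
```

In this toy case the antibiotic order is dropped because it was entered after the prediction time. A feature can pass this check and still carry outcome information, which is the kind of nuance the framework is meant to address.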

Framework: We provide a framework for contemplating whether and when model features pose leakage concerns by considering the cadence, perspective, and applicability of predictions. To ground these concepts, we use real-world clinical models to highlight examples of appropriate and inappropriate label leakage in practice.

Recommendations: Finally, we provide recommendations to support clinical and technical stakeholders as they evaluate the leakage tradeoffs associated with model design, development, and implementation decisions. By providing common language and dimensions to consider when designing models, we hope the clinical prediction community will be better prepared to develop statistically valid and clinically useful machine learning models.

Keywords: clinical prediction; clinical utility; label leakage.

Conflict of interest statement

M.P.S. and S.B. are inventors of intellectual property licensed by Duke University to Clinetic, Inc, and Cohere-Med, Inc. M.P.S. and S.B. hold equity in Clinetic, Inc. M.E.M. and S.E.D. have no conflicts of interest to disclose.

Figures

Figure 1. Framework for evaluating label leakage over key temporal periods and their interaction with model contextual factors of cadence of predictions, perspective of predictions, and facets of prediction applicability.
Figure 2. Example use cases of the label leakage evaluation framework applied to cross-sectional models. Abbreviations: ED, Emergency Department; DNR, do not resuscitate order.
Figure 3. Example use cases of the label leakage evaluation framework applied to cohort models.
