Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging

Luke Oakden-Rayner¹, Jared Dunnmon², Gustavo Carneiro¹, Christopher Ré²

Affiliations

¹ Australian Institute for Machine Learning, University of Adelaide, Adelaide, Australia.
² Department of Computer Science, Stanford University, Stanford, California, USA.

PMID: 33196064
PMCID: PMC7665161
DOI: 10.1145/3368555.3384468

Luke Oakden-Rayner et al. Proc ACM Conf Health Inference Learn (2020). 2020 Apr;2020:151-159.

Abstract

Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model may still consistently miss a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring hidden stratification effects, and characterize these effects both via synthetic experiments on the CIFAR-100 benchmark dataset and on multiple real-world medical imaging datasets. Using these measurement techniques, we find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we discuss the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.
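The core measurement idea in the abstract, comparing performance on a labeled superclass with performance on its subclasses, can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: the function names, the subclass names ("common", "rare"), and the toy predictions are all assumptions made up for the example.

```python
from collections import defaultdict


def recall_by_subclass(y_pred, subclasses):
    """Recall (sensitivity) over positive cases, overall and per subclass.

    y_pred: 0/1 predictions for cases whose true superclass label is positive.
    subclasses: one subclass name per case (hypothetical labels).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, sub in zip(y_pred, subclasses):
        totals[sub] += 1
        hits[sub] += pred
    overall = sum(hits.values()) / sum(totals.values())
    per_subclass = {sub: hits[sub] / totals[sub] for sub in totals}
    return overall, per_subclass


def relative_gap(overall, subclass_value):
    """Relative drop of a subclass metric compared with the overall metric."""
    return (overall - subclass_value) / overall


# Made-up toy data: 90 cases of a common subtype, 10 of a rare subtype.
y_pred = [1] * 86 + [0] * 4 + [1] * 6 + [0] * 4
subclasses = ["common"] * 90 + ["rare"] * 10

overall, per_subclass = recall_by_subclass(y_pred, subclasses)
for name, value in per_subclass.items():
    print(name, round(value, 3), "relative gap:", round(relative_gap(overall, value), 3))
print("overall", round(overall, 3))
```

In this toy setup the aggregate recall looks high because the common subtype dominates the average, while the rare subtype is missed far more often. That pattern is what the abstract calls hidden stratification.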

Keywords: Computing methodologies → Machine learning; convolutional neural networks; hidden stratification; machine learning.

Figures

**Figure 1:**
Performance of a ResNeXt-29, 8x64d on CIFAR-100 superclasses by (a) true (semantic) CIFAR-100 subclass and (b) random CIFAR-100 subclasses. Random subclasses were assigned by randomly permuting the subclass label assignments within each superclass. Most superclasses contain true subclasses where performance is far lower than that on the aggregate superclass. Intra-subclass performance variance on random subclasses is on average 66% lower than on the true (semantic) subclasses, indicating that the stratification observed in practice is substantially higher than would be expected from randomness alone.
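The random-subclass baseline described in this caption can be sketched for a single superclass as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the function names, the five subclass names, and the per-example correctness values are invented, and the baseline is averaged over repeated shuffles, which the caption does not specify.

```python
import random
from statistics import pvariance


def subclass_accuracy_variance(correct, subclass):
    """Variance of per-subclass accuracy within one superclass."""
    groups = {}
    for c, s in zip(correct, subclass):
        groups.setdefault(s, []).append(c)
    accuracies = [sum(v) / len(v) for v in groups.values()]
    return pvariance(accuracies)


def permuted_variance(correct, subclass, n_perm=200, seed=0):
    """Average variance after randomly permuting subclass assignments.

    Shuffling keeps subclass sizes fixed but breaks any link between
    a subclass label and the model's correctness on its examples.
    """
    rng = random.Random(seed)
    labels = list(subclass)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(labels)
        total += subclass_accuracy_variance(correct, labels)
    return total / n_perm


# Made-up toy superclass: five subclasses of 100 examples each,
# with the number of correctly classified examples given per subclass.
n_correct = {"a": 95, "b": 90, "c": 85, "d": 80, "e": 40}
correct, subclass = [], []
for name, k in n_correct.items():
    correct += [1] * k + [0] * (100 - k)
    subclass += [name] * 100

true_var = subclass_accuracy_variance(correct, subclass)
rand_var = permuted_variance(correct, subclass)
print("true subclass variance:  ", true_var)
print("random subclass variance:", rand_var)
print("relative reduction:      ", 1 - rand_var / true_var)
```

If the variance under random assignment is much smaller than under the true subclass labels, the performance spread across true subclasses is unlikely to be explained by chance alone.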

**Figure 2:**
ROC curves for subclasses of (a) the abnormal Adelaide Hip Fracture superclass, (b) the abnormal MURA superclass, and (c) the pneumothorax CXR14 superclass. For MURA and CXR14, all subclass AUCs are significantly different from that of the overall task (DeLong p<0.05). For hip fracture, the AUCs themselves are not statistically different via a two-sided test (DeLong p>0.05), but the sensitivities are statistically different (p<0.01) at the relevant operating point [15]; see Table 2 for details. For MURA, sensitivities at 0.50 specificity are 0.93 (All), 1.00 (Hardware), 0.89 (Fracture), and 0.80 (Degenerative). For CXR14, sensitivities at 0.50 specificity are 0.94 (All), 0.99 (Drain), and 0.85 (No Drain). For hip fracture, sensitivities at 0.50 specificity are 1.00 (All), 1.00 (Cervical), and 0.95 (Subtle).
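Sensitivity at a fixed specificity, as reported in this caption, can be computed per subclass along the following lines. This is a minimal sketch with assumptions: the threshold rule (the smallest observed negative score that reaches the target specificity), the function names, and all scores are made up for illustration. The subclass names are borrowed from the CXR14 example, and the DeLong test is not implemented here.

```python
import math


def threshold_at_specificity(neg_scores, target_spec):
    """Lowest threshold t such that at least target_spec of negatives score <= t.

    A case is called positive when its score is strictly greater than t.
    """
    ordered = sorted(neg_scores)
    k = math.ceil(target_spec * len(ordered))
    if k == 0:
        return float("-inf")
    return ordered[k - 1]


def sensitivity(pos_scores, threshold):
    """Fraction of positive cases scoring strictly above the threshold."""
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


# Made-up model scores, for illustration only.
neg_scores = [0.1, 0.2, 0.3, 0.4]
pos_scores_by_subclass = {
    "Drain": [0.9, 0.8, 0.7, 0.25],
    "No Drain": [0.6, 0.15, 0.3, 0.1],
}

t = threshold_at_specificity(neg_scores, 0.50)
all_pos = [s for scores in pos_scores_by_subclass.values() for s in scores]
sens_all = sensitivity(all_pos, t)
sens_by_subclass = {
    name: sensitivity(scores, t) for name, scores in pos_scores_by_subclass.items()
}
print("threshold:", t)
print("All:", sens_all)
for name, value in sens_by_subclass.items():
    print(name + ":", value)
```

The threshold is fixed once from the negative cases, so every subclass is evaluated at the same operating point and differences in sensitivity reflect the subclasses rather than different thresholds.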

References

1. Agniel Denis, Kohane Isaac S, and Weber Griffin M. 2018. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ 361 (April 2018), k1479.
2. Badgeley Marcus A, Zech John R, Oakden-Rayner Luke, Glicksberg Benjamin S, Liu Manway, Gale William, McConnell Michael V, Percha Bethany, Snyder Thomas M, and Dudley Joel T. 2019. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digit Med 2 (April 2019), 31.
3. Bien Nicholas, Rajpurkar Pranav, Ball Robyn L, Irvin Jeremy, Park Allison, Jones Erik, Bereket Michael, Patel Bhavik N, Yeom Kristen W, Shpanskaya Katie, Halabi Safwan, Zucker Evan, Fanton Gary, Amanatullah Derek F, Beaulieu Christopher F, Riley Geoffrey M, Stewart Russell J, Blankenberg Francis G, Larson David B, Jones Ricky H, Langlotz Curtis P, Ng Andrew Y, and Lungren Matthew P. 2018. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLoS Med 15, 11 (November 2018), e1002699.
4. Buda Mateusz, Maki Atsuto, and Mazurowski Maciej A. 2018. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw 106 (October 2018), 249–259.
5. Caliński Tadeusz and Harabasz Jerzy. 1974. A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods 3, 1 (1974), 1–27.

Grants and funding

U54 EB020405/EB/NIBIB NIH HHS/United States
