Proc ACM Conf Health Inference Learn. 2020 Apr:2020:151-159.
doi: 10.1145/3368555.3384468.

Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging

Luke Oakden-Rayner et al. Proc ACM Conf Health Inference Learn (2020). 2020 Apr.

Abstract

Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model may still consistently miss a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring hidden stratification effects, and characterize these effects both via synthetic experiments on the CIFAR-100 benchmark dataset and on multiple real-world medical imaging datasets. Using these measurement techniques, we find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we discuss the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.

Keywords: Computing methodologies → Machine learning; convolutional neural networks; hidden stratification; machine learning.

Figures

Figure 1:
Performance of a ResNeXt-29, 8x64d on CIFAR-100 superclasses by (a) true (semantic) CIFAR-100 subclass and (b) random CIFAR-100 subclasses. Random subclasses were assigned by randomly permuting the subclass label assignments within each superclass. Most superclasses contain true subclasses where performance is far lower than that on the aggregate superclass. Intra-subclass performance variance on random subclasses is on average 66% lower than on the true (semantic) subclasses, indicating that the stratification observed in practice is substantially higher than would be expected from randomness alone.
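The random-subclass baseline described in this caption can be sketched as follows. This is a minimal illustration on made-up data, not the authors' code: the subclass names, the per-subclass hit counts, and the helper names `subclass_accuracies` and `accuracy_variance` are all assumptions introduced here. The sketch compares the variance of per-subclass accuracy under the true labels with the variance after the subclass labels are randomly permuted within one superclass.

```python
import random
from statistics import pvariance


def subclass_accuracies(correct, subclass):
    """Return {subclass label: accuracy} for the examples of one superclass."""
    totals, hits = {}, {}
    for is_correct, label in zip(correct, subclass):
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + is_correct
    return {label: hits[label] / totals[label] for label in totals}


def accuracy_variance(correct, subclass):
    """Population variance of accuracy across the subclasses."""
    return pvariance(list(subclass_accuracies(correct, subclass).values()))


# Hypothetical superclass: five subclasses of 100 examples each, with an
# assumed number of correct predictions per subclass (toy numbers only).
hits_per_subclass = {"a": 95, "b": 90, "c": 90, "d": 85, "e": 50}
correct, true_labels = [], []
for label, n_hits in hits_per_subclass.items():
    correct += [1] * n_hits + [0] * (100 - n_hits)
    true_labels += [label] * 100

# Random subclasses: permute the subclass labels within the superclass,
# leaving the per-example correctness untouched.
random_labels = true_labels[:]
random.Random(0).shuffle(random_labels)

true_var = accuracy_variance(correct, true_labels)
random_var = accuracy_variance(correct, random_labels)
print(f"variance across true subclasses:   {true_var:.4f}")
print(f"variance across random subclasses: {random_var:.4f}")
```

A single permutation is used here for brevity; averaging the random-subclass variance over many permutations would give a more stable baseline.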
Figure 2:
ROC curves for subclasses of (a) the abnormal Adelaide Hip Fracture superclass, (b) the abnormal MURA superclass, and (c) the pneumothorax CXR14 superclass. For MURA and CXR14, all subclass AUCs are significantly different from that of the overall task (DeLong p<0.05). For hip fracture, the AUCs themselves are not statistically different via a two-sided test (DeLong p>0.05), but the sensitivities are statistically different (p<0.01) at the relevant operating point [15]; see Table 2 for details. For MURA, sensitivities at 0.50 specificity are 0.93 (All), 1.00 (Hardware), 0.89 (Fracture), and 0.80 (Degenerative). For CXR14, sensitivities at 0.50 specificity are 0.94 (All), 0.99 (Drain), and 0.85 (No Drain). For hip fracture, sensitivities at 0.50 specificity are 1.00 (All), 1.00 (Cervical), and 0.95 (Subtle).
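The "sensitivity at 0.50 specificity" figures quoted in this caption can be computed along the following lines. This is a hedged sketch on toy scores, not the authors' evaluation code: the scores, the `drain` and `no_drain` labels, and the helper names are assumptions made for illustration, and the DeLong significance test is not implemented. A threshold is chosen on the negative cases so that specificity reaches the target, and sensitivity is then reported for each subclass of positive cases.

```python
import math


def specificity(neg_scores, threshold):
    """Fraction of negatives scored below the threshold (true negatives)."""
    return sum(s < threshold for s in neg_scores) / len(neg_scores)


def threshold_at_specificity(neg_scores, target):
    """Smallest candidate threshold whose specificity is at least the target."""
    candidates = sorted(set(neg_scores)) + [math.inf]
    for t in candidates:
        if specificity(neg_scores, t) >= target:
            return t


def sensitivity_by_subclass(pos_scores, pos_subclass, threshold):
    """Return {subclass: fraction of its positives scored at or above threshold}."""
    totals, hits = {}, {}
    for score, label in zip(pos_scores, pos_subclass):
        totals[label] = totals.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + (score >= threshold)
    return {label: hits[label] / totals[label] for label in totals}


# Toy model scores (assumed values, not taken from the paper).
neg_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]
pos_scores = [0.9, 0.8, 0.95, 0.7, 0.3, 0.5, 0.6, 0.35]
pos_subclass = ["drain"] * 4 + ["no_drain"] * 4

t = threshold_at_specificity(neg_scores, target=0.50)
per_subclass = sensitivity_by_subclass(pos_scores, pos_subclass, t)
overall = sum(s >= t for s in pos_scores) / len(pos_scores)
print(f"threshold={t}, overall sensitivity={overall}, by subclass={per_subclass}")
```

On this toy data the overall sensitivity hides a gap between the two subclasses, which is the pattern the caption reports for the real datasets.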

References

    1. Agniel Denis, Kohane Isaac S, and Weber Griffin M. 2018. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ 361 (April 2018), k1479.
    2. Badgeley Marcus A, Zech John R, Oakden-Rayner Luke, Glicksberg Benjamin S, Liu Manway, Gale William, McConnell Michael V, Percha Bethany, Snyder Thomas M, and Dudley Joel T. 2019. Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digit Med 2 (April 2019), 31.
    3. Bien Nicholas, Rajpurkar Pranav, Ball Robyn L, Irvin Jeremy, Park Allison, Jones Erik, Bereket Michael, Patel Bhavik N, Yeom Kristen W, Shpanskaya Katie, Halabi Safwan, Zucker Evan, Fanton Gary, Amanatullah Derek F, Beaulieu Christopher F, Riley Geoffrey M, Stewart Russell J, Blankenberg Francis G, Larson David B, Jones Ricky H, Langlotz Curtis P, Ng Andrew Y, and Lungren Matthew P. 2018. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLoS Med 15, 11 (November 2018), e1002699.
    4. Buda Mateusz, Maki Atsuto, and Mazurowski Maciej A. 2018. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw 106 (October 2018), 249–259.
    5. Caliński Tadeusz and Harabasz Jerzy. 1974. A dendrite method for cluster analysis. Communications in Statistics-Theory and Methods 3, 1 (1974), 1–27.