Nat Commun. 2022 Apr 6;13(1):1867. doi: 10.1038/s41467-022-29437-8.

Accurate auto-labeling of chest X-ray images based on quantitative similarity to an explainable AI model


Doyun Kim et al. Nat Commun. 2022.

Abstract

The inability to accurately and efficiently label large, open-access medical imaging datasets limits the widespread implementation of artificial intelligence models in healthcare. There have been few attempts, however, to automate the annotation of such public databases; one approach, for example, focused on labor-intensive, manual labeling of subsets of these datasets to be used to train new models. In this study, we describe a method for standardized, automated labeling based on similarity to a previously validated, explainable AI (xAI) model-derived atlas, for which the user can specify a quantitative threshold for a desired level of accuracy (the probability-of-similarity, pSim metric). We show that our xAI model, by calculating the pSim values for each clinical output label based on comparison to its training-set-derived reference atlas, can automatically label the external datasets to a user-selected, high level of accuracy, equaling or exceeding that of human experts. We additionally show that, by fine-tuning the original model using the automatically labeled exams for retraining, performance can be preserved or improved, resulting in a highly accurate, more generalized model.


Conflict of interest statement

M.H.L. is a consultant for GE Healthcare and for the Takeda, Roche, and Seagen pharmaceutical companies, and has received institutional research support from Siemens Healthcare. B.P.L. and J.B.A. receive royalties from Elsevier, Inc. as an associate academic textbook editor and author. S.D. is a consultant for Doai and has received research support from Tplus and Medibloc. M.K.K. has received institutional research support from Siemens Healthineers, Coreline Inc., and Riverain Tech Inc. J.M.C. was partially supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI19C1057). The remaining authors declare no competing interests.

Figures

Fig. 1. System overview.
Standardized, automated labeling method, based on similarity to a previously validated five-label chest X-ray (CXR) detection explainable AI (xAI) model, using an xAI model-derived-atlas-based approach. a Our quantitative model-derived atlas-based explainable AI system calculates a probability-of-similarity (pSim) value for automated labeling, based on the harmonic mean between the patch similarity and the confidence. The resulting pSim metric can be applied to a “mode selection” algorithm, to either label the external input images to a selected threshold-of-confidence, or alert the user that the pSim value falls below this selected threshold. b The model-derived atlas-based method calculates patch similarity and confidence, based respectively on class activation mapping (CAM) and the predicted probability from the model, for each clinical output label. c The harmonic mean between the patch similarity and confidence is then used to calculate a pSim for each clinical output label in mode selection.
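For concreteness, the pSim computation and mode-selection step described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the per-label patch similarity and confidence have already been computed and scaled to [0, 1], and the function and variable names are hypothetical.

```python
import numpy as np

def psim(patch_similarity: np.ndarray, confidence: np.ndarray) -> np.ndarray:
    """Probability-of-similarity: the harmonic mean of the CAM-based patch
    similarity (vs. the model-derived atlas) and the model's predicted
    confidence; both inputs are assumed to lie in [0, 1]."""
    eps = 1e-12  # guards against 0/0 when both terms are zero
    return 2.0 * patch_similarity * confidence / (patch_similarity + confidence + eps)

def mode_selection(psim_values: np.ndarray, threshold: float) -> np.ndarray:
    """Auto-label an exam only when pSim clears the user-selected
    threshold-of-confidence; otherwise alert the user for manual review."""
    return np.where(psim_values >= threshold, "auto-label", "alert-user")
```

For example, with patch similarities [0.9, 0.2] and confidences [0.8, 0.9], the pSim values are roughly [0.85, 0.33], so at a threshold of 0.6 the first exam would be auto-labeled and the second flagged for review.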
Fig. 2. Automated-labeling model performance applied to three open-source CXR datasets, compared to consensus ground truth of seven expert radiologists, for the cardiomegaly & pleural effusion image labels.
We applied our xAI CXR auto-labeling model to three large open-source datasets: CheXpert, MIMIC, and NIH. For two of the five clinical output labels (cardiomegaly & pleural effusion), we randomly selected a subset of “positive” and “negative” cases as determined by the model, distributed equally in each of ten pSim value ranges (0–0.1, 0.1–0.2, 0.2–0.3, …, 0.9–1.0), for expert review. In a, the positive (light red) and negative (light blue) ratings for each of the seven individual readers (columns A–G) are displayed graphically, with the consensus ground truth (GT, determined by majority) shown in the last column (bold red or bold blue). In b, the positive predictive values (PPV = [true positive by GT]/[total positive by model], solid red triangles, y-axis left) and negative predictive values (NPV = [true negative by GT]/[total negative by model], solid blue circles, y-axis left) of the model’s ratings are graphed versus the pSim threshold value that was applied by the model (x-axis). Also displayed in b (y-axis right) are the model’s true positive capture rate (TPCR, dotted red triangles) and true negative capture rate (TNCR, dotted blue circles), defined respectively as TPCR = [true positive (TP) by GT]/[total positive by GT (number bold red from a)] and TNCR = [true negative (TN) by GT]/[total negative by GT (number bold blue from a)]. In c (lower left) and d (lower right), respectively, the number of false positive (FP by GT) and false negative (FN by GT) cases rated by the model at each pSim threshold value (x-axis) are shown stratified by dataset (CheXpert, MIMIC, or NIH; total number of cases positive or negative by the model in parentheses), with the optimal, lowest pSim threshold achieving 100% PPV or NPV indicated (bold green triangles).
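The four metrics defined in this caption are simple confusion-matrix ratios. The sketch below is a hypothetical helper, not the authors' code; it assumes boolean arrays giving the model's ratings at a fixed pSim threshold and the consensus ground truth, and that each denominator is nonzero.

```python
import numpy as np

def labeling_metrics(model_pos: np.ndarray, gt_pos: np.ndarray) -> dict:
    """PPV, NPV, TPCR, and TNCR as defined in the caption above, computed
    from the model's boolean ratings and the consensus ground truth (GT)."""
    tp = np.sum(model_pos & gt_pos)    # true positives
    fp = np.sum(model_pos & ~gt_pos)   # false positives
    tn = np.sum(~model_pos & ~gt_pos)  # true negatives
    fn = np.sum(~model_pos & gt_pos)   # false negatives
    return {
        "PPV":  tp / (tp + fp),   # TP / total positive by model
        "NPV":  tn / (tn + fn),   # TN / total negative by model
        "TPCR": tp / (tp + fn),   # TP / total positive by GT
        "TNCR": tn / (tn + fp),   # TN / total negative by GT
    }
```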
Fig. 3. Automated-labeling model performance applied to three open-source CXR datasets, compared to consensus ground truth of seven expert radiologists, applied to the pulmonary edema and pneumonia labels.
Please refer to Fig. 2 for the a–d captions.
Fig. 4. Automated-labeling model performance applied to three open-source CXR datasets, compared to consensus ground truth of seven expert radiologists, applied to the atelectasis label.
Please refer to Fig. 2 for the a–d captions.
Fig. 5. AUROC performance of automated-labeling model at two different pSim threshold values, compared to sensitivity, specificity of individual expert radiologists, and pooled public labels from three open-source CXR datasets.
AUROC performance of our xAI CXR auto-labeling model applied to the CheXpert, MIMIC, and NIH open-source datasets is shown for each of the five clinical output labels: a cardiomegaly, b pleural effusion, c pulmonary edema, d pneumonia, and e atelectasis. Comparison is to the performance of the individual expert radiologists (A–G, red circles), as well as to the performance of the pooled external annotations (blue squares, n = number of available labeled external cases per clinical output label). ROC curves (y-axis sensitivity, x-axis 1-specificity) are shown for both the baseline pSim = 0 threshold (magnified box) and the optimal pSim threshold (i.e., the lowest pSim threshold achieving 100% accuracy, as per Figs. 2–4c and d).
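As an illustration of how ROC curves and AUROC values of this kind are computed, the following minimal sketch uses scikit-learn; the gt and scores arrays are hypothetical stand-ins for the consensus ground truth and the model's per-exam outputs for one clinical output label.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical stand-ins: consensus GT (1 = positive) and the model's
# per-exam scores for one clinical output label.
gt = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.91, 0.12, 0.78, 0.65, 0.30, 0.08, 0.88, 0.41])

fpr, tpr, _ = roc_curve(gt, scores)  # ROC: y = sensitivity, x = 1 - specificity
print(f"AUROC = {roc_auc_score(gt, scores):.3f}")
```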
Fig. 6. Comparison of labeling efficiency/confidence metrics for each of the 5 clinical output labels.
For each of the five auto-labeled clinical output labels—cardiomegaly (blue), pleural effusion (orange), atelectasis (gray), pulmonary edema (green), and pneumonia (yellow)—we compared: (i) the percent of positively auto-labeled CXRs “captured” from the three pooled, full public datasets (i.e., “Pooled Capture%”, from Supplementary Table 3, C); (ii) the percent of cases with complete agreement between the model and all seven expert readers (i.e., “Full Agree%”, from Supplementary Fig. 2); (iii) the lowest pSim value such that PPV = 1 (graphed as “1-pSim”, from Figs. 2–4, c); and (iv) the lowest pSim value such that NPV = 1 (graphed as “1-pSim”, from Figs. 2–4, d). Clinical output labels with higher y-axis values (e.g., cardiomegaly, pleural effusion) correspond to those with greater model auto-labeling efficiency/confidence; clinical output labels with lower y-axis values (e.g., pneumonia, pulmonary edema) correspond to those with lesser model auto-labeling efficiency/confidence. Of note, in the graph for atelectasis, “1-pSim@PPV1” is higher than “1-pSim@NPV1”, which can be interpreted as greater confidence that the model is correct in “ruling in” the clinical output label (i.e., correctly auto-labeling true positives) than in “ruling out” the clinical output label (i.e., correctly auto-labeling true negatives); this relationship is reversed for the other four clinical output labels (e.g., greater confidence that the model can correctly “rule out” than “rule in” pneumonia or pulmonary edema).
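The lowest pSim value such that PPV = 1 in (iii) can be found with a simple threshold sweep; the sketch below is a hypothetical reconstruction of that search, not the authors' code, and assumes per-exam pSim values alongside boolean arrays of the model's ratings and the consensus GT.

```python
import numpy as np

def lowest_psim_at_perfect_ppv(psim_values: np.ndarray,
                               model_pos: np.ndarray,
                               gt_pos: np.ndarray):
    """Sweep candidate pSim thresholds upward and return the lowest one at
    which every retained model-positive case is a GT true positive
    (PPV = 1). Swapping the positive and negative roles gives the
    corresponding NPV = 1 threshold in (iv)."""
    for tau in np.sort(np.unique(psim_values)):
        kept = model_pos & (psim_values >= tau)
        if kept.any() and np.all(gt_pos[kept]):
            return tau
    return None  # no threshold leaves a pure set of model positives
```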
Fig. 7. Pairwise kappa statistics between the seven expert radiologists, for each of the five clinical output labels.
For each of the five auto-labeled clinical output labels—a cardiomegaly, b pleural effusion, c pulmonary edema, d pneumonia, and e atelectasis—the pairwise kappa statistics estimating inter-observer variability are shown in the respective color-coded matrices.
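A pairwise kappa matrix of this kind can be reproduced with scikit-learn's Cohen's kappa. The sketch below uses randomly generated stand-in ratings for the seven readers; only the matrix construction reflects the figure.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)
ratings = rng.integers(0, 2, size=(7, 100))  # hypothetical: 7 readers x 100 cases

kappa = np.eye(7)  # diagonal = 1: each reader agrees perfectly with themselves
for i, j in combinations(range(7), 2):
    kappa[i, j] = kappa[j, i] = cohen_kappa_score(ratings[i], ratings[j])
print(np.round(kappa, 2))
```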
Fig. 8. Performance comparison of Confidence Probability, Patch Similarity, and pSim in assigning true-positive model output labels for cardiomegaly, pleural effusion, pulmonary edema, pneumonia, and atelectasis.
We compared the true positive capture rate (TPCR) performance for each of the five clinical output labels, using confidence probability alone (reflecting the global probability distribution of the output labels), patch similarity alone (reflecting the focal spatial localization of the output labels), and pSim (reflecting the harmonic mean between the confidence probability and patch similarity, as per Fig. 1). These results are noteworthy in that the two model output labels that reflect high inter-rater agreement of imaging findings—a cardiomegaly and b pleural effusion, as per Fig. 7—show good agreement between the three confidence-level metrics, with high TPCRs for each. For the two output labels that show lower inter-rater agreement per Fig. 7—c pulmonary edema and d pneumonia—pSim performance significantly exceeds that of patch similarity for both, and that of confidence probability for pneumonia but not pulmonary edema. This difference is likely attributable to the fact that patch similarity is more sensitive for the detection of focal, regional imaging findings (e.g., as seen with the clinical diagnosis of pneumonia), whereas confidence probability is more sensitive for the detection of global findings (e.g., as seen with the clinical diagnosis of pulmonary edema). The results for e atelectasis, typically a more focal than global finding on CXR, may be similarly explained.
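To make the three-way comparison concrete, the sketch below computes TPCR at a fixed threshold for confidence probability, patch similarity, and their harmonic mean (pSim, as in Fig. 1). All arrays are synthetic stand-ins, and the threshold of 0.5 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
gt_pos = rng.random(n) < 0.3                                   # hypothetical GT
conf = np.clip(gt_pos * 0.6 + rng.random(n) * 0.4, 0.0, 1.0)   # confidence probability
patch = np.clip(gt_pos * 0.5 + rng.random(n) * 0.5, 0.0, 1.0)  # patch similarity
psim = 2 * conf * patch / (conf + patch + 1e-12)               # harmonic mean (Fig. 1)

def tpcr(scores, tau=0.5):
    """Share of GT-positive exams captured when auto-labeling at threshold tau."""
    return np.sum((scores >= tau) & gt_pos) / np.sum(gt_pos)

for name, s in [("confidence", conf), ("patch similarity", patch), ("pSim", psim)]:
    print(f"{name}: TPCR = {tpcr(s):.2f}")
```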

References

    1. Lee, H. et al. An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets. Nat. Biomed. Eng. 3, 173–182 (2019). doi: 10.1038/s41551-018-0324-9.
    2. Irvin, J. et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence 33, 590–597 (2019).
    3. Johnson, A. et al. MIMIC-CXR-JPG - chest radiographs with structured labels (version 2.0.0). PhysioNet (2019). doi: 10.13026/8360-t248.
    4. Wang, X. et al. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2097–2106 (2017).
    5. Bustos, A., Pertusa, A., Salinas, J. M. & de la Iglesia-Vayá, M. PadChest: a large chest x-ray image dataset with multi-label annotated reports. Med. Image Anal. 66, 101797 (2020). doi: 10.1016/j.media.2020.101797.
