[Preprint]. 2023 Jun 12:2023.06.07.23291119.
doi: 10.1101/2023.06.07.23291119.

Fostering transparent medical image AI via an image-text foundation model grounded in medical literature


Chanwoo Kim et al. medRxiv.


Abstract

Building trustworthy and transparent image-based medical AI systems requires the ability to interrogate data and models at all stages of the development pipeline, from training models to post-deployment monitoring. Ideally, the data and associated AI systems could be described using terms already familiar to physicians, but this requires medical datasets densely annotated with semantically meaningful concepts. Here, we present a foundation model approach, named MONET (Medical cONcept rETriever), which learns how to connect medical images with text and generates dense concept annotations to enable tasks in AI transparency from model auditing to model interpretation. Dermatology provides a demanding use case for the versatility of MONET, due to the heterogeneity in diseases, skin tones, and imaging modalities. We trained MONET on 105,550 dermatological images paired with natural-language descriptions drawn from a large collection of medical literature. MONET can accurately annotate concepts across dermatology images, as verified by board-certified dermatologists, outperforming supervised models built on previously concept-annotated dermatology datasets. We demonstrate how MONET enables AI transparency across the entire AI development pipeline, from dataset auditing to model auditing to building inherently interpretable models.


Conflict of interest statement

Competing interests R.D. reports fees from L’Oreal, Frazier Healthcare Partners, Pfizer, DWA, and VisualDx for consulting; stock options from MDAcne and Revea for advisory board; and research funding from UCB.

Figures

Fig. 1 | Overview of MONET framework and its usage examples.
(A) Training procedure. MONET is trained using contrastive learning on an extensive set of dermatology image and text pairs collected from PubMed articles and medical textbooks. During the training process, the paired image and text are forced to be close in the joint representation space, while those from different pairs are forced to be far apart. (B) Automatic concept generation. MONET can map medical concepts and images onto a joint representation space, allowing it to determine the degree to which a concept is present in an image for any given concept by measuring the distance between the image and concept text prompts in the representation space. Its concept generation capability enables various concept-driven analyses at multiple stages of the medical AI pipeline. (C) Concept-level data auditing. MONET’s automatic concept generation capability makes it possible to explain the distinguishing features between two sets of data in the language of human-interpretable concepts. This approach facilitates the auditing of large-scale datasets with ease. (D) Concept-level model auditing. MONET can be used to identify which input characteristic leads to the errors of medical AI. (E) Developing inherently interpretable models. MONET can be used to develop inherently interpretable medical AI models that operate on human-interpretable concepts aligning with physicians’ expectations. These models allow physicians to easily decipher the factors influencing the models’ decisions, ensuring high transparency.
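The concept scoring described in (B) can be sketched as a cosine-similarity lookup in a CLIP-style joint space. This is a hypothetical toy illustration, not MONET's actual API: the embeddings, dimensions, and function name are stand-ins.

```python
import numpy as np

def concept_presence_scores(image_embs, concept_emb):
    """Cosine similarity between each image embedding and a concept
    text-prompt embedding in an assumed CLIP-style joint space."""
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    concept_emb = concept_emb / np.linalg.norm(concept_emb)
    return image_embs @ concept_emb

# Toy stand-ins: 3 "image" embeddings and 1 concept-prompt embedding.
rng = np.random.default_rng(0)
images = rng.normal(size=(3, 4))
concept = rng.normal(size=4)
scores = concept_presence_scores(images, concept)
ranking = np.argsort(-scores)  # images ordered by concept presence
```

In this sketch, a higher score means the image lies closer to the concept prompt in the joint representation space, which is how the figure describes measuring the degree to which a concept is present.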
Fig. 2 | Images with high concept presence scores calculated using MONET.
The concept presence score represents the degree to which a concept is present in an image. Each row displays the top 10 images for each concept. (A) Clinical images from the Fitzpatrick17k and DDI datasets. We exclude images inappropriate for public display due to the inclusion of sensitive body parts; for completeness, we list the filenames of the excluded images in Supplementary Table 1. (B) Dermoscopy images from the ISIC dataset.
Fig. 3 | Concept-level data auditing.
(A) We perform concept differential analysis between malignant and benign images in the ISIC dataset. We show the top 10 concepts with positive values and the top 5 concepts with negative values. A positive value means the concept was more present in the malignant images than in the benign images, and vice versa. (B) We perform concept differential analysis between malignant and benign images per data source in the ISIC dataset to identify data-source-specific trends. The purple bar represents the output from the Medical University of Vienna, and the green bar represents the output from the Hospital Clinic de Barcelona. We show the top 15 concepts based on their absolute differences between the two cohorts. (C) Examples of red images in each cohort. We display 10 randomly selected images from the top 100 images in each cohort with the highest concept presence scores for redness. (D) Precision-recall curve for images in each cohort. The images in each cohort are sorted by their concept presence scores for redness and then compared to their malignancy labels. Precision is the proportion of malignant images among all images above a given threshold, while recall is the proportion of malignant images above the threshold out of all malignant images. The top 500 and top 1,000 red images from the Hospital Clinic de Barcelona still contain more malignant than benign samples.
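The concept differential analysis in (A)-(B) and the precision-recall sweep in (D) can be sketched as follows. This is an assumed formulation with toy data; the function names, cohort sizes, and score distributions are illustrative, not the paper's implementation.

```python
import numpy as np

def concept_differential(scores_a, scores_b):
    """Mean concept presence in set A minus set B; positive values
    mean the concept is more present in set A (e.g. malignant)."""
    return scores_a.mean(axis=0) - scores_b.mean(axis=0)

def precision_recall(scores, is_malignant):
    """Sweep a threshold down the score-sorted images, as in (D):
    at each cutoff, precision is the malignant fraction above it,
    recall is the fraction of all malignant images above it."""
    order = np.argsort(-scores)
    labels = is_malignant[order].astype(float)
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    return precision, recall

# Toy concept-presence scores for two cohorts of 50 images, 5 concepts.
rng = np.random.default_rng(1)
malignant = rng.normal(loc=0.3, size=(50, 5))
benign = rng.normal(loc=0.0, size=(50, 5))
diff = concept_differential(malignant, benign)

# Precision-recall for one toy "redness" concept (column 0).
redness = np.concatenate([malignant[:, 0], benign[:, 0]])
labels = np.concatenate([np.ones(50, bool), np.zeros(50, bool)])
precision, recall = precision_recall(redness, labels)
```

Sorting a cohort by a single concept's presence score and comparing against malignancy labels, as in (D), reduces to exactly this kind of threshold sweep.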
Fig. 4 | Concept-level model auditing.
(A) We perform a benchmark analysis to assess how well "model auditing with MONET" (MA-MONET) can identify the semantically meaningful concepts that lead to model error. To this end, we generate settings where the ground truth (i.e., the concepts that lead to model errors) is known: we create training and test datasets with a spurious correlation, then use MA-MONET to identify which concepts lead to model error for an AI model trained on this confounded dataset. MA-MONET returns a ranked list of concepts that explain model errors. (B) The frequency with which MA-MONET recovers the known spurious correlation. (D)-(E) Each row displays one of the top 5 clusters, sorted by error rate. For each cluster, we show the misclassified images and the corresponding concepts associated with the errors. The true and predicted labels for each image are represented by the color of the upper-left and lower-right triangles in the small box, respectively. The numbers at the top right compare the number of malignant and benign samples for the true and predicted labels. The 5 misclassified images shown for each cluster are selected based on the average concept presence of the identified concepts.
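A heavily simplified sketch of the concept-ranking idea behind this kind of model auditing: compare concept presence between misclassified and correctly classified images. Note this is an assumption-laden reduction; per the caption, MA-MONET first clusters the errors and then ranks concepts per cluster, and all names and toy data below are hypothetical.

```python
import numpy as np

def concepts_explaining_errors(concept_scores, correct):
    """Rank concepts by how much more present they are among a model's
    misclassified images than among its correct predictions (simplified
    sketch; the full pipeline clusters errors before ranking)."""
    diff = (concept_scores[~correct].mean(axis=0)
            - concept_scores[correct].mean(axis=0))
    return np.argsort(-diff)  # concept indices, most error-associated first

# Toy setup: 100 images, 6 concepts; the first 20 are misclassified
# and are enriched in concept 2 (a planted spurious correlation).
rng = np.random.default_rng(2)
concept_scores = rng.normal(size=(100, 6))
correct = np.ones(100, dtype=bool)
correct[:20] = False
concept_scores[:20, 2] += 2.0
ranked = concepts_explaining_errors(concept_scores, correct)
```

With the planted signal, the error-associated concept surfaces at the top of the ranked list, mirroring how the benchmark in (A)-(B) checks whether the known spurious correlation is recovered.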
Fig. 5 | Concept bottleneck model.
(A) Concept bottleneck model built using concepts generated by MONET (blue). The model first generates concepts using MONET and then predicts disease labels by combining them via a linear model. Concept bottleneck model built using concepts manually labeled by experts (green). The model uses manually annotated concept labels to predict disease labels with a linear model; manual annotation takes substantially longer than concept generation with MONET. (B)-(C) Performance of a malignancy prediction model trained using manual labels, as a function of the number of concepts and the number of expert-labeled samples. (D)-(E) Performance of a melanoma prediction model trained using manual labels, as a function of the number of concepts and the number of expert-labeled samples. (B)-(E) MONET+CBM is shown as a cross mark because it can utilize all concepts without expert annotation. The shaded area represents the 95% confidence interval. (F) Performance comparison of malignancy prediction models. (G) Performance comparison of melanoma prediction models. (F)-(G) Unlike (B)-(E), MONET+CBM uses task-relevant concepts curated by dermatologists. Each dot represents the AUC for an individual run with a different train-test split. The box represents the interquartile range, with its lower and upper bounds corresponding to the first and third quartiles, respectively. p values from one-sided paired t-tests comparing MONET+CBM with the other methods are indicated: *<0.05, **<0.01, ***<0.001; n=20 runs per method. (H) Coefficients of the linear model in MONET+CBM for malignancy prediction. (I) Coefficients of the linear model in MONET+CBM for melanoma prediction. (H)-(I) Error bars represent the 95% confidence interval.
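A minimal sketch of the concept-bottleneck head in (A): concept presence scores are the only input, and a linear classifier maps them to a disease label, so each learned weight is directly attributable to one named concept, as plotted in (H)-(I). The data, concept indices, and the ridge-regularized least-squares fit are toy assumptions, not the authors' actual linear model.

```python
import numpy as np

# Toy bottleneck input: 200 images scored on 8 concepts.
rng = np.random.default_rng(0)
n_images, n_concepts = 200, 8
concept_scores = rng.normal(size=(n_images, n_concepts))

# Hypothetical labels driven by two concepts (indices illustrative:
# concept 0 raises the "malignant" score, concept 3 lowers it).
logits = 2.0 * concept_scores[:, 0] - 1.5 * concept_scores[:, 3]
y = np.where(logits + 0.3 * rng.normal(size=n_images) > 0, 1.0, -1.0)

# Linear head via ridge-regularized least squares: one weight per
# concept, so the model's decision is a transparent weighted sum.
A = concept_scores
w = np.linalg.solve(A.T @ A + 1e-2 * np.eye(n_concepts), A.T @ y)
predictions = np.sign(A @ w)
```

Because the bottleneck forces every prediction through the concept layer, inspecting `w` tells a physician which concepts push a case toward each label, which is the transparency property the figure highlights.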

