medRxiv [Preprint]. 2023 May 16:2023.05.12.23289878.
doi: 10.1101/2023.05.12.23289878.

Dissection of medical AI reasoning processes via physician and generative-AI collaboration

Alex J DeGrave et al. medRxiv.

Abstract

Despite the proliferation and clinical deployment of artificial intelligence (AI)-based medical software devices, most remain black boxes that are uninterpretable to key stakeholders, including patients, physicians, and even the developers of the devices. Here, we present a general model-auditing framework that combines insights from medical experts with a highly expressive form of explainable AI that leverages generative models to understand the reasoning processes of AI devices. We then apply this framework to generate the first thorough, medically interpretable picture of the reasoning processes of machine-learning-based medical image AI. In our synergistic framework, a generative model first renders "counterfactual" medical images, which in essence visually represent the reasoning process of a medical AI device, and physicians then translate these counterfactual images into medically meaningful features. As our use case, we audit five high-profile AI devices in dermatology, an area of particular interest since dermatology AI devices are beginning to achieve deployment globally. We reveal how dermatology AI devices rely both on features used by human dermatologists, such as lesional pigmentation patterns, and on multiple previously unreported, potentially undesirable features, such as background skin texture and image color balance. Our study also sets a precedent for the rigorous application of explainable AI to understand AI in any specialized domain and provides a means for practitioners, clinicians, and regulators to uncloak AI's powerful but previously enigmatic reasoning processes in a medically understandable way.
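To make the shape of the framework concrete, here is a minimal pseudocode sketch of the audit loop the abstract describes. Every name in it (generate_counterfactual, device, experts, annotate) is a hypothetical stand-in for illustration, not the authors' actual code or API:

```python
# Hedged sketch of the joint expert-XAI audit loop; all names are assumed.
def audit(device, images, generate_counterfactual, experts):
    findings = []
    for img in images:
        # Render a pair of images that the device scores on opposite
        # sides of its benign/malignant decision threshold.
        benign_cf = generate_counterfactual(img, device, target="benign")
        malignant_cf = generate_counterfactual(img, device, target="malignant")
        # Physicians translate the visual differences between the pair into
        # named, medically meaningful attributes (e.g., pigmentation).
        for expert in experts:
            findings.append(expert.annotate(benign_cf, malignant_cf))
    return findings
```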

Conflict of interest statement

R.D. reports fees from L’Oreal, Frazier Healthcare Partners, Pfizer, DWA, and VisualDx for consulting; stock options from MDAcne and Revea for advisory board; and research funding from UCB.

Figures

Fig. 1 | Overview of joint expert-XAI auditing procedure and audited AI devices.
a, Our auditing procedure unites explainable AI with analysis by human experts to understand medical AI devices. Specifically, we leverage generative models to create counterfactual images that alter the prediction of a medical AI device; analysis of the counterfactuals by human experts (dermatologists) reveals the medical AI device’s reasoning processes. We perform the analysis on numerous images from each of multiple datasets, gathering insights from two experts, for each of five different dermatology AI devices. b, Key details of the dermatology AI devices audited in this study. c, Performance of the dermatology AI devices on three datasets, including a dataset (DDI) external to the training data of every device. We examine the area under the receiver operating characteristic curve (ROC-AUC) to focus on each model’s internal reasoning processes rather than on the authors’ original choices of model calibration. The Asan, Atlas, and Hallym datasets are described in ref.; MED-NODE is described in ref.; Edinburgh is available at https://licensing.edinburgh-innovations.ed.ac.uk/product/dermofit-image-library. *ROC-AUC < 0.5 (i.e., worse than random performance).
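For readers reproducing the threshold-free evaluation in panel c, ROC-AUC can be computed without reference to any device's decision threshold. In this minimal sketch, device, images, and labels are hypothetical stand-ins rather than the authors' code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_device(device, images, labels):
    """Threshold-free evaluation: ROC-AUC ignores each device's calibration."""
    scores = np.array([device(img) for img in images])  # malignancy scores
    # 0.5 = random performance; values below 0.5 are worse than random.
    return roc_auc_score(labels, scores)
```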
Fig. 2 | Joint expert-XAI auditing procedure reveals reasoning processes of dermatology AI devices.
a, Given a reference image and an AI device to investigate, our generative model produces “benign” and “malignant” counterfactuals, which resemble the reference image but differ in one or more attributes (e.g., pigmentation, solid arrows; dots on the background skin, open arrows). When evaluated by the AI device, the counterfactuals’ outputs lie on opposite sides of the decision threshold. Higher values indicate a greater likelihood of malignancy, as predicted by an AI device (Scanoma). b, To obtain robust conclusions, dermatology experts evaluate numerous counterfactuals after pre-screening and randomization of the images. c, Attributes identified by our joint expert-XAI auditing procedure as key influences on the output of dermatology AI devices. For each attribute/device pair, we compute the proportion of counterfactual pairs in which experts noted that the attribute differs; we display the global top-10 attributes as determined by the lowest rank-sum over all AI devices. Based on expert evaluation of whether the attribute was present to a greater extent in the malignant or benign counterfactual of each pair, we determine whether that attribute was “predominant” in benign or malignant counterfactuals, i.e., present to a greater extent in benign (malignant) counterfactuals in at least twice as many images as in malignant (benign) counterfactuals. The size of each square is then determined by the number of counterfactual pairs with a difference noted in the predominant direction. For comparison, we specify how human dermatologists use each attribute (“Literature”), based on our review of the literature combined with expert opinion from two board-certified dermatologists; see Discussion for additional information. Bar charts indicate Cohen’s κ values for agreement between each expert and the AI device, where each is asked which image in each counterfactual pair appears more likely to be malignant. “L”, lesion; “B”, background. d, Examples of counterfactuals that differ in each of the top ten attributes identified in the ISIC data; the attribute is present to a greater extent in the right image of each pair. For conciseness, some attribute names were shortened; refer to Supplementary Table 1 for full names. Images adapted with permission from ref. Combalia et al., ref. Tschandl et al., and ref. Codella et al.
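The caption's "predominant" rule and the expert-device agreement statistic both reduce to short computations. The sketch below assumes simple per-attribute counts and per-pair choice labels as inputs; these names and the data layout are illustrative assumptions, not the authors' code:

```python
from sklearn.metrics import cohen_kappa_score

def predominant_direction(n_benign, n_malignant):
    """Caption's rule: an attribute is 'predominant' in one class when experts
    noted it in at least twice as many pairs of that class as of the other.
    Returns the direction and the count that sets the square's size."""
    if n_benign >= 2 * n_malignant and n_benign > 0:
        return "benign", n_benign
    if n_malignant >= 2 * n_benign and n_malignant > 0:
        return "malignant", n_malignant
    return None, 0

def expert_device_agreement(expert_choices, device_choices):
    """Cohen's kappa between an expert's and a device's per-pair choices of
    which counterfactual image appears more likely to be malignant."""
    return cohen_kappa_score(expert_choices, device_choices)
```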
Fig. 3 | Experimental validation of findings from expert analysis of counterfactual images.
a, Frequency with which experts noted that either the benign or malignant image in a pair of counterfactuals displayed a pinker background; this view details our observations from the ISIC dataset summarized in Fig. 2c, in the row “B: pinker”. The vertical axis is normalized relative to the maximum observed frequency, that is, 42% of counterfactual pairs from SIIM-ISIC. b, Experimental setup used to verify the importance of a pink tint to the AI devices’ predictions. We programmatically color-shifted each image in the ISIC dataset (n = 20260) by modifying its chromaticity coordinates in the CIELUV color space (see Methods), then compared each AI device’s predictions between the original and color-shifted images. c, Sensitivity of each AI device to programmatic color shifts, mirroring observations from our counterfactual experiments regarding the effect of pinker tints on the AI devices’ predictions. The vertical axis is normalized relative to the maximum change in AI device output, i.e., a decrease of 0.17 with DeepDerm. Vertical dashed lines indicate the mean change in chromaticity (color) among counterfactual pairs annotated as differing in their pink tone. Example color-shifted images (below color bar) display the extent of the color shift; the reference image, adapted with permission from the ISIC archive, appears at far left.
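A programmatic color shift of the kind described in panel b can be sketched with standard tooling. This version shifts the u*/v* channels uniformly; the exact chromaticity coordinates and magnitudes used in the paper's Methods may differ, so treat this as an illustrative assumption:

```python
import numpy as np
from skimage import color

def shift_chromaticity(rgb_image, du=0.0, dv=0.0):
    """Shift an image's chromaticity in CIELUV while leaving lightness L* intact."""
    luv = color.rgb2luv(rgb_image)   # channels are (L*, u*, v*)
    luv[..., 1] += du                # u*: green <-> red axis
    luv[..., 2] += dv                # v*: blue <-> yellow axis
    return np.clip(color.luv2rgb(luv), 0.0, 1.0)

# Comparing device(img) with device(shift_chromaticity(img, du, dv)) across a
# dataset would then trace out a sensitivity curve like the one in panel c.
```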
Fig. 4 | Explanations of failure cases of dermatology AI devices, illustrating key findings from our systematic analysis.
a, The presence of atypical pigment networks (black arrows) and darker pigmentation (white arrows) contributed to a false-positive prediction from Scanoma. b, Curiously, ModelDerm may have required lighter pigmentation (black arrows), increased erythema (white arrows), and less hair on the background skin (gray arrows) to correctly predict that this image pictures a melanoma. c, The lack of prominent skin grooves or reticulation on the background skin (black arrows), alongside darker pigmentation (white arrows), contributed to another false-positive prediction from Scanoma. Images adapted, with permission, from the ISIC archive.

References

    1. Wu E. et al. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nature Medicine 27, 582–584 (2021). - PubMed
    2. Reddy S. Explainability and artificial intelligence in medicine. The Lancet Digital Health 4, E214–E215 (2022). - PubMed
    3. Young A. T. et al. Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models. npj Digital Medicine 4 (2021). - PMC - PubMed
    4. DeGrave A. J., Janizek J. D. & Lee S.-I. AI for radiographic COVID-19 detection selects shortcuts over signal. Nature Machine Intelligence (2021).
    5. Singh N. et al. Agreement between saliency maps and human-labeled regions of interest: applications to skin disease classification (2020).