Randomized Controlled Trial

Measuring the Impact of AI in the Diagnosis of Hospitalized Patients: A Randomized Clinical Vignette Survey Study

Sarah Jabbour et al. JAMA. 2023 Dec 19;330(23):2275-2284. doi: 10.1001/jama.2023.22295.

Abstract

Importance: Artificial intelligence (AI) could support clinicians when diagnosing hospitalized patients; however, systematic bias in AI models could worsen clinician diagnostic accuracy. Recent regulatory guidance has called for AI models to include explanations to mitigate errors made by models, but the effectiveness of this strategy has not been established.

Objectives: To evaluate the impact of systematically biased AI on clinician diagnostic accuracy and to determine if image-based AI model explanations can mitigate model errors.

Design, setting, and participants: Randomized clinical vignette survey study administered between April 2022 and January 2023 across 13 US states involving hospitalist physicians, nurse practitioners, and physician assistants.

Interventions: Clinicians were shown 9 clinical vignettes of patients hospitalized with acute respiratory failure, including their presenting symptoms, physical examination, laboratory results, and chest radiographs. Clinicians were then asked to determine the likelihood of pneumonia, heart failure, or chronic obstructive pulmonary disease as the underlying cause(s) of each patient's acute respiratory failure. To establish baseline diagnostic accuracy, clinicians were shown 2 vignettes without AI model input. Clinicians were then randomized to see 6 vignettes with AI model input with or without AI model explanations. Among these 6 vignettes, 3 vignettes included standard-model predictions, and 3 vignettes included systematically biased model predictions.

Main outcomes and measures: Clinician diagnostic accuracy for pneumonia, heart failure, and chronic obstructive pulmonary disease.

Results: Median participant age was 34 years (IQR, 31-39) and 241 (57.7%) were female. Four hundred fifty-seven clinicians were randomized and completed at least 1 vignette, with 231 randomized to AI model predictions without explanations, and 226 randomized to AI model predictions with explanations. Clinicians' baseline diagnostic accuracy was 73.0% (95% CI, 68.3% to 77.8%) for the 3 diagnoses. When shown a standard AI model without explanations, clinician accuracy increased over baseline by 2.9 percentage points (95% CI, 0.5 to 5.2) and by 4.4 percentage points (95% CI, 2.0 to 6.9) when clinicians were also shown AI model explanations. Systematically biased AI model predictions decreased clinician accuracy by 11.3 percentage points (95% CI, 7.2 to 15.5) compared with baseline and providing biased AI model predictions with explanations decreased clinician accuracy by 9.1 percentage points (95% CI, 4.9 to 13.2) compared with baseline, representing a nonsignificant improvement of 2.3 percentage points (95% CI, -2.7 to 7.2) compared with the systematically biased AI model.

Conclusions and relevance: Although standard AI models improved diagnostic accuracy, systematically biased AI models reduced it, and commonly used image-based AI model explanations did not mitigate this harmful effect.

Trial registration: ClinicalTrials.gov Identifier: NCT06098950.


Conflict of interest statement

Conflict of Interest Disclosures: Dr Banovic reported receiving grants from the US Department of Energy, Toyota Research Institute, and National Science Foundation outside the submitted work. Dr Wiens reported receiving grants from the Alfred P. Sloan Foundation during the conduct of the study and serving on the advisory board of Machine Learning for Healthcare, a nonprofit organization that hosts a yearly academic conference. Dr Sjoding reported receiving royalties for a patent from Airstrip outside the submitted work. No other disclosures were reported.

Figures

Figure 1. Randomization and Study Flow Diagram for the 9 Clinical Vignettes
After completing informed consent, participants were randomized to artificial intelligence (AI) predictions with or without explanations, and all participants were also randomized to 1 of 3 types of systematically biased AI models during a subset of vignettes in the study. The 3 systematically biased AI models included a model predicting pneumonia if the patient was aged 80 years or older, a model predicting heart failure if body mass index (BMI, calculated as weight in kilograms divided by height in meters squared) was 30 or higher, and a model predicting chronic obstructive pulmonary disease (COPD) if a blur was applied to the radiograph (illustrated in the sketch below). Participants were first shown 2 vignettes without AI predictions to measure baseline diagnostic accuracy. The next 6 vignettes included AI predictions. If the participant was randomized to see AI explanations, the participant was also shown an AI model explanation alongside the AI predictions. Three vignettes had standard AI predictions and 3 had biased AI predictions, shown in random order. The final vignette included a clinical consultation, a short narrative provided by a hypothetical trusted colleague who identified the correct diagnosis and their diagnostic rationale.
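The 3 bias rules amount to simple deterministic overrides layered on top of a standard model's output. The following Python sketch is a hypothetical illustration only: the function and field names, the 0.95 override probability, and the standard_model stand-in are assumptions, not the study's actual implementation.

```python
# Hypothetical illustration of the 3 systematic-bias rules described above.
# The 0.95 override probability and all names are assumptions; the study's
# actual image-based models are not reproduced here.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Vignette:
    age: int                  # years
    bmi: float                # kg/m^2
    radiograph_blurred: bool  # True if the blur perturbation was applied

def biased_predict(
    v: Vignette,
    bias_type: str,
    standard_model: Callable[[Vignette], Dict[str, float]],
) -> Dict[str, float]:
    """Apply 1 of the 3 systematic-bias rules on top of a standard model."""
    probs = dict(standard_model(v))  # e.g. {"pneumonia": 0.2, "heart_failure": 0.6, "copd": 0.1}
    if bias_type == "age" and v.age >= 80:
        probs["pneumonia"] = 0.95      # biased model predicts pneumonia if aged >= 80
    elif bias_type == "bmi" and v.bmi >= 30:
        probs["heart_failure"] = 0.95  # biased model predicts heart failure if BMI >= 30
    elif bias_type == "blur" and v.radiograph_blurred:
        probs["copd"] = 0.95           # biased model predicts COPD if radiograph is blurred
    return probs
```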
Figure 2. Examples of Model Predictions and Explanations for Standard and Systematically Biased AI Models for Patients
Patient 1 is an 81-year-old male with respiratory failure from heart failure. A, The standard AI model correctly diagnosed heart failure as the cause of acute respiratory failure and provided an explanation highlighting areas in the chest radiograph used to make the prediction. B, The systematically biased AI model incorrectly diagnosed pneumonia as the cause of acute respiratory failure due to the patient’s age and provided an explanation highlighting irrelevant features in the chest radiograph. Standard-model predictions for heart failure (the correct diagnosis) are also provided. Patient 2 is an 88-year-old female with respiratory failure from COPD. C, The standard AI model incorrectly diagnosed pneumonia and correctly diagnosed COPD as the cause of respiratory failure and provided reasonable explanations. D, The biased AI model incorrectly diagnosed pneumonia as the cause of respiratory failure due to patient age and provided an explanation highlighting irrelevant features in the chest radiograph. Standard-model predictions for COPD (the correct diagnosis) are also provided.
Figure 3. Baseline Diagnostic Accuracy Without AI Models and Percentage Point Differences in Accuracy Across Clinical Vignette Settings
Baseline indicates diagnostic accuracy for heart failure, pneumonia, and chronic obstructive pulmonary disease (COPD) when shown clinical vignettes of patients with acute respiratory failure without AI model input; standard model, diagnostic accuracy when shown clinical vignettes and standard AI model diagnostic predictions about whether the patient has heart failure, pneumonia, and/or COPD; standard model plus explanations, diagnostic accuracy when shown standard AI predictions and an image-based AI explanation of the model's reasoning for making a prediction within vignettes; systematically biased model, diagnostic accuracy when shown systematically biased AI predictions of low accuracy within vignettes; systematically biased model plus explanations, diagnostic accuracy when shown biased model predictions and explanations within vignettes; and clinical consultation, diagnostic accuracy when provided a short narrative describing the rationale for the correct diagnosis within the vignette. Subgroup analysis included diagnostic accuracy specific to heart failure, pneumonia, and COPD; clinician profession, including 142 nurse practitioners or physician assistants and 274 physicians; and prior clinical decision–support interaction, including 132 participants who had prior experience interacting with clinical decision support systems and 286 who did not. Diagnostic accuracy and percentage point differences in accuracy were determined by calculating predictive margins and contrasts across vignette settings after fitting a cross-classified generalized random-effects model of diagnostic accuracy.
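As a rough illustration of the last sentence of this caption, the sketch below fits a logistic model of diagnostic accuracy with crossed random intercepts for participant and vignette using statsmodels' Bayesian mixed GLM; predictive margins would then be obtained by averaging predicted probabilities under each setting. The data file and column names are hypothetical assumptions, and this is not the authors' code.

```python
# Minimal sketch of a cross-classified random-effects model of diagnostic
# accuracy. All file and column names are hypothetical.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Long format: one row per participant-vignette-diagnosis judgment.
#   correct: 1 if the clinician's assessment matched the reference label
#   setting: baseline, standard, standard_expl, biased, biased_expl, consult
df = pd.read_csv("vignette_responses.csv")

model = BinomialBayesMixedGLM.from_formula(
    "correct ~ C(setting)",  # fixed effects: contrasts vs the baseline setting
    vc_formulas={
        "participant": "0 + C(participant_id)",  # random intercept per clinician
        "vignette": "0 + C(vignette_id)",        # crossed random intercept per vignette
    },
    data=df,
)
result = model.fit_vb()   # variational Bayes estimation
print(result.summary())   # setting contrasts on the log-odds scale
```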
Figure 4. Baseline Treatment Selection Accuracy Without AI Models and Percentage Point Differences in Accuracy Across Clinical Vignette Settings
Baseline treatment selection accuracy indicates accurate administration of antibiotics, diuretics, and/or steroids after reviewing vignettes of patients with acute respiratory failure without AI model input; correct model, treatment accuracy when shown a vignette with correct AI model diagnostic predictions of heart failure, pneumonia, and/or COPD; correct model plus explanations, treatment accuracy when shown a vignette with correct AI model diagnostic predictions and an image-based AI explanation of the model's reasoning for making a prediction; incorrect model, treatment accuracy when shown a vignette with incorrect AI model diagnostic predictions; and incorrect model plus explanations, treatment accuracy when shown incorrect AI model diagnostic predictions and explanations. Subgroup analysis included treatment selection accuracy specific to antibiotics, intravenous diuretics, and steroids. Treatment selection accuracy and percentage point differences in accuracy were determined by calculating predictive margins and contrasts across vignette settings after fitting a cross-classified generalized random-effects model of treatment selection accuracy across settings.


