Do as AI say: susceptibility in deployment of clinical decision-aids

Susanne Gaube et al.
NPJ Digit Med. 2021 Feb 19;4(1):31. doi: 10.1038/s41746-021-00385-9.

Abstract

Artificial intelligence (AI) models for decision support have been developed for clinical settings such as radiology, but little work has evaluated the potential impact of such systems. In this study, physicians received chest X-rays and diagnostic advice, some of which was inaccurate, and were asked to evaluate advice quality and make diagnoses. All advice was generated by human experts, but some was labeled as coming from an AI system. As a group, radiologists rated advice as lower quality when it appeared to come from an AI system; physicians with less task expertise did not. Diagnostic accuracy was significantly worse when participants received inaccurate advice, regardless of the purported source. This work raises important considerations for how advice, whether AI-generated or not, should be deployed in clinical environments.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of the experiment.
Each participant reviewed eight cases. For each case, the physician saw a chest X-ray along with diagnostic advice, which was either accurate or inaccurate. The advice was labeled as coming from either an AI system or an experienced radiologist. Participants were then asked to rate the quality of the advice and make a final diagnosis.
Fig. 2
Fig. 2. Advice quality rating across advice accuracy and source.
We demonstrate the effect of advice accuracy and advice source on the quality rating across both types of physicians: task experts (radiologists) and non-experts (IM/EM physicians). In (a) we compare clinical advice ratings across accuracy, demonstrating that while both groups rated accurate advice as high-quality, only task experts rated inaccurate advice as low-quality. In (b) we compare clinical advice ratings across source, demonstrating that only the task experts rated purported human advice as significantly higher quality. There is no significant interaction between advice accuracy and advice source. The boxplots show 25th to 75th percentiles (lower and upper hinges) with the median depicted by the central line; the whiskers extend to a maximum of 1.5× interquartile range (IQR) beyond the boxes. *p ≤ 0.05, **p ≤ 0.001, ns = not significant.
Fig. 3
Fig. 3. Diagnostic accuracy across advice accuracy and source.
We demonstrate the effect of advice accuracy and advice source on diagnostic accuracy for task experts (radiologists) and non-experts (IM/EM physicians). In (a) we compare diagnostic accuracy across advice accuracy, demonstrating that both groups perform better when they receive accurate advice. In (b) we compare diagnostic accuracy across advice source, demonstrating that neither group showed a significant difference in diagnostic accuracy depending on the source of advice. There is no significant interaction between advice accuracy and advice source. The error bars represent confidence intervals. *p ≤ 0.05, **p ≤ 0.001, ns = not significant.
Fig. 4
Fig. 4. Individual performance.
We show the individual performance of radiologists (a) and IM/EM physicians (b) sorted in increasing order by the number of cases they correctly diagnosed. Each physician’s individual performance is split into cases with accurate advice (the lower, blue part of the bar) and inaccurate advice (the upper, red part of the bar). We further indicate Critical Performers, who always recognize inaccurate advice, and Susceptible Performers, who never do.
Fig. 5
Fig. 5. Case performance.
Individual case performance among radiologists and IM/EM physicians. Each participant reviewed all eight cases; case order and advice accuracy were randomized per participant.
