Do as AI say: susceptibility in deployment of clinical decision-aids

Susanne Gaube et al.
NPJ Digit Med. 2021 Feb 19;4(1):31. doi: 10.1038/s41746-021-00385-9.

Abstract

Artificial intelligence (AI) models for decision support have been developed for clinical settings such as radiology, but little work has evaluated the potential impact of such systems. In this study, physicians received chest X-rays and diagnostic advice, some of which was inaccurate, and were asked to evaluate advice quality and make diagnoses. All advice was generated by human experts, but some was labeled as coming from an AI system. As a group, radiologists rated advice as lower quality when it appeared to come from an AI system; physicians with less task expertise did not. Diagnostic accuracy was significantly worse when participants received inaccurate advice, regardless of the purported source. This work raises important considerations for how advice, whether AI-generated or not, should be deployed in clinical environments.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1. Overview of the experiment.
Each participant reviewed eight cases. For each case, the physician saw a chest X-ray along with diagnostic advice, which was either accurate or inaccurate. The advice was labeled as coming from either an AI system or an experienced radiologist. Participants were then asked to rate the quality of the advice and make a final diagnosis.
Fig. 2
Fig. 2. Advice quality rating across advice accuracy and source.
We demonstrate the effect of advice accuracy and advice source on the quality rating across both types of physicians: task experts (radiologists) and non-experts (IM/EM physicians). In (a) we compare clinical advice ratings across accuracy, demonstrating that while both groups rated accurate advice as high-quality, only task experts rated inaccurate advice as low-quality. In (b) we compare clinical advice ratings across source, demonstrating that only the task experts rated purported human advice as significantly higher quality. There is no significant interaction between advice accuracy and advice source. The boxplots show 25th to 75th percentiles (lower and upper hinges) with the median depicted by the central line; the whiskers extend to a maximum of 1.5× interquartile range (IQR) beyond the boxes. *p ≤ 0.05, **p ≤ 0.001, ns = not significant.
Fig. 3
Fig. 3. Diagnostic accuracy across advice accuracy and source.
We demonstrate the effect of advice accuracy and advice source on diagnostic accuracy for task experts (radiologists) and non-experts (IM/EM physicians). In (a) we compare diagnostic accuracy across advice accuracy, demonstrating that both groups perform better when they receive accurate advice. In (b) we compare diagnostic accuracy across advice source, demonstrating that neither group showed a significant difference in diagnostic accuracy depending on the source of advice. There is no significant interaction between advice accuracy and advice source. The error bars represent confidence intervals. *p ≤ 0.05, **p ≤ 0.001, ns = not significant.
Fig. 4
Fig. 4. Individual performance.
We show the individual performance of radiologists (a) and IM/EM physicians (b) sorted in increasing order by the number of cases they correctly diagnosed. Each physician’s individual performance is split into cases with accurate advice (the lower, blue part of the bar) and inaccurate advice (the upper, red part of the bar). We further indicate Critical Performers, who always recognize inaccurate advice, and Susceptible Performers, who never do.
Fig. 5
Fig. 5. Case performance.
Individual case performance among radiologists and IM/EM physicians. Each participant reviewed all eight cases; case order and advice accuracy were randomized per participant.
