Indian J Radiol Imaging. 2024 Jul 4;34(4):653-660. doi: 10.1055/s-0044-1787974. eCollection 2024 Oct.

Radiologic Decision-Making for Imaging in Pulmonary Embolism: Accuracy and Reliability of Large Language Models-Bing, Claude, ChatGPT, and Perplexity

Pradosh Kumar Sarangi et al. Indian J Radiol Imaging.

Abstract

Background: Artificial intelligence chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, their contribution to radiologic decision-making in clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent large language models (LLMs), Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity, in offering clinical decision support for initial imaging in suspected pulmonary embolism (PE).

Methods: Open-ended (OE) and select-all-that-apply (SATA) questions were crafted, covering four variants of PE case scenarios in line with the American College of Radiology Appropriateness Criteria. These questions were presented to the LLMs by three radiologists from diverse geographical regions and practice settings. The responses were evaluated against established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score).

Results: In OE questions, Perplexity achieved the highest accuracy (0.83), while Claude had the lowest (0.58); Bing and ChatGPT each scored 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was lowest at 0.56, and both Claude and ChatGPT scored 0.6. Overall, OE questions yielded higher scores (0.73) than SATA questions (0.68). Agreement among radiologists' scores was poor for OE (intraclass correlation coefficient [ICC] = -0.067, p = 0.54) but strong for SATA (ICC = 0.875, p < 0.001).

Conclusion: The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity performed best on OE questions, while Bing excelled on SATA questions, and OE queries yielded better overall results. The current inconsistencies in LLM accuracy highlight the importance of further refinement before these tools can be reliably integrated into clinical practice, including additional LLM fine-tuning and judicious selection by radiologists, to achieve consistent and reliable decision support.
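For readers who want to follow the scoring arithmetic, the sketch below (Python, not the authors' code) illustrates how per-question scores could be normalized by the maximum achievable score and how inter-rater agreement could be estimated with an ICC. The data frame columns, the example values, and the use of the pingouin library are assumptions made for illustration only.

```python
# A minimal sketch, assuming hypothetical scores, of the scoring workflow
# described in the abstract: raw scores are normalized by the maximum
# achievable score (2 for OE answers, 1 per correct option for SATA),
# and inter-rater agreement is summarized with an intraclass correlation
# coefficient (ICC).
import pandas as pd
import pingouin as pg  # provides intraclass_corr()

# Hypothetical raw scores: one row per (question, radiologist) pair.
scores = pd.DataFrame({
    "question":  ["V1_OE", "V1_OE", "V1_OE", "V2_OE", "V2_OE", "V2_OE",
                  "V1_SATA", "V1_SATA", "V1_SATA", "V2_SATA", "V2_SATA", "V2_SATA"],
    "rater":     ["RAD1", "RAD2", "RAD3"] * 4,
    "raw_score": [2, 0, 2, 1, 2, 2, 3, 3, 2, 4, 3, 4],
    "max_score": [2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4],
})

# Normalization as described in the abstract: score / maximum achievable score.
scores["normalized"] = scores["raw_score"] / scores["max_score"]

# Inter-rater agreement (ICC) on the normalized scores.
icc = pg.intraclass_corr(
    data=scores, targets="question", raters="rater", ratings="normalized"
)
print(icc[["Type", "ICC", "pval"]])
```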

Keywords: American College of Radiology Appropriateness Criteria; Bing; ChatGPT; Claude; Perplexity; large language model; pulmonary embolism.


Conflict of interest statement

Conflict of Interest: None declared.

Figures

Fig. 1: Accuracy scores of the four large language models on open-ended and select-all-that-apply questions.

Fig. 2: (A, B) Bing's response to the variant 1 OE prompt was accurate for RAD1 (score 2), whereas the same prompt yielded an inaccurate answer for RAD2 (score 0). OE, open-ended; RAD, radiologist.

