J Am Coll Radiol. 2023 Oct;20(10):990-997.
doi: 10.1016/j.jacr.2023.05.003. Epub 2023 Jun 21.

Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot


Arya Rao et al. J Am Coll Radiol. 2023 Oct.

Abstract

Objective: Despite rising popularity and performance, studies evaluating the use of large language models for clinical decision support are lacking. Here, we evaluate the capacity of ChatGPT-3.5 and GPT-4 (Generative Pre-trained Transformer; OpenAI, San Francisco, California) for clinical decision support in radiology via the identification of appropriate imaging services for two important clinical presentations: breast cancer screening and breast pain.

Methods: We compared ChatGPT's responses to the ACR Appropriateness Criteria for breast pain and breast cancer screening. Our prompt formats included an open-ended (OE) and a select all that apply (SATA) format. Scoring criteria evaluated whether proposed imaging modalities were in accordance with ACR guidelines. Three replicate entries were conducted for each prompt, and the average of these was used to determine final scores.

Results: Both ChatGPT-3.5 and ChatGPT-4 achieved an average OE score of 1.830 (out of 2) for breast cancer screening prompts. ChatGPT-3.5 achieved a SATA average percentage correct of 88.9%, compared with ChatGPT-4's average percentage correct of 98.4% for breast cancer screening prompts. For breast pain, ChatGPT-3.5 achieved an average OE score of 1.125 (out of 2) and a SATA average percentage correct of 58.3%, as compared with ChatGPT-4's average OE score of 1.666 (out of 2) and SATA average percentage correct of 77.7%.

Discussion: Our results demonstrate the eventual feasibility of using large language models like ChatGPT for radiologic decision making, with the potential to improve clinical workflow and responsible use of radiology services. More use cases and greater accuracy are necessary to evaluate and implement such tools.

Keywords: AI; ChatGPT; breast imaging; clinical decision making; clinical decision support.


Conflict of interest statement

The authors state that they have no conflicts of interest related to the material discussed in this article. The authors are non-partner, non-partnership-track employees.

Figures

Fig. 1.
Schematic of experimental workflow. Prompts were developed from ACR variants for breast cancer screening and breast pain and converted to OE and SATA formats. Three independent users tested each prompt. Two independent scorers calculated scores for all outputs; these were compared to generate a consensus score. OE = open-ended; SATA = select all that apply.
Fig. 2.
Scoring criteria for OE and SATA prompts. Answers to OE prompts were scored on a 0 to 2 scale, in accordance with the ACR metrics for imaging appropriateness. If multiple imaging modalities were provided for a single prompt, an individual raw score was calculated for each modality, and these were averaged. Answers to SATA prompts were scored on a point or no point basis for each imaging modality provided. The maximum possible SATA score for a given variant was equal to the number of imaging procedures evaluated in the ACR criteria. OE = open-ended; SATA = select all that apply.
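To make the scoring criteria concrete, the sketch below restates them in Python. It is a minimal illustration of the scheme described in the caption, assuming hypothetical function names, modality labels, and replicate values; it is not the authors' scoring code.

from statistics import mean, stdev

def score_oe(modality_scores):
    """Open-ended (OE) prompt: each proposed imaging modality receives a raw
    score of 0-2 against the ACR appropriateness rating; if several modalities
    are proposed, the per-prompt score is their average."""
    return mean(modality_scores)

def score_sata(selections, acr_appropriate):
    """Select-all-that-apply (SATA) prompt: one point per modality whose
    selection agrees with the ACR criteria; reported as the proportion correct
    out of all modalities evaluated for the variant."""
    correct = sum(1 for modality, chosen in selections.items()
                  if chosen == acr_appropriate[modality])
    return correct / len(acr_appropriate)

# Hypothetical SATA replicate for a variant with four modalities evaluated.
acr = {"mammography": True, "ultrasound": True, "breast MRI": False, "FDG-PET": False}
chosen = {"mammography": True, "ultrasound": False, "breast MRI": False, "FDG-PET": False}
sata_replicate = score_sata(chosen, acr)   # 0.75 -> 75% correct for this replicate

# Three replicate outputs per prompt are averaged to give the final score;
# error bars in Figs. 3 and 4 are +/- 1 standard deviation across replicates.
oe_replicates = [2.0, 1.5, 2.0]            # hypothetical OE raw scores
final_oe = mean(oe_replicates)             # ~1.83 out of 2
oe_error = stdev(oe_replicates)            # +/- 1 SD

Under this scheme, averaging the three replicate proportions per variant yields the SATA percentages reported in the Results, and the replicate standard deviation supplies the error bars shown in Figures 3 and 4.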
Fig. 3.
Performance of ChatGPT on OE prompts for breast cancer screening variants (A) and breast pain variants (B). OE performance was measured by the average raw score of the three replicate output scores for each variant (labeled according to the numbering in the ACR criteria variants). Error bars are ±1 standard deviation between the three replicate output scores. OE = open-ended; V1 = variant 1; V2 = variant 2; V3 = variant 3.
Fig. 4.
Performance of ChatGPT on SATA prompts for breast cancer screening variants (A) and breast pain variants (B). SATA performance was measured by the average proportion of correct answer selections for each variant from the three replicate output scores. Error bars for both prompt types are ±1 standard deviation between the three replicate output scores. SATA = select all that apply; V1 = variant 1; V2 = variant 2; V3 = variant 3; V4 = variant 4.

