J Med Internet Res. 2023 Dec 22:25:e50865.
doi: 10.2196/50865.

Evaluation of GPT-4's Chest X-Ray Impression Generation: A Reader Study on Performance and Perception

Sebastian Ziegelmayer et al. J Med Internet Res. 2023.

Abstract

Exploring the generative capabilities of the multimodal GPT-4, our study uncovered significant differences between radiological assessments and automatic evaluation metrics for chest x-ray impression generation and revealed radiological bias.

Keywords: AI; GPT; artificial intelligence; chest; diagnostic; generative; generative model; image; images; imaging; impression; impressions; medical imaging; multimodal; radiography; radiological; radiology; x-ray; x-rays.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1
Scatterplots for each automated metric (BERT=blue; BLEU=yellow; CheXbert vector similarity=gray; RadGraph=light blue; RadCliQ=red) depending on the input: (A) image, (B) text, or (C) text and image. For the image input, all metrics except CheXbert vector similarity showed a significant correlation. For the text input and the combined text and image input, however, the correlations were divergent or opposing. All correlation coefficients with their P values are shown in the lower section of the figure. BERT: Bidirectional Encoder Representations From Transformers; BLEU: bilingual evaluation understudy.
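
The kind of correlation reported in the figure can be illustrated with a minimal sketch. The caption names neither the coefficient used nor the quantity the metrics are compared against, so this sketch assumes a Pearson coefficient between an automated metric score and a radiological reader score. The function name `pearson_r` and all data values are hypothetical placeholders, not results from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder values, NOT data from the study:
# one automated metric score and one reader score per generated impression.
metric_scores = [0.10, 0.25, 0.40, 0.55, 0.70]
reader_scores = [1, 2, 2, 4, 5]

r = pearson_r(metric_scores, reader_scores)
print(f"Pearson r = {r:.3f}")
```

A coefficient near +1 would mean the metric rises with the reader score, and a coefficient near -1 would mean it moves in the opposite direction, which is one way to read the "opposing" correlations described in the caption. P values, as reported in the figure, would require an additional significance test that this sketch omits.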
