Comparative Study
Eur Radiol. 2025 Jan;35(1):506-516. doi: 10.1007/s00330-024-10902-5. Epub 2024 Jul 12.

ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology

Daisuke Horiuchi et al. Eur Radiol. 2025 Jan.

Abstract

Objectives: To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V) based ChatGPT, and radiologists in musculoskeletal radiology.

Materials and methods: We included 106 "Test Yourself" cases from Skeletal Radiology between January 2014 and September 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT and the medical history and images into GPT-4V-based ChatGPT, then both generated a diagnosis for each case. Two radiologists (a radiology resident and a board-certified radiologist) independently provided diagnoses for all cases. The diagnostic accuracy rates were determined based on the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists.
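The chi-square comparisons described above can be sketched as follows. This is a minimal illustration, assuming Yates-corrected chi-square tests on 2×2 tables of correct/incorrect counts; the abstract confirms chi-square tests were used, but the continuity correction and software are assumptions, and the counts are taken from the reported results.

```python
import math


def chi2_yates(a, b, c, d):
    """Yates-corrected chi-square test for a 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). For 1 degree of freedom, the
    chi-square survival function equals erfc(sqrt(x / 2)), so no
    external statistics library is needed.
    """
    n = a + b + c + d
    # Continuity-corrected statistic: n * (|ad - bc| - n/2)^2 / (row and column totals)
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    stat = numerator / denominator
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


# GPT-4-based ChatGPT (46/106 correct) vs. GPT-4V-based ChatGPT (9/106 correct)
stat_models, p_models = chi2_yates(46, 60, 9, 97)

# GPT-4-based ChatGPT (46/106) vs. the radiology resident (43/106)
stat_resident, p_resident = chi2_yates(46, 60, 43, 63)
```

With these counts, the model-vs-model comparison yields p < 0.001 and the model-vs-resident comparison yields a non-significant p value, consistent with the results reported below.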

Results: GPT-4-based ChatGPT significantly outperformed GPT-4V-based ChatGPT (p < 0.001), with accuracy rates of 43% (46/106) and 8% (9/106), respectively. The radiology resident and the board-certified radiologist achieved accuracy rates of 41% (43/106) and 53% (56/106), respectively. The diagnostic accuracy of GPT-4-based ChatGPT was comparable to that of the radiology resident and lower than that of the board-certified radiologist, although neither difference was statistically significant (p = 0.78 and p = 0.22, respectively). The diagnostic accuracy of GPT-4V-based ChatGPT was significantly lower than that of both radiologists (p < 0.001 for each).

Conclusion: GPT-4-based ChatGPT demonstrated significantly higher diagnostic accuracy than GPT-4V-based ChatGPT. While GPT-4-based ChatGPT's diagnostic performance was comparable to that of the radiology resident, it did not reach the performance level of the board-certified radiologist in musculoskeletal radiology.

Clinical relevance statement: GPT-4-based ChatGPT outperformed GPT-4V-based ChatGPT and was comparable to the radiology resident, but it did not reach the level of the board-certified radiologist in musculoskeletal radiology. Radiologists should understand ChatGPT's current diagnostic performance to use it optimally.

Key points: This study compared the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in musculoskeletal radiology. GPT-4-based ChatGPT was comparable to the radiology resident but did not reach the level of the board-certified radiologist. When using ChatGPT, it is crucial to input appropriate textual descriptions of imaging findings rather than the images themselves.

Keywords: Artificial intelligence; Natural language processing; Radiology.


Conflict of interest statement

Compliance with ethical standards.
Guarantor: The scientific guarantor of this publication is Daiju Ueda.
Conflict of interest: The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.
Statistics and biometry: No complex statistical methods were necessary for this paper.
Informed consent: Written informed consent was not required because this study used published cases.
Ethical approval: Institutional Review Board approval was obtained.
Study subjects or cohorts overlap: No study subjects or cohorts have been previously reported.
Methodology: retrospective, diagnostic or prognostic study, performed at one institution.

Figures

Fig. 1
Data collection flowchart
Fig. 2
Input (patient’s medical history and imaging findings) and output examples of GPT-4-based ChatGPT. a Input texts to ChatGPT. b Output texts generated by ChatGPT. The differential diagnoses are outlined in blue and the final diagnosis is outlined in red. The final diagnosis generated by ChatGPT is correct in this case [33, 34]
Fig. 3
Input (patient’s medical history and images) and output examples of GPT-4V-based ChatGPT. a Input to ChatGPT. b Output texts generated by ChatGPT. The differential diagnoses are outlined in blue and the final diagnosis is outlined in red. The final diagnosis generated by ChatGPT is correct in this case [33, 34]
Fig. 4
A challenging case example for GPT-4-based ChatGPT. a Input texts (patient’s medical history and imaging findings) to ChatGPT. b Output texts generated by ChatGPT. The differential diagnoses are outlined in blue and the final diagnosis is outlined in red. While the differential diagnoses generated by ChatGPT include the correct diagnosis, the final diagnosis is incorrect in this case (true diagnosis: parosteal osteosarcoma) [35, 36]
Fig. 5
A challenging case example for GPT-4V-based ChatGPT. a Input (patient’s medical history and images) to ChatGPT. b Output texts generated by ChatGPT. The differential diagnoses are outlined in blue; however, ChatGPT’s diagnosis is incorrect in this case (true diagnosis: parosteal osteosarcoma) [35, 36]
Fig. 6
Diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists

References

    1. OpenAI (2023) GPT-4 technical report. arXiv [cs.CL]. 10.48550/arXiv.2303.08774
    2. Brown TB, Mann B, Ryder N et al (2020) Language models are few-shot learners. arXiv [cs.CL]. 10.48550/arXiv.2005.14165
    3. Bubeck S, Chandrasekaran V, Eldan R et al (2023) Sparks of artificial general intelligence: early experiments with GPT-4. arXiv [cs.CL]. 10.48550/arXiv.2303.12712
    4. Eloundou T, Manning S, Mishkin P, Rock D (2023) GPTs are GPTs: an early look at the labor market impact potential of large language models. arXiv [econ.GN]. 10.48550/arXiv.2303.10130
    5. OpenAI (2023) GPT-4V(ision) system card. Available via https://openai.com/research/gpt-4v-system-card. Accessed 13 Oct 2023
