ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology
- PMID: 38995378
- PMCID: PMC11632015
- DOI: 10.1007/s00330-024-10902-5
Abstract
Objectives: To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V) based ChatGPT, and radiologists in musculoskeletal radiology.
Materials and methods: We included 106 "Test Yourself" cases from Skeletal Radiology published between January 2014 and September 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT, and the medical history and images into GPT-4V-based ChatGPT; each model then generated a diagnosis for every case. Two radiologists (a radiology resident and a board-certified radiologist) independently provided diagnoses for all cases. Diagnostic accuracy rates were determined against the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and the radiologists.
Results: GPT-4-based ChatGPT significantly outperformed GPT-4V-based ChatGPT (p < 0.001), with accuracy rates of 43% (46/106) and 8% (9/106), respectively. The radiology resident and the board-certified radiologist achieved accuracy rates of 41% (43/106) and 53% (56/106), respectively. The diagnostic accuracy of GPT-4-based ChatGPT was comparable to that of the radiology resident and lower than that of the board-certified radiologist, although neither difference was significant (p = 0.78 and p = 0.22, respectively). The diagnostic accuracy of GPT-4V-based ChatGPT was significantly lower than that of both radiologists (p < 0.001 for each).
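The GPT-4 vs. GPT-4V comparison above can be reproduced from the reported counts (46/106 vs. 9/106) with a standard 2x2 chi-square test. The sketch below uses the Pearson statistic without continuity correction; this is an assumption for illustration, since the abstract does not specify which variant the authors used.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (df = 1) for a 2x2 table.

    Table layout:            correct  incorrect
        group 1 (GPT-4)         a         b
        group 2 (GPT-4V)        c         d
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Reported counts: GPT-4 correct 46, incorrect 60; GPT-4V correct 9, incorrect 97.
chi2, p = chi_square_2x2(46, 60, 9, 97)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")  # p is far below 0.001
```

The resulting p-value is consistent with the abstract's reported p < 0.001 for this comparison.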
Conclusion: GPT-4-based ChatGPT demonstrated significantly higher diagnostic accuracy than GPT-4V-based ChatGPT. While GPT-4-based ChatGPT's diagnostic performance was comparable to radiology residents, it did not reach the performance level of board-certified radiologists in musculoskeletal radiology.
Clinical relevance statement: GPT-4-based ChatGPT outperformed GPT-4V-based ChatGPT and was comparable to radiology residents, but it did not reach the level of board-certified radiologists in musculoskeletal radiology. Radiologists should comprehend ChatGPT's current performance as a diagnostic tool for optimal utilization.
Key points: This study compared the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in musculoskeletal radiology. GPT-4-based ChatGPT was comparable to radiology residents, but did not reach the level of board-certified radiologists. When utilizing ChatGPT, it is crucial to input appropriate descriptions of imaging findings rather than the images.
Keywords: Artificial intelligence; Natural language processing; Radiology.
© 2024. The Author(s).
Conflict of interest statement
Compliance with ethical standards.
Guarantor: The scientific guarantor of this publication is Daiju Ueda.
Conflict of interest: The authors of this manuscript declare no relationships with any companies whose products or services may be related to the subject matter of the article.
Statistics and biometry: No complex statistical methods were necessary for this paper.
Informed consent: Written informed consent was not required because this study utilized published cases.
Ethical approval: Institutional Review Board approval was obtained.
Study subjects or cohorts overlap: No study subjects or cohorts have been previously reported.
Methodology: Retrospective, diagnostic or prognostic study, performed at one institution.