Optimizing Diagnostic Performance of ChatGPT: The Impact of Prompt Engineering on Thoracic Radiology Cases
- PMID: 38854352
- PMCID: PMC11162509
- DOI: 10.7759/cureus.60009
Abstract
Background: Recent studies have highlighted the diagnostic performance of ChatGPT 3.5 and GPT-4 in a text-based format, demonstrating their radiological knowledge across different areas. Our objective was to investigate the impact of prompt engineering on the diagnostic performance of ChatGPT 3.5 and GPT-4 in diagnosing thoracic radiology cases, highlighting how prompt complexity influences model performance.
Methodology: We conducted a retrospective cross-sectional study using 124 publicly available Case of the Month examples from the Thoracic Society of Radiology website. We initially input the cases into both ChatGPT versions without prompting. We then employed five different prompts, ranging from basic task-oriented to complex role-specific formulations, to measure the diagnostic accuracy of each version. The differential diagnosis lists generated by the models were compared against the radiological diagnoses listed on the Thoracic Society of Radiology website, with a scoring system in place to comprehensively assess accuracy. Diagnostic accuracy and differential diagnosis scores were analyzed using the McNemar, chi-square, Kruskal-Wallis, and Mann-Whitney U tests.
Results: Without any prompt, ChatGPT 3.5's accuracy was 25% (31/124), which increased to 56.5% (70/124) with the most complex prompt (P < 0.001). GPT-4 showed a high baseline accuracy of 53.2% (66/124) without prompting, which increased to 59.7% (74/124) with complex prompts (P = 0.09). Notably, there was no statistically significant difference in peak performance between ChatGPT 3.5 (70/124) and GPT-4 (74/124) (P = 0.55).
Conclusions: This study emphasizes the critical influence of prompt engineering on enhancing the diagnostic performance of ChatGPT versions, especially ChatGPT 3.5.
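The within-model comparisons reported above (e.g., ChatGPT 3.5 correct on 31/124 cases without a prompt versus 70/124 with the most complex prompt, P < 0.001) are paired contrasts of the kind McNemar's test addresses. The sketch below is not the authors' code; it only illustrates the calculation, using hypothetical per-case discordant counts chosen to be consistent with the marginal accuracies reported in the abstract.

```python
# Minimal sketch (assumption: Python with statsmodels), illustrating McNemar's test
# on paired diagnostic outcomes for the same 124 cases answered with and without a prompt.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 table of paired outcomes across the 124 cases:
# rows = without prompt (correct, incorrect), cols = with complex prompt (correct, incorrect)
# The cell counts are HYPOTHETICAL; only the row/column totals (31/124 and 70/124)
# come from the abstract.
table = [
    [30, 1],    # correct without prompt: 30 remained correct, 1 became incorrect (hypothetical split)
    [40, 53],   # incorrect without prompt: 40 corrected by the prompt, 53 still incorrect (hypothetical split)
]

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"McNemar statistic = {result.statistic}, p = {result.pvalue:.4f}")
```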
Keywords: chat generative pre-trained transformer (chatgpt); gpt-4; large language models; prompt engineering; radiology.
Copyright © 2024, Cesur et al.
Conflict of interest statement
The authors have declared that no competing interests exist.
Similar articles
- Comparing the Diagnostic Performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and Radiologists in Challenging Neuroradiology Cases. Clin Neuroradiol. 2024 Dec;34(4):779-787. doi: 10.1007/s00062-024-01426-y. Epub 2024 May 28. PMID: 38806794.
- Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models. JMIR Med Educ. 2024 Feb 13;10:e51391. doi: 10.2196/51391. PMID: 38349725. Free PMC article.
- ChatGPT's diagnostic performance based on textual vs. visual information compared to radiologists' diagnostic performance in musculoskeletal radiology. Eur Radiol. 2025 Jan;35(1):506-516. doi: 10.1007/s00330-024-10902-5. Epub 2024 Jul 12. PMID: 38995378. Free PMC article.
- Optimizing Large Language Models in Radiology and Mitigating Pitfalls: Prompt Engineering and Fine-tuning. Radiographics. 2025 Apr;45(4):e240073. doi: 10.1148/rg.240073. PMID: 40048389. Review.
- How to Harness the Power of GPT for Scientific Research: A Comprehensive Review of Methodologies, Applications, and Ethical Considerations. Nucl Med Mol Imaging. 2024 Oct;58(6):323-331. doi: 10.1007/s13139-024-00876-z. Epub 2024 Aug 12. PMID: 39308492. Free PMC article. Review.
Cited by
- Comparison of ChatGPT and Internet Research for Clinical Research and Decision-Making in Occupational Medicine: Randomized Controlled Trial. JMIR Form Res. 2025 May 20;9:e63857. doi: 10.2196/63857. PMID: 40393042. Free PMC article. Clinical Trial.
- Is ChatGPT a Reliable Tool for Explaining Medical Terms? Cureus. 2025 Jan 10;17(1):e77258. doi: 10.7759/cureus.77258. eCollection 2025 Jan. PMID: 39931624. Free PMC article.
- Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions. Ultrasonography. 2025 May;44(3):220-231. doi: 10.14366/usg.25012. Epub 2025 Mar 11. PMID: 40235070. Free PMC article.
- Gender Differences in the Use of ChatGPT as Generative Artificial Intelligence for Clinical Research and Decision-Making in Occupational Medicine. Healthcare (Basel). 2025 Jun 11;13(12):1394. doi: 10.3390/healthcare13121394. PMID: 40565419. Free PMC article.
- Evaluating the influence of prompt formulation on the reliability and repeatability of ChatGPT in implant-supported prostheses. PLoS One. 2025 May 30;20(5):e0323086. doi: 10.1371/journal.pone.0323086. eCollection 2025. PMID: 40445924. Free PMC article.