Cureus. 2024 May 9;16(5):e60009.
doi: 10.7759/cureus.60009. eCollection 2024 May.

Optimizing Diagnostic Performance of ChatGPT: The Impact of Prompt Engineering on Thoracic Radiology Cases

Turay Cesur et al. Cureus. 2024.

Abstract

Background: Recent studies have highlighted the diagnostic performance of ChatGPT 3.5 and GPT-4 in a text-based format, demonstrating their radiological knowledge across different areas. Our objective was to investigate the impact of prompt engineering on the diagnostic performance of ChatGPT 3.5 and GPT-4 in diagnosing thoracic radiology cases, highlighting how prompt complexity influences model performance.

Methodology: We conducted a retrospective cross-sectional study using 124 publicly available Case of the Month examples from the Thoracic Society of Radiology website. We initially input the cases into both ChatGPT versions without a prompt. We then employed five different prompts, ranging from basic task-oriented to complex role-specific formulations, to measure the diagnostic accuracy of each version. The differential diagnosis lists generated by the models were compared against the radiological diagnoses listed on the Thoracic Society of Radiology website, using a scoring system to assess accuracy comprehensively. Diagnostic accuracy and differential diagnosis scores were analyzed using the McNemar, chi-square, Kruskal-Wallis, and Mann-Whitney U tests.

Results: Without any prompt, ChatGPT 3.5's accuracy was 25% (31/124), which increased to 56.5% (70/124) with the most complex prompt (P < 0.001). GPT-4 showed a higher baseline accuracy of 53.2% (66/124) without prompting, which increased to 59.7% (74/124) with complex prompts (P = 0.09). Notably, there was no statistically significant difference in peak performance between ChatGPT 3.5 (70/124) and GPT-4 (74/124) (P = 0.55).

Conclusions: This study emphasizes the critical influence of prompt engineering on enhancing the diagnostic performance of ChatGPT versions, especially ChatGPT 3.5.
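The prompt-tier evaluation described in the Methodology can be sketched programmatically. Note that the study entered the cases manually into the ChatGPT 3.5 and GPT-4 interfaces; the OpenAI API usage, model identifiers, and prompt wording in the sketch below are assumptions for illustration only, not the study's actual prompts.

    # Illustrative sketch only. The study entered each case manually into the
    # ChatGPT 3.5 and GPT-4 web interfaces; the API usage, model identifiers,
    # and prompt wording here are assumptions.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Prompt tiers of increasing complexity, mirroring the abstract's description
    # (no prompt -> basic task-oriented -> specific role + specific task + exemplar).
    PROMPTS = {
        "none": "",
        "T": "List the most likely diagnosis and a differential diagnosis for this case.",
        "SR+ST+E": (
            "You are an experienced thoracic radiologist. Read the history and "
            "imaging findings, then give the single most likely diagnosis followed "
            "by a ranked differential diagnosis list. Example output: "
            "'Most likely diagnosis: X. Differential diagnosis: Y, Z.'"
        ),
    }

    def ask_case(model: str, prompt: str, case_text: str) -> str:
        """Submit one Case of the Month (history + findings) under a given prompt."""
        messages = []
        if prompt:
            messages.append({"role": "system", "content": prompt})
        messages.append({"role": "user", "content": case_text})
        response = client.chat.completions.create(model=model, messages=messages)
        return response.choices[0].message.content

    # Each response would then be scored against the diagnosis and differential
    # diagnosis listed on the Thoracic Society of Radiology website.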

Keywords: chat generative pre-trained transformer (chatgpt); gpt-4; large language models; prompt engineering; radiology.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Flowchart of the study.
Since March 2012, the Thoracic Society of Radiology has published its publicly accessible Case of the Month series on its website (https://thoracicrad.org). Each case includes a comprehensive medical history, imaging findings, diagnosis, differential diagnosis, and a discussion section. In our study, we combined the History and Findings sections of each case to formulate the questions and took the answer in the Diagnosis section as the correct response. Image credit: Turay Cesur.
Figure 2. Workflow of the input and output process of the study.
The prompts were first entered into ChatGPT 3.5 and GPT-4. All 124 cases were then entered under each prompt, and correct (checkmark) and incorrect (cross) answers were recorded. The Thoracic Radiology Cases and ChatGPT symbols were taken from the original websites. Image credit: Turay Cesur.
Figure 3. Diagnostic accuracy percentages of all prompts in all questions.
P, Physician Prompt; T, Task Prompt; ST, Special Task Prompt; SR+ST, Specific Role + Specific Task Prompt; SR+ST+E, Specific Role + Specific Task + Exemplar Prompt
Figure 4. Boxplots show differential diagnosis scores for all questions of the prompts on GPT-4.
P, Physician Prompt; T, Task Prompt; ST, Special Task Prompt; SR+ST, Specific Role + Specific Task Prompt; SR+ST+E, Specific Role + Specific Task + Exemplar Prompt; x, mode of the differential diagnosis score
Figure 5. Boxplots show differential diagnosis scores (DDx scores) for all questions of the prompts on ChatGPT 3.5.
P, Physician Prompt; T, Task Prompt; ST, Special Task Prompt; SR+ST, Specific Role + Specific Task Prompt; SR+ST+E, Specific Role + Specific Task + Exemplar Prompt; x, mode of the differential diagnosis score
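
The paired accuracy comparison reported in the abstract (for ChatGPT 3.5, 31/124 correct without a prompt versus 70/124 with the most complex prompt) can be sketched with McNemar's test. The discordant-pair counts below are hypothetical, chosen only to be consistent with those marginal totals; the study's per-case outcomes would be needed to reproduce its exact P values.

    # Hypothetical 2x2 table of paired outcomes, consistent with the reported
    # marginals (31/124 correct without a prompt, 70/124 with the most complex
    # prompt); the true discordant counts are not given in the abstract.
    from statsmodels.stats.contingency_tables import mcnemar

    #                        complex prompt correct | complex prompt wrong
    table = [[31,  0],   # no prompt correct
             [39, 54]]   # no prompt wrong

    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    print(f"McNemar statistic = {result.statistic}, p = {result.pvalue:.4g}")

The Kruskal-Wallis and Mann-Whitney U tests mentioned in the abstract apply analogously to the ordinal differential diagnosis scores shown in Figures 4 and 5 (for example, via scipy.stats.kruskal and scipy.stats.mannwhitneyu).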

