Cureus. 2024 May 9;16(5):e60009.
doi: 10.7759/cureus.60009. eCollection 2024 May.

Optimizing Diagnostic Performance of ChatGPT: The Impact of Prompt Engineering on Thoracic Radiology Cases

Turay Cesur et al. Cureus. 2024.

Abstract

Background: Recent studies have highlighted the diagnostic performance of ChatGPT 3.5 and GPT-4 in a text-based format, demonstrating their radiological knowledge across different areas. Our objective was to investigate the impact of prompt engineering on the diagnostic performance of ChatGPT 3.5 and GPT-4 in diagnosing thoracic radiology cases, highlighting how prompt complexity influences model performance.

Methodology: We conducted a retrospective cross-sectional study using 124 publicly available Case of the Month examples from the Thoracic Society of Radiology website. We initially input the cases into both ChatGPT versions without a prompt. We then employed five different prompts, ranging from basic task-oriented to complex role-specific formulations, to measure the diagnostic accuracy of each version. The differential diagnosis lists generated by the models were compared against the radiological diagnoses listed on the Thoracic Society of Radiology website, using a scoring system to assess accuracy comprehensively. Diagnostic accuracy and differential diagnosis scores were analyzed using the McNemar, chi-square, Kruskal-Wallis, and Mann-Whitney U tests.

Results: Without any prompt, ChatGPT 3.5's accuracy was 25% (31/124), which increased to 56.5% (70/124) with the most complex prompt (P < 0.001). GPT-4 showed a higher baseline accuracy of 53.2% (66/124) without prompting, which increased to 59.7% (74/124) with complex prompts (P = 0.09). Notably, there was no statistically significant difference in peak performance between ChatGPT 3.5 (70/124) and GPT-4 (74/124) (P = 0.55).

Conclusions: This study emphasizes the critical influence of prompt engineering on enhancing the diagnostic performance of ChatGPT versions, especially ChatGPT 3.5.
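The prompt-tier evaluation described in the Methodology can be sketched programmatically. Note that the study entered the cases manually into the ChatGPT 3.5 and GPT-4 interfaces; the OpenAI API usage, model identifiers, and prompt wording in the sketch below are assumptions for illustration only, not the study's actual prompts.

    # Illustrative sketch only. The study entered each case manually into the
    # ChatGPT 3.5 and GPT-4 web interfaces; the API usage, model identifiers,
    # and prompt wording here are assumptions.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Prompt tiers of increasing complexity, mirroring the abstract's description
    # (no prompt -> basic task-oriented -> specific role + specific task + exemplar).
    PROMPTS = {
        "none": "",
        "T": "List the most likely diagnosis and a differential diagnosis for this case.",
        "SR+ST+E": (
            "You are an experienced thoracic radiologist. Read the history and "
            "imaging findings, then give the single most likely diagnosis followed "
            "by a ranked differential diagnosis list. Example output: "
            "'Most likely diagnosis: X. Differential diagnosis: Y, Z.'"
        ),
    }

    def ask_case(model: str, prompt: str, case_text: str) -> str:
        """Submit one Case of the Month (history + findings) under a given prompt."""
        messages = []
        if prompt:
            messages.append({"role": "system", "content": prompt})
        messages.append({"role": "user", "content": case_text})
        response = client.chat.completions.create(model=model, messages=messages)
        return response.choices[0].message.content

    # Each response would then be scored against the diagnosis and differential
    # diagnosis listed on the Thoracic Society of Radiology website.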

Keywords: chat generative pre-trained transformer (chatgpt); gpt-4; large language models; prompt engineering; radiology.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Flowchart of the study.
Since March 2012, the Thoracic Society of Radiology has published its publicly accessible Case of the Month series on its website (https://thoracicrad.org). Each case includes a comprehensive medical history, imaging findings, diagnosis, differential diagnosis, and a discussion section. In our study, we combined the History and Findings sections of each case to formulate the questions and took the answer in the Diagnosis section as the correct response. Image credit: Turay Cesur.
Figure 2. Workflow of the input and output process of the study.
The prompts were first entered into ChatGPT 3.5 and GPT-4. All 124 cases were then entered under each prompt, and correct (checkmark) and incorrect (cross) answers were recorded. The Thoracic Radiology Cases and ChatGPT symbols were taken from the original websites. Image credit: Turay Cesur.
Figure 3. Diagnostic accuracy percentages of all prompts in all questions.
P, Physician Prompt; T, Task Prompt; ST, Special Task Prompt; SR+ST, Specific Role + Specific Task Prompt; SR+ST+E, Specific Role + Specific Task + Exemplar Prompt
Figure 4. Boxplots show differential diagnosis scores for all questions of the prompts on GPT-4.
P, Physician Prompt; T, Task Prompt; ST, Special Task Prompt; SR+ST, Specific Role + Specific Task Prompt; SR+ST+E, Specific Role + Specific Task + Exemplar Prompt; x, mode of the differential diagnosis score
Figure 5. Boxplots show differential diagnosis scores (DDx scores) for all questions of the prompts on ChatGPT 3.5.
P, Physician Prompt; T, Task Prompt; ST, Special Task Prompt; SR+ST, Specific Role + Specific Task Prompt; SR+ST+E, Specific Role + Specific Task + Exemplar Prompt; x, mode of the differential diagnosis score
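
The paired accuracy comparison reported in the abstract (for ChatGPT 3.5, 31/124 correct without a prompt versus 70/124 with the most complex prompt) can be sketched with McNemar's test. The discordant-pair counts below are hypothetical, chosen only to be consistent with those marginal totals; the study's per-case outcomes would be needed to reproduce its exact P values.

    # Hypothetical 2x2 table of paired outcomes, consistent with the reported
    # marginals (31/124 correct without a prompt, 70/124 with the most complex
    # prompt); the true discordant counts are not given in the abstract.
    from statsmodels.stats.contingency_tables import mcnemar

    #                        complex prompt correct | complex prompt wrong
    table = [[31,  0],   # no prompt correct
             [39, 54]]   # no prompt wrong

    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    print(f"McNemar statistic = {result.statistic}, p = {result.pvalue:.4g}")

The Kruskal-Wallis and Mann-Whitney U tests mentioned in the abstract apply analogously to the ordinal differential diagnosis scores shown in Figures 4 and 5 (for example, via scipy.stats.kruskal and scipy.stats.mannwhitneyu).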

