Front Oncol. 2023 Sep 14;13:1265024. doi: 10.3389/fonc.2023.1265024. eCollection 2023.

Benchmarking ChatGPT-4 on a radiation oncology in-training exam and Red Journal Gray Zone cases: potentials and challenges for AI-assisted medical education and decision making in radiation oncology

Yixing Huang et al. Front Oncol. 2023.

Abstract

Purpose: The potential of large language models in medical education and decision-making has been demonstrated by their decent scores on medical exams such as the United States Medical Licensing Examination (USMLE) and on benchmarks such as MedQA. This work evaluates the performance of ChatGPT-4 in the specialized field of radiation oncology.

Methods: The 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases are used to benchmark the performance of ChatGPT-4. The TXIT exam contains 300 questions covering various topics in radiation oncology. The 2022 Gray Zone collection contains 15 complex clinical cases.
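As a concrete illustration of this style of benchmarking, the snippet below is a minimal sketch (not the authors' actual pipeline) of posing a TXIT-style multiple-choice question to GPT-4 through the OpenAI Python API; the prompt wording, the example question, and the single-letter answer parsing are hypothetical.

    # Minimal sketch of multiple-choice benchmarking via the OpenAI Python API.
    # NOT the authors' code: prompt wording, example question, and answer
    # parsing are hypothetical.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask_multiple_choice(question: str, options: dict[str, str]) -> str:
        """Send one exam question and return the model's single-letter answer."""
        prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "Answer the multiple-choice question with a single letter."},
                {"role": "user", "content": prompt},
            ],
            temperature=0,  # deterministic answers make scoring reproducible
        )
        return response.choices[0].message.content.strip()[0]  # e.g. "B"

    # Exam accuracy would then be (number correct) / (number of questions).
    print(ask_multiple_choice(
        "Which particle is emitted in beta-minus decay?",
        {"A": "Proton", "B": "Electron", "C": "Alpha particle", "D": "Positron"},
    ))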

Results: On the TXIT exam, ChatGPT-3.5 and ChatGPT-4 achieved scores of 62.05% and 78.77%, respectively, highlighting the advantage of the newer ChatGPT-4 model. Based on the TXIT exam, ChatGPT-4's strong and weak areas in radiation oncology can be identified to some extent. Specifically, ChatGPT-4 demonstrates better knowledge of statistics, CNS & eye, pediatrics, biology, and physics than of bone & soft tissue and gynecology, as measured by the ACR knowledge domains. Regarding clinical care paths, ChatGPT-4 performs better in diagnosis, prognosis, and toxicity than in brachytherapy and dosimetry. It lacks proficiency in the in-depth details of clinical trials. For the Gray Zone cases, ChatGPT-4 suggests a personalized treatment approach for each case with high correctness and comprehensiveness. Importantly, it offers novel treatment aspects for many cases that were not suggested by any of the human experts.

Conclusion: Both evaluations demonstrate the potential of ChatGPT-4 in medical education for the general public and for cancer patients, as well as its potential to aid clinical decision-making, although it has limitations in certain domains. Owing to the risk of hallucination, content generated by models such as ChatGPT must be verified for accuracy.

Keywords: Gray Zone; artificial intelligence; clinical decision support (CDS); large language model; natural language processing; radiotherapy.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Two exemplary questions (Question 1 and Question 116) from the ACR TXIT exam.
Figure 2
An example of question-and-answer using ChatGPT for the ACR TXIT exam. ChatGPT-3.5 and ChatGPT-4 both provide the correct answer. However, ChatGPT-3.5 hallucinates the results of the NSABP B-51/RTOG 1304 trial (31), as the final findings are not yet publicly available.
Figure 3
The accuracy distribution for ChatGPT-3.5 and ChatGPT-4 by question domain. The absolute number of correct answers for each domain is marked at the top of each bar. Domain numbers 1-13 correspond to statistics, bone & soft tissue, breast, CNS & eye, gastrointestinal, genitourinary, gynecology, head & neck & skin, lung & mediastinum, lymphoma & leukemia, pediatrics, biology, and physics, respectively. The X-axis labels are shifted to save space.
Figure 4
The accuracy distribution for ChatGPT-3.5 and ChatGPT-4 by clinical care category. The absolute number of correct answers for each category is marked at the top of each bar. Category numbers 1-8 correspond to diagnosis, treatment decision, treatment planning, prognosis, toxicity, brachytherapy, dosimetry, and trial/study, respectively. The X-axis labels are shifted to save space.
Figure 5
An example of ChatGPT-4's recommendation for Gray Zone case #8 (38): a viewpoint on isolated contralateral axillary lymph node involvement by breast cancer: regional recurrence or distant metastasis? Note that the local recurrence statement in ChatGPT-4's summary is incorrect.
Figure 6
The incorrect exponential decay values calculated by ChatGPT-3.5 and ChatGPT-4 for radioactive decay in Question 263 of the ACR TXIT exam.
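For reference, the correct arithmetic follows the standard decay law N(t)/N0 = exp(-λt) with λ = ln 2 / T1/2. The snippet below is a minimal sketch with hypothetical numbers, since the actual values of Question 263 are not reproduced here.

    # Correct exponential-decay arithmetic for comparison. The isotope,
    # half-life, and elapsed time are illustrative, not Question 263's values.
    import math

    def remaining_fraction(t: float, half_life: float) -> float:
        """Fraction of initial activity remaining after time t."""
        decay_constant = math.log(2) / half_life  # lambda = ln(2) / T_1/2
        return math.exp(-decay_constant * t)

    # Example: Ir-192 (half-life ~73.8 days) after 30 days.
    print(f"{remaining_fraction(30, 73.8):.3f}")  # ~0.754 of initial activity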

References

    1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst (2017) 30:1–13. Available at: https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd05....
    2. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst (2020) 33:1877–901.
    3. Wei J, Wang X, Schuurmans D, Bosma M, Chi E, Le Q, et al. Chain of thought prompting elicits reasoning in large language models. NeurIPS (2022) 1–14.
    4. Thapa S, Adhikari S. ChatGPT, Bard, and large language models for biomedical research: opportunities and pitfalls. Ann Biomed Eng (2023) 1–5. doi: 10.1007/s10439-023-03284-0.
    5. Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023) 1–17.
