Cureus. 2025 Mar 20;17(3):e80874. doi: 10.7759/cureus.80874. eCollection 2025 Mar.

Comparative Analysis of ChatGPT-4o and Gemini Advanced Performance on Diagnostic Radiology In-Training Exams


Kian A Huang et al. Cureus.

Abstract

Background: The increasing integration of artificial intelligence (AI) into medical education and clinical practice has led to growing interest in large language models (LLMs) for diagnostic reasoning and training. LLMs have demonstrated potential in interpreting medical text, summarizing findings, and answering radiology-related questions. However, how accurately newer models analyze both written and image-based radiology content remains uncertain. This study evaluates the performance of OpenAI's Chat Generative Pre-trained Transformer 4o (ChatGPT-4o) and Google DeepMind's Gemini Advanced on the 2022 American College of Radiology (ACR) Diagnostic Radiology In-Training (DXIT) Exam to assess their capabilities across radiological subfields.

Methods: ChatGPT-4o and Gemini Advanced were tested on 106 multiple-choice questions from the 2022 DXIT exam, comprising both image-based and written questions spanning various radiological subspecialties. Performance was compared using overall accuracy, subfield-specific accuracy, and two-proportion z-tests to determine significant differences.

Results: ChatGPT-4o achieved an overall accuracy of 69.8% (74/106), outperforming Gemini Advanced at 60.4% (64/106), although the difference was not statistically significant (p = 0.151). On image-based questions (n = 64), ChatGPT-4o performed better (57.8%, 37/64) than Gemini Advanced (43.8%, 28/64). On written questions (n = 42), the two models demonstrated similar accuracy (88.1% vs. 85.7%). ChatGPT-4o was stronger in specific subfields, such as cardiac and nuclear radiology, but neither model showed consistent superiority across all radiology domains.

Conclusion: LLMs show promise in radiology education and diagnostic reasoning, particularly for text-based assessments. However, limitations such as inconsistent responses and lower accuracy in image interpretation highlight the need for further refinement. Future research should focus on improving AI models' reliability, multimodal capabilities, and integration into radiology training programs.
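The reported p-value can be sanity-checked against the overall counts (74/106 vs. 64/106) with a pooled two-proportion z-test. The sketch below assumes the authors used the conventional pooled-variance formulation; small rounding differences from the published p = 0.151 are expected.

    from math import sqrt
    from scipy.stats import norm

    # Overall counts from the abstract: correct answers / total questions.
    x1, n1 = 74, 106   # ChatGPT-4o
    x2, n2 = 64, 106   # Gemini Advanced

    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                       # pooled proportion under H0: p1 == p2
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    p_value = 2 * norm.sf(abs(z))                        # two-sided p-value

    print(f"z = {z:.2f}, p = {p_value:.3f}")  # z ≈ 1.44, p ≈ 0.150, consistent with the reported 0.151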

Keywords: artificial intelligence in radiology; chat gpt; chatgpt-4o; gemini advanced; radiology medical education.


Conflict of interest statement

Human subjects: All authors have confirmed that this study did not involve human participants or tissue.
Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.
Conflicts of interest: In compliance with the ICMJE uniform disclosure form, all authors declare the following:
Payment/services info: All authors have declared that no financial support was received from any organization for the submitted work.
Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work.
Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.

Figures

Figure 1. Accuracy differences between ChatGPT-4o and Gemini Advanced
Data are shown as the difference in accuracy (ChatGPT-4o accuracy minus Gemini Advanced accuracy) across radiology subfields on the DXIT exam; positive values favor ChatGPT-4o and negative values favor Gemini Advanced. Absent bars indicate an accuracy difference of 0 between the two models.
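For readers who want to reproduce this style of diverging bar chart from their own per-subfield results, a minimal matplotlib sketch follows. The subfield labels and accuracy deltas here are hypothetical placeholders, not values from the study, which reports only the overall and per-question-type figures in the abstract.

    import matplotlib.pyplot as plt

    # Hypothetical per-subfield accuracy differences in percentage points
    # (ChatGPT-4o minus Gemini Advanced); actual values are not in the abstract.
    subfields = ["Cardiac", "Nuclear", "Neuro", "MSK", "Chest"]
    delta = [25.0, 20.0, 0.0, -10.0, 5.0]

    # Color bars by sign: positive favors ChatGPT-4o, negative favors Gemini Advanced.
    colors = ["tab:blue" if d >= 0 else "tab:orange" for d in delta]

    fig, ax = plt.subplots()
    ax.bar(subfields, delta, color=colors)
    ax.axhline(0, color="black", linewidth=0.8)  # zero line: no difference
    ax.set_ylabel("Accuracy difference (percentage points)")
    ax.set_title("ChatGPT-4o minus Gemini Advanced by radiology subfield")
    plt.tight_layout()
    plt.show()

A subfield with a delta of exactly 0 simply renders no bar, which is what the caption's "absent bars" describes.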


