Evaluating the Accuracy and Diagnostic Reasoning of Multimodal Large Language Models in Interpreting Neuroradiology Cases From RadioGraphics

Pae Sun Suh et al. Korean J Radiol. 2026 Mar;27(3):214-226. doi: 10.3348/kjr.2025.1045.

Abstract

Objective: To evaluate the accuracy and reasoning capabilities of multimodal large language models compared with those of neuroradiology subspecialty-trained radiologists in neuroradiology case interpretation.

Materials and methods: This experimental study used 401 custom-made radiologic quizzes derived from articles published in RadioGraphics covering neuroradiology and head and neck topics (October 2020 to February 2024). We prompted the GPT-4 Turbo with Vision (GPT-4V), GPT-4 Omni, Gemini Flash, and Claude models to provide the top three differential diagnoses with a rationale and to describe examination characteristics such as imaging modality, sequence, use of contrast, image plane, and body part. The temperature was set to 0 (T0) and 1 (T1). Two neuroradiologists answered the same questions. The accuracies of the large language models (LLMs) and the neuroradiologists were compared using generalized estimating equations. Three neuroradiologists assessed the rationale provided by the LLMs for their differential diagnoses on four-point scales, separately for specific lesion location and imaging findings, and evaluated the presence of hallucinations and the overall acceptability of the responses.
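The querying protocol is reported only at a high level. As a rough illustration, the Python sketch below shows the kind of multimodal API call involved: one image plus a text prompt, sent at a fixed temperature. The model identifier, prompt wording, and file handling here are illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch of querying a multimodal model for top-3 differential
# diagnoses at a fixed temperature. The prompt text, model name, and image
# path are assumptions for illustration, not the study's exact protocol.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_top3(image_path: str, clinical_history: str, temperature: float) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = (
        f"Clinical history: {clinical_history}\n"
        "List the top three differential diagnoses with a rationale, and "
        "describe the imaging modality, sequence, use of contrast, image "
        "plane, and body part."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",       # assumed identifier for GPT-4V
        temperature=temperature,   # 0 (T0) or 1 (T1) in the study
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```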

Results: Top-3 accuracy (i.e., the correct answer appearing among the top three differential diagnoses) of the LLMs ranged from 29.9% (120 of 401) to 49.4% (198 of 401, obtained with GPT-4V in the T1 setting), while the two neuroradiologists achieved 80.3% (322 of 401) and 68.3% (274 of 401), respectively (P < 0.001). Regarding the rationale for the differential diagnoses, GPT-4V (T1) accurately identified both the specific lesion location and the imaging findings in 30.7% (123 of 401) of cases overall and in 12.9% (16 of 124) of cases without a textual clinical history. Hallucinations occurred in 4.5% (18 of 401) of cases, and only 29.4% (118 of 401) of the LLM-generated analyses were deemed acceptable. GPT-4V (T1) demonstrated high accuracy in identifying the imaging modality (97.4% [800 of 821]) and the scanned body part (92.2% [756 of 820]).
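The top-3 metric reported above can be made concrete with a short sketch: a case counts as a hit if the reference diagnosis appears anywhere among the model's three ranked differentials. The exact-string matching below is an illustrative simplification; the study scored answers by expert review.

```python
def top3_accuracy(cases: list[tuple[str, list[str]]]) -> float:
    """cases: (reference_diagnosis, ranked_differentials) pairs."""
    hits = sum(
        1 for truth, ranked in cases
        if any(truth.lower() == d.lower() for d in ranked[:3])
    )
    return hits / len(cases)

# Example: 198 hits out of 401 cases gives 198/401 ≈ 0.494, i.e., the
# 49.4% top-3 accuracy reported for GPT-4V at temperature 1.
```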

Conclusion: LLMs markedly underperformed neuroradiologists and showed unsatisfactory reasoning for their differential diagnoses, with performance declining further in cases without textual input of clinical history. These findings highlight the limitations of current multimodal LLMs in neuroradiological interpretation and their reliance on text input.

Keywords: Image interpretation; Large language model; Rationale evaluation; Vision capability.


Conflict of interest statement

Chong Hyun Suh, an Assistant to the Editor of the Korean Journal of Radiology, was not involved in the editorial evaluation or decision to publish this article. The remaining authors have declared no conflicts of interest.

Figures

Fig. 1
Fig. 1. Flowchart of the study. LLMs = large language models, GPT-4V = GPT-4 Turbo with Vision, GPT-4o = GPT-4 Omni
Fig. 2
Fig. 2. Matrix evaluation using a four-point scale for specific lesion location and imaging findings, assessing GPT-4V’s (temperature 1) reasoning in image interpretation. Yellow squares indicate precise interpretation (correct location with correct or partially correct imaging findings), and red squares indicate inaccurate interpretation (both location and imaging findings incorrect or not described). A: The matrix of overall image interpretation shows 30.7% precise and 35.9% inaccurate interpretation. B, C: Among cases answered correctly, 51.0% showed precise interpretation (B); among cases answered incorrectly, 63.1% showed inaccurate interpretation (C). D: Among cases whose text input included the clinical history, 38.6% showed precise interpretation; however, the lesion location and imaging findings were not described in 31.0% and 9.0% of cases, respectively. E: Among cases whose text input omitted the clinical history, 58.1% showed inaccurate interpretation.
Fig. 3
Fig. 3. An example of accurate interpretation by GPT-4V with T1. This radiologic quiz was created based on a study by Katsura et al. [34], featuring a 27-year-old female patient with a history of surgical resection and radiation therapy with temozolomide for anaplastic pleomorphic xanthoastrocytoma in the right temporal lobe. The correct diagnosis is “tumor recurrence.” GPT-4V (T1) listed tumor recurrence as the first differential diagnosis, with correct identification of the specific lesion location and imaging findings. Yellow highlights indicate the described lesion location, and green highlights indicate imaging findings. GPT-4V = GPT-4 Turbo with Vision, T1 = temperature 1
Fig. 4
Fig. 4. An example of inaccurate interpretation by GPT-4V with T1. This radiologic quiz was created based on an article by Kurokawa et al. [35]. This case features a 4-year-old male patient presenting with symptoms of abdominal pain and weight loss. The correct diagnosis is “ALK-positive histiocytosis.” GPT-4V (T1) provided abdominal malignancies as differential diagnoses based on the patient’s symptoms, with incorrect or unidentifiable lesion locations and incorrect imaging findings. Yellow highlights indicate the described lesion location, and green highlights indicate imaging findings. GPT-4V = GPT-4 Turbo with Vision, T1 = temperature 1
Fig. 5
Fig. 5. Accuracy for describing imaging information using GPT-4V with temperature 1. GPT-4V = GPT-4 Turbo with Vision

References

    1. Akinci D’Antonoli T, Stanzione A, Bluethgen C, Vernuccio F, Ugga L, Klontzas ME, et al. Large language models in radiology: fundamentals, applications, ethical considerations, risks, and future directions. Diagn Interv Radiol. 2024;30:80–90.
    2. Gertz RJ, Bunck AC, Lennartz S, Dratsch T, Iuga AI, Maintz D, et al. GPT-4 for automated determination of radiologic study and protocol based on radiology request forms: a feasibility study. Radiology. 2023;307:e230877.
    3. Suh PS, Shim WH, Suh CH, Heo H, Park CR, Eom HJ, et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro Vision using image inputs from diagnosis please cases. Radiology. 2024;312:e240273.
    4. Horiuchi D, Tatekawa H, Oura T, Oue S, Walston SL, Takita H, et al. Comparing the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in challenging neuroradiology cases. Clin Neuroradiol. 2024;34:779–787.
    5. Horiuchi D, Tatekawa H, Oura T, Shimono T, Walston SL, Takita H, et al. ChatGPT’s diagnostic performance based on textual vs. visual information compared to radiologists’ diagnostic performance in musculoskeletal radiology. Eur Radiol. 2025;35:506–516.
