Comparative Study
Jpn J Radiol. 2025 Sep;43(9):1445-1455.
doi: 10.1007/s11604-025-01799-1. Epub 2025 May 14.

Comparative performance of large language models in structuring head CT radiology reports: multi-institutional validation study in Japan

Hirotaka Takita et al. Jpn J Radiol. 2025 Sep.

Abstract

Purpose: To compare the diagnostic performance of three proprietary large language models (LLMs), Claude, GPT, and Gemini, in structuring free-text Japanese radiology reports for intracranial hemorrhage and skull fractures, and to assess the impact of three different prompting approaches on model accuracy.

Materials and methods: In this retrospective study, head CT reports from the Japan Medical Imaging Database between 2018 and 2023 were collected. Two board-certified radiologists established the ground truth regarding intracranial hemorrhage and skull fractures through independent review and consensus. Each radiology report was analyzed by three LLMs using three prompting strategies: Standard, Chain of Thought, and Self Consistency. Diagnostic performance (accuracy, precision, recall, and F1-score) was calculated for each LLM-prompt combination and compared using McNemar's tests with Bonferroni correction. Misclassified cases underwent qualitative error analysis.
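As an illustration of the evaluation described above, the following minimal Python sketch computes accuracy, precision, recall, and F1-score for one binary label (for example, intracranial hemorrhage present or absent) and compares two LLM-prompt combinations with McNemar's test followed by a Bonferroni adjustment. It is not the authors' code. The function names, the toy labels, and the choice of the exact (binomial) form of McNemar's test are assumptions made for illustration; the abstract does not state which variant was used.

```python
from math import comb


def binary_metrics(truth, pred):
    """Accuracy, precision, recall, and F1 for paired binary labels (1 = finding present)."""
    tp = sum(1 for t, p in zip(truth, pred) if t and p)
    tn = sum(1 for t, p in zip(truth, pred) if not t and not p)
    fp = sum(1 for t, p in zip(truth, pred) if not t and p)
    fn = sum(1 for t, p in zip(truth, pred) if t and not p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(truth),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def mcnemar_exact(truth, pred_a, pred_b):
    """Two-sided exact McNemar p-value (assumed variant) on the discordant pairs.

    b counts reports where A is correct and B is wrong; c counts the reverse.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    b = sum(1 for t, a, o in zip(truth, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(truth, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree in correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)


if __name__ == "__main__":
    # Toy labels only; these are not data from the study.
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    model_a = [1, 1, 1, 1, 0, 0, 0, 0]
    model_b = [1, 1, 1, 0, 0, 0, 0, 1]
    print(binary_metrics(truth, model_b))
    p = mcnemar_exact(truth, model_a, model_b)
    print(p, bonferroni(p, n_comparisons=3))
```

McNemar's test suits this design because every LLM-prompt combination is scored on the same reports, so only the reports on which two combinations disagree in correctness carry information about their difference.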

Results: A total of 3949 head CT reports from 3949 patients (mean age 59 ± 25 years, 56.2% male) were enrolled. Across all institutions, 856 patients (21.6%) had intracranial hemorrhage and 264 patients (6.6%) had skull fractures. All nine LLM-prompt combinations achieved very high accuracy. Claude demonstrated significantly higher accuracy for intracranial hemorrhage than GPT and Gemini, and also outperformed Gemini for skull fractures (p < 0.0001). Gemini's performance improved notably with Chain of Thought prompting. Error analysis revealed common challenges including ambiguous phrases and findings unrelated to intracranial hemorrhage or skull fractures, underscoring the importance of careful prompt design.

Conclusion: All three proprietary LLMs exhibited strong performance in structuring free-text head CT reports for intracranial hemorrhage and skull fractures. While the choice of prompting method influenced accuracy, all models demonstrated robust potential for clinical and research applications. Future work should refine the prompts and validate these approaches in prospective, multilingual settings.

Keywords: Free-text radiology report; Intracranial hemorrhage; Japan medical imaging database; Large language model; Skull fracture; Structured radiology report.

Conflict of interest statement

Declarations. Conflict of interest: The authors have no relevant financial or non-financial interests to disclose. Ethical approval: The study protocol was approved by the Ethical Committee of Juntendo University Graduate School of Medicine and Osaka Metropolitan University, and the study was conducted in accordance with the Declaration of Helsinki. Informed consent: The requirement for informed consent was waived because, in the Japan Medical Imaging Database, radiology reports are completely anonymized and there is no concern about identifying personal information.

Figures

Fig. 1
Overview of the study design for evaluating large language model (LLM) performance in structured reporting of head CT findings. Free-text radiology reports were collected from the Japan Medical Imaging Database (J-MID), focusing on emergency department head CT reports from 2018 to 2023. LLM analysis used three different models (Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro) and three prompt types (Standard, Chain of Thought, and Self Consistency), with each combination executed three times for structured reporting of intracranial hemorrhage and skull fracture. LLM performance was evaluated through accuracy metrics, statistical analysis across combinations, and error analysis.
Fig. 2
Flow diagram of head CT radiology report selection. From an initial dataset of 3,993 radiology reports, 44 reports were excluded based on three criteria: 10 reports had no descriptions, 9 reports lacked hemorrhage or fracture location information, and 25 reports had no interpretation of high attenuation areas. The final analysis included 3,949 radiology reports from 3,949 unique patients.