Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines
- PMID: 40536631
- PMCID: PMC12179022
- DOI: 10.1186/s41747-025-00600-2
Abstract
Background: This study evaluated the impact of an annotation guideline on the performance of large language models (LLMs) in extracting data from stroke computed tomography (CT) reports.
Methods: The performance of GPT-4o and Llama-3.3-70B in extracting ten imaging findings from stroke CT reports was assessed in two datasets from a single academic stroke center. Dataset A (n = 200) was a stratified cohort including various pathological findings, whereas dataset B (n = 100) was a consecutive cohort. Initially, an annotation guideline providing clear data extraction instructions was designed based on a review of cases with inter-annotator disagreements in dataset A. For each LLM, data extraction was performed under two conditions: with the annotation guideline included in the prompt and without it.
Results: GPT-4o consistently demonstrated superior performance over Llama-3.3-70B under identical conditions, with micro-averaged precision ranging from 0.83 to 0.95 for GPT-4o and from 0.65 to 0.86 for Llama-3.3-70B. Across both models and both datasets, incorporating the annotation guideline into the LLM input resulted in higher precision, while recall largely remained stable. In dataset B, the precision of GPT-4o and Llama-3.3-70B improved from 0.83 to 0.95 and from 0.65 to 0.86, respectively. Overall classification performance with and without the annotation guideline was significantly different in five out of six conditions.
Conclusion: GPT-4o and Llama-3.3-70B show promising performance in extracting imaging findings from stroke CT reports, although GPT-4o consistently outperformed Llama-3.3-70B. We also provide evidence that well-defined annotation guidelines can enhance LLM data extraction accuracy.
Relevance statement: Annotation guidelines can improve the accuracy of LLMs in extracting findings from radiological reports, potentially optimizing data extraction for specific downstream applications.
Key points: LLMs have utility in data extraction from radiology reports, but the role of annotation guidelines remains underexplored. Data extraction accuracy from stroke CT reports by GPT-4o and Llama-3.3-70B improved when well-defined annotation guidelines were incorporated into the model prompt. Well-defined annotation guidelines can improve the accuracy of LLMs in extracting imaging findings from radiological reports.
Keywords: Artificial intelligence; Information storage and retrieval; Large language models; Stroke; Tomography (x-ray computed).
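The micro-averaged precision and recall reported in the Results pool true positives, false positives, and false negatives across all findings and all reports before dividing. The following is a minimal sketch of that calculation, not the authors' evaluation code. The finding names (`hemorrhage`, `vessel_occlusion`), the dictionary-per-report layout, and the function name `micro_precision_recall` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def micro_precision_recall(
    gold: List[Dict[str, bool]], pred: List[Dict[str, bool]]
) -> Tuple[float, float]:
    """Pool counts over every (report, finding) pair, then compute the ratios."""
    tp = fp = fn = 0
    for gold_report, pred_report in zip(gold, pred):
        for finding, truth in gold_report.items():
            # Assumption: a finding missing from the model output counts as absent.
            guess = pred_report.get(finding, False)
            if guess and truth:
                tp += 1
            elif guess and not truth:
                fp += 1
            elif truth and not guess:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: two reports, two findings each.
gold = [
    {"hemorrhage": True, "vessel_occlusion": False},
    {"hemorrhage": False, "vessel_occlusion": True},
]
pred = [
    {"hemorrhage": True, "vessel_occlusion": True},   # one false positive
    {"hemorrhage": False, "vessel_occlusion": True},
]

precision, recall = micro_precision_recall(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the single false positive lowers precision while recall is unaffected, which is the same pattern the abstract describes: a guideline that suppresses over-calling raises precision and leaves recall largely stable.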
© 2025. The Author(s).
Conflict of interest statement
Declarations. Ethics approval and consent to participate: This retrospective study was approved by the Institutional Review Board of the TUM. Consent for publication: The need for informed consent was waived (2024-125-S-NP, May 29, 2024). Competing interests: FP is a full-time employee of Smart Reporting GmbH, a provider of radiology reporting software. PW is a consultant for Smart Reporting GmbH.