Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines

Jonas Wihl¹, Enrike Rosenkranz¹, Severin Schramm¹, Cornelius Berberich¹, Michael Griessmair¹,², Piotr Woźnicki³, Francisco Pinto³, Sebastian Ziegelmayer⁴, Lisa C Adams⁴, Keno K Bressem⁵, Jan S Kirschke¹, Claus Zimmer¹, Benedikt Wiestler¹,⁶, Dennis Hedderich¹, Su Hwan Kim⁷

Affiliations

¹ Department of Diagnostic and Interventional Neuroradiology, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany.
² Department of Diagnostic, Interventional and Pediatric Radiology, Inselspital Bern, University of Bern, Bern, Switzerland.
³ Smart Reporting GmbH, Munich, Germany.
⁴ Department of Diagnostic and Interventional Radiology, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany.
⁵ Department of Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, School of Medicine and Health, Technical University of Munich, Munich, Germany.
⁶ AI for Image-Guided Diagnosis and Therapy, School of Medicine and Health, Technical University of Munich, Munich, Germany.
⁷ Department of Diagnostic and Interventional Neuroradiology, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany. suhwan.kim@tum.de.

PMID: 40536631
PMCID: PMC12179022
DOI: 10.1186/s41747-025-00600-2

Jonas Wihl et al. Eur Radiol Exp. 2025 Jun 19;9(1):61. doi: 10.1186/s41747-025-00600-2.

Abstract

Background: We evaluated the impact of an annotation guideline on the performance of large language models (LLMs) in extracting data from stroke computed tomography (CT) reports.

Methods: The performance of GPT-4o and Llama-3.3-70B in extracting ten imaging findings from stroke CT reports was assessed in two datasets from a single academic stroke center. Dataset A (n = 200) was a stratified cohort including various pathological findings, whereas dataset B (n = 100) was a consecutive cohort. Initially, an annotation guideline providing clear data extraction instructions was designed based on a review of cases with inter-annotator disagreements in dataset A. For each LLM, data extraction was performed under two conditions: with the annotation guideline included in the prompt and without it.
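
The two prompting conditions can be pictured with a minimal sketch. The finding names, the prompt wording, and the helpers `build_prompt` and `parse_response` are hypothetical placeholders chosen for illustration; they are not the study's actual prompts or its ten parameters.

```python
import json
from typing import Dict, Optional

# Hypothetical placeholder names; the study's ten findings are not listed in this record.
FINDINGS = ["finding_a", "finding_b"]


def build_prompt(report: str, guideline: Optional[str] = None) -> str:
    """Assemble an extraction prompt, with or without an annotation guideline."""
    parts = [
        "Extract the following findings from the CT report.",
        "Answer with a JSON object mapping each finding to true or false.",
        "Findings: " + ", ".join(FINDINGS),
    ]
    if guideline:
        # Condition "with guideline": the guideline text is placed in the prompt.
        parts.append("Annotation guideline:\n" + guideline)
    parts.append("Report:\n" + report)
    return "\n\n".join(parts)


def parse_response(raw: str) -> Dict[str, bool]:
    """Parse a JSON answer; findings missing from the answer default to False (an assumption)."""
    data = json.loads(raw)
    return {name: bool(data.get(name, False)) for name in FINDINGS}
```

Calling `build_prompt(report)` and `build_prompt(report, guideline)` on the same report yields the paired inputs whose outputs can then be compared against the reference annotations.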

Results: GPT-4o consistently demonstrated superior performance over Llama-3.3-70B under identical conditions, with micro-averaged precision ranging from 0.83 to 0.95 for GPT-4o and from 0.65 to 0.86 for Llama-3.3-70B. Across both models and both datasets, incorporating the annotation guideline into the LLM input resulted in higher precision rates, while recall rates largely remained stable. In dataset B, the precision of GPT-4o and Llama-3.3-70B improved from 0.83 to 0.95 and from 0.87 to 0.94, respectively. Overall classification performance with and without the annotation guideline was significantly different in five out of six conditions.
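
For readers unfamiliar with micro-averaging, the following is a minimal sketch of how such metrics are commonly computed: true positives, false positives, and false negatives are pooled across all findings before precision and recall are derived. The function name `micro_average` and the input layout are assumptions made for illustration, not the authors' code.

```python
from typing import Dict, List, Tuple


def micro_average(labels: Dict[str, List[Tuple[bool, bool]]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) pooled over all findings.

    `labels` maps each finding name to a list of (reference, prediction) pairs.
    """
    tp = fp = fn = 0
    for pairs in labels.values():
        for truth, pred in pairs:
            if pred and truth:
                tp += 1
            elif pred and not truth:
                fp += 1
            elif truth and not pred:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because counts are pooled, frequent findings weigh more heavily in the result than rare ones, which is the usual reason to report micro-averaged rather than per-finding averages.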

Conclusion: GPT-4o and Llama-3.3-70B show promising performance in extracting imaging findings from stroke CT reports, although GPT-4o consistently outperformed Llama-3.3-70B. We also provide evidence that well-defined annotation guidelines can enhance LLM data extraction accuracy.

Relevance statement: Annotation guidelines can improve the accuracy of LLMs in extracting findings from radiological reports, potentially optimizing data extraction for specific downstream applications.

Key points: LLMs have utility in data extraction from radiology reports, but the role of annotation guidelines remains underexplored. Data extraction accuracy from stroke CT reports by GPT-4o and Llama-3.3-70B improved when well-defined annotation guidelines were incorporated into the model prompt. Well-defined annotation guidelines can improve the accuracy of LLMs in extracting imaging findings from radiological reports.

Keywords: Artificial intelligence; Information storage and retrieval; Large language models; Stroke; Tomography (x-ray computed).

Conflict of interest statement

Declarations. Ethics approval and consent to participate: This retrospective study was approved by the Institutional Review Board of the TUM. Consent for publication: The need for informed consent was waived (2024-125-S-NP, May 29, 2024). Competing interests: FP is a full-time employee of Smart Reporting GmbH, a provider of radiology reporting software. PW is a consultant for Smart Reporting GmbH.

Figures

**Fig. 1**
Study design. Initially, two raters annotated dataset A using a preliminary annotation guideline with a few general instructions. Based on guideline deficiencies uncovered through a review of cases with inter-rater disagreement, an addendum was appended to the original document, forming the final annotation guideline. The data extraction performance of GPT-4o and Llama-3.3-70B with and without the annotation guideline was evaluated in dataset A and additionally in another dataset (dataset B) that was not used to formulate the annotation guideline. At the bottom, a fictional CT report in English is shown along with the data parameters extracted from it in JSON format to illustrate the methodology.

**Fig. 2**
Data extraction performance of GPT-4o and Llama-3.3-70B across all parameters. Metrics for GPT-4o queries with a temperature setting of 1 were calculated based on the mode across three repetitions, whereas the remaining queries were run only once (with a temperature setting of 0.0). Error bars indicate 95% confidence intervals. In total, 1.7% (34/2,000) and 1.4% (14/1,000) of data points were excluded from dataset A and dataset B, respectively, as the report text indicated diagnostic uncertainty without a clear positive or negative tendency (expressions such as “possible”, “DDx”). Precision: positive predictive value. Recall: sensitivity. F1-Score: harmonic mean of precision and recall.
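
The two preprocessing steps named in this caption, taking the mode across three repetitions and excluding uncertain data points, can be sketched as follows. The helper names and the use of `None` to mark an uncertain reference label are assumptions for illustration only.

```python
from collections import Counter
from typing import List, Optional, Tuple


def mode_vote(answers: List[bool]) -> bool:
    """Return the most frequent answer; three boolean answers always have a strict majority."""
    return Counter(answers).most_common(1)[0][0]


def drop_uncertain(
    pairs: List[Tuple[Optional[bool], bool]]
) -> List[Tuple[bool, bool]]:
    """Keep only (reference, prediction) pairs whose reference label is not uncertain (None)."""
    return [(truth, pred) for truth, pred in pairs if truth is not None]
```

For example, `mode_vote([True, False, True])` returns `True`, and a pair whose reference is `None` is removed before any metric is computed.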
