Automatic structuring of radiology reports with on-premise open-source large language models

Affiliations

¹ Department of Diagnostic and Interventional Radiology, University Hospital Würzburg, Würzburg, Germany. woznicki_p@ukw.de.
² Department of Diagnostic and Interventional Radiology, University Hospital Würzburg, Würzburg, Germany.
³ Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany.
⁴ Department of Internal Medicine III, Heidelberg University Hospital, Heidelberg, Germany.
⁵ DZHK (German Centre for Cardiovascular Research), Partner Site Heidelberg/Mannheim, Heidelberg, Germany.
⁶ Department of Internal Medicine I, University Hospital Carl Gustav Carus, Technical University Dresden, Dresden, Germany.
⁷ Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany.
⁸ Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany.
⁹ Institute of Pathology, University Medical Center Mainz, Mainz, Germany.
¹⁰ Institute of Radiology and Nuclear Medicine, Cantonal Hospital Baselland, Liestal, Switzerland.
¹¹ Department of Diagnostic and Interventional Radiology, University of Cologne, Cologne, Germany.
¹² Department of Radiology, University Hospital of Frankfurt, Frankfurt, Germany.

PMID: 39390261
PMCID: PMC11913902
DOI: 10.1007/s00330-024-11074-y

Piotr Woźnicki et al. Eur Radiol. 2025 Apr;35(4):2018-2029. doi: 10.1007/s00330-024-11074-y. Epub 2024 Oct 10.

Abstract

Objectives: Structured reporting enhances comparability, readability, and content detail. Large language models (LLMs) could convert free text into structured data without disrupting radiologists' reporting workflow. This study evaluated an on-premise, privacy-preserving LLM for automatically structuring free-text radiology reports.

Materials and methods: We developed an approach to controlling the LLM output that ensures the validity and completeness of structured reports produced by a locally hosted Llama-2-70B-chat model. A dataset of de-identified narrative chest radiograph (CXR) reports was compiled retrospectively. It included 202 English reports from the publicly available MIMIC-CXR dataset and 197 German reports from our university hospital. A senior radiologist prepared a detailed, fully structured reporting template with 48 question-answer pairs. All reports were independently structured by the LLM and two human readers. Bayesian inference (Markov chain Monte Carlo sampling) was used to estimate the distributions of the Matthews correlation coefficient (MCC), with [-0.05, 0.05] as the region of practical equivalence (ROPE).
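For readers unfamiliar with the metrics, the following minimal sketch shows how the MCC and the F1 score are computed from binary labels for a single finding. The function names and the toy labels are illustrative assumptions and are not taken from the study; returning 0.0 when a denominator is zero is a common convention, not necessarily the one the authors used.

```python
from math import sqrt


def confusion_counts(truth, pred):
    """Count TP, TN, FP, FN for paired binary labels (1 = finding present)."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if any marginal count is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1(tp, fp, fn):
    """F1 score (harmonic mean of precision and recall); 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels for one hypothetical finding (not study data).
reference = [1, 1, 1, 0, 0, 0, 0, 0]
extracted = [1, 1, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(reference, extracted)
print(f"MCC = {mcc(tp, tn, fp, fn):.3f}, F1 = {f1(tp, fp, fn):.3f}")
```

Unlike F1, the MCC also accounts for true negatives, which matters for findings that are absent in most reports.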

Results: The LLM generated valid structured reports in all cases, achieving an average MCC of 0.75 (94% HDI: 0.70-0.80) and F1 score of 0.70 (0.70-0.80) for English, and 0.66 (0.62-0.70) and 0.68 (0.64-0.72) for German reports, respectively. The MCC differences between LLM and humans were within ROPE for both languages: 0.01 (-0.05 to 0.07), 0.01 (-0.05 to 0.07) for English, and -0.01 (-0.07 to 0.05), 0.00 (-0.06 to 0.06) for German, indicating approximately comparable performance.

Conclusion: Locally hosted, open-source LLMs can automatically structure free-text radiology reports with approximately human accuracy. However, the understanding of semantics varied across languages and imaging findings.

Key points:

Question: Why has structured reporting not been widely adopted in radiology despite clear benefits, and how can we improve this?

Findings: A locally hosted large language model successfully structured narrative reports, showing variation between languages and findings.

Critical relevance: Structured reporting provides many benefits, but its integration into the clinical routine is limited. Automating the extraction of structured information from radiology reports enables the capture of structured data while allowing the radiologist to maintain their reporting workflow.

Keywords: Chest radiography; Large language models; Structured reporting.

Conflict of interest statement

Compliance with ethical standards. Guarantor: The scientific guarantors of this publication are P.W. and F.C.L. Conflict of interest: The authors of this manuscript declare relationships with the following companies: P.W. is a consultant at Smart Reporting GmbH. D.T. holds shares in StratifAI GmbH and has received honoraria for lectures from Bayer AG. B.B. is Founder and CEO of LernRad GmbH and has received speaker honoraria from Bayer Vital GmbH. T.A.D. is a Scientific Editorial Board member of European Radiology, and D.P.D.S. is a Deputy Editor of European Radiology; they have not taken part in this paper’s review and decision process. Statistics and biometry: One of the authors (F.C.L.) has significant statistical expertise. Informed consent: Written informed consent was waived by the Institutional Review Board. Ethical approval: Institutional Review Board approval was obtained (no. 20221004-02). Study subjects or cohorts overlap: The MIMIC chest X-ray (MIMIC-CXR) cohort was published in 2019 (https://doi.org/10.1038/s41597-019-0322-0). It is publicly available. Methodology: Retrospective; experimental; performed at one institution.

Figures

**Fig. 1**
Reporting template. The template consists of 48 question-answer pairs and includes questions with a binary answer (possible answers: finding present or absent, hidden for clarity), marked with question marks, and questions with specified answer options, marked with a colon (possible answers provided after the colon). The template includes nested questions, answered only if the parent finding is present
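The nesting rule described in this caption can be sketched as a small data structure with a validity check. This is a minimal illustration under stated assumptions: the question keys (`pleural_effusion`, `pleural_effusion_side`, `pneumothorax`), the answer options, and the `validate` helper are hypothetical and do not reproduce the actual 48-item template or the authors' output-control method.

```python
from dataclasses import dataclass, field

BINARY = ("present", "absent")


@dataclass
class Question:
    key: str
    options: tuple = BINARY          # allowed answers for this question
    children: list = field(default_factory=list)  # nested follow-up questions


def _descendant_keys(questions):
    """Yield the keys of all questions nested below the given ones."""
    for q in questions:
        yield q.key
        yield from _descendant_keys(q.children)


def validate(questions, report):
    """Return a list of problems; an empty list means the report is valid and complete.

    In this sketch, nested questions hang only under binary questions and
    must be answered if, and only if, the parent finding is "present".
    """
    errors = []
    for q in questions:
        answer = report.get(q.key)
        if answer is None:
            errors.append(f"missing answer: {q.key}")
        elif answer not in q.options:
            errors.append(f"invalid answer for {q.key}: {answer!r}")
        if answer == "present":
            errors.extend(validate(q.children, report))
        else:
            errors.extend(
                f"unexpected nested answer: {k}"
                for k in _descendant_keys(q.children)
                if k in report
            )
    return errors


# Hypothetical two-question fragment, not the study's template.
TEMPLATE = [
    Question(
        "pleural_effusion",
        children=[Question("pleural_effusion_side", options=("left", "right", "bilateral"))],
    ),
    Question("pneumothorax"),
]

report = {
    "pleural_effusion": "present",
    "pleural_effusion_side": "left",
    "pneumothorax": "absent",
}
print(validate(TEMPLATE, report))
```

A check of this kind could run on each model output so that incomplete or malformed structured reports are caught before evaluation.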

**Fig. 2**
Study overview. Chest radiography reports from two sources were analyzed: MIMIC-CXR (English) and UH (German). The open-access Llama-2-70B model was used to extract structured elements from free-text radiology reports. The results of the automated structuring were compared with human readers. Llama-2-70B image was generated using GPT-4 through https://chat.openai.com/ on 11.11.2023. MIMIC-CXR, MIMIC chest X-ray cohort; UH, University Hospital cohort

**Fig. 3**
Distribution of Matthews correlation coefficient (MCC) for Llama-2-70B. The kernel density plot presents the posterior distribution of the MCC with the 94% highest density interval. Rhomboid markers denote quartiles. The red distributions represent the cumulative MCC across all findings in a template section. MIMIC-CXR, MIMIC chest X-ray cohort; UH, University Hospital cohort

**Fig. 4**
Distribution of pairwise differences in Matthews correlation coefficient (MCC). The kernel density plot shows the posterior distribution of the MCC pairwise differences with the 94% highest density interval. Rhomboid markers denote quartiles. The green vertical shaded area is the region of practical equivalence (−0.05, 0.05). The red distributions represent the cumulative differences across all labels. MIMIC-CXR, MIMIC chest X-ray cohort; UH, University Hospital cohort
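To make the quantities in this figure concrete, the sketch below computes a 94% highest density interval and the share of posterior mass inside the ROPE from a list of posterior samples. It is an illustration only: the samples are synthetic draws from a normal distribution (not the study's MCMC output), and the helper names `hdi` and `fraction_in_rope` are assumptions.

```python
import math
import random


def hdi(samples, mass=0.94):
    """Narrowest interval that contains at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def fraction_in_rope(samples, rope=(-0.05, 0.05)):
    """Share of posterior samples that fall inside the ROPE (bounds included)."""
    low, high = rope
    return sum(1 for x in samples if low <= x <= high) / len(samples)


# Synthetic stand-in for posterior samples of an MCC difference (not study data).
rng = random.Random(0)
diff_samples = [rng.gauss(0.01, 0.03) for _ in range(4000)]

low, high = hdi(diff_samples)
print(f"94% HDI: ({low:.3f}, {high:.3f})")
print(f"Posterior mass inside ROPE: {fraction_in_rope(diff_samples):.2%}")
```

The sliding-window approach assumes a unimodal posterior; for multimodal distributions the highest density region may consist of several disjoint intervals.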

Grants and funding

01KD2215A/Bundesministerium für Bildung und Forschung
