The DRAGON benchmark for clinical NLP

Joeran S Bosma¹ ² ³, Koen Dercksen⁴, Luc Builtjes⁴, Romain André⁴, Christian Roest⁵, Stefan J Fransen⁵, Constant R Noordman⁴ ⁶, Mar Navarro-Padilla⁴, Judith Lefkes⁷, Natália Alves⁴, Max J J de Grauw⁴, Leander van Eekelen⁷, Joey M A Spronck⁷, Megan Schuurmans⁴, Bram de Wilde⁴, Ward Hendrix⁴, Witali Aswolinskiy⁷, Anindo Saha⁴ ⁶, Jasper J Twilt⁶, Daan Geijs⁷, Jeroen Veltman⁸, Derya Yakar⁹ ⁵, Maarten de Rooij¹⁰, Francesco Ciompi⁷, Alessa Hering⁴, Jeroen Geerdink¹¹, Henkjan Huisman⁴; DRAGON consortium

Affiliations

¹ Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center, Nijmegen, The Netherlands. Joeran.Bosma@radboudumc.nl.
² Department of Health & Information Technology, Ziekenhuisgroep Twente, Almelo, The Netherlands. Joeran.Bosma@radboudumc.nl.
³ Department of Radiology, Netherlands Cancer Institute, Amsterdam, The Netherlands. Joeran.Bosma@radboudumc.nl.
⁴ Diagnostic Image Analysis Group, Department of Medical Imaging, Radboud University Medical Center, Nijmegen, The Netherlands.
⁵ Department of Radiology, University Medical Center Groningen, Groningen, The Netherlands.
⁶ Minimally Invasive Image-Guided Intervention Center, Department of Medical Imaging, Radboud University Medical Center, Nijmegen, The Netherlands.
⁷ Computational Pathology Group, Department of Pathology, Radboud University Medical Center, Nijmegen, The Netherlands.
⁸ Department of Radiology, Ziekenhuisgroep Twente, Almelo, The Netherlands.
⁹ Department of Radiology, Netherlands Cancer Institute, Amsterdam, The Netherlands.
¹⁰ Department of Medical Imaging, Radboud University Medical Center, Nijmegen, The Netherlands.
¹¹ Department of Health & Information Technology, Ziekenhuisgroep Twente, Almelo, The Netherlands.

PMID: 40379835
PMCID: PMC12084576
DOI: 10.1038/s41746-025-01626-x

Joeran S Bosma et al. NPJ Digit Med. 2025 May 17;8(1):289.

Abstract

Artificial Intelligence can mitigate the global shortage of medical diagnostic personnel but requires large-scale annotated datasets to train clinical algorithms. Natural Language Processing (NLP), including Large Language Models (LLMs), shows great potential for annotating clinical data to facilitate algorithm development but remains underexplored due to a lack of public benchmarks. This study introduces the DRAGON challenge, a benchmark for clinical NLP with 28 tasks and 28,824 annotated medical reports from five Dutch care centers. It facilitates automated, large-scale, cost-effective data annotation. Foundational LLMs were pretrained using four million clinical reports from a sixth Dutch care center. Evaluations showed the superiority of domain-specific pretraining (DRAGON 2025 test score of 0.770) and mixed-domain pretraining (0.756), compared to general-domain pretraining (0.734, p < 0.005). While strong performance was achieved on 18/28 tasks, performance was subpar on 10/28 tasks, uncovering where innovations are needed. Benchmark, code, and foundational LLMs are publicly available.

Conflict of interest statement

Competing interests: A.S. has received lecture honorarium from Guerbet. F.C. has been Chair of the Scientific and Medical Advisory Board of TRIBVN Healthcare, received advisory board fees from TRIBVN Healthcare, and is shareholder in Aiosyn BV.

Figures

**Fig. 1. Overview of the tasks in the DRAGON benchmark.**
Tasks are grouped by their task type. For each task, key statistics are shown: (blue) the number of development cases, (orange) the median report length, and (green) the maximum report length. The report length is expressed in the number of tokens with an xlm-roberta-base tokenizer.

**Fig. 2. Workflow for the DRAGON benchmark.**
Challenge participants must provide all resources necessary to process the reports and generate predictions for the test set. Processing of reports is performed on the Grand Challenge platform, without any interaction with the participant.

**Fig. 3. Experimental setup to compare pretraining strategies.**
Several LLM architectures are pretrained using either general-domain, domain-specific, or mixed-domain pretraining (general-domain followed by domain-specific pretraining). Each of the resulting pretrained foundational models is evaluated on the DRAGON benchmark by task-specific fine-tuning followed by performance evaluation on the test set. To assess fine-tuning stability, the training and validation datasets rotate with five-fold cross-validation, resulting in five performance assessments for each of the 28 tasks per pretrained model.
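
The rotation of training and validation data described for Fig. 3 can be illustrated with a minimal sketch. The function name `five_fold_splits`, the interleaved fold assignment, and the placeholder case identifiers are assumptions made for illustration; the benchmark's actual split procedure is not specified here.

```python
def five_fold_splits(case_ids, n_folds=5):
    """Yield (train, validation) pairs, rotating the held-out fold.

    Assumption: folds are built by simple interleaving; the benchmark
    may assign cases to folds differently.
    """
    folds = [case_ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        yield train, validation


# Hypothetical development cases for a single task.
cases = [f"case_{i:03d}" for i in range(20)]
splits = list(five_fold_splits(cases))

for run, (train, validation) in enumerate(splits):
    # Each run fine-tunes on `train`, selects on `validation`,
    # and is then scored on the fixed, separate test set.
    print(run, len(train), len(validation))
```

Each of the five runs sees a different validation fold, which is what produces five performance assessments per task and pretrained model.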

**Fig. 4. Benchmark results.**
Performance observed across each architecture, task, and training run in the DRAGON benchmark for the three pretraining strategies: (1) general-domain pretraining, (2) mixed-domain pretraining, and (3) domain-specific pretraining. Performance metrics from individual fine-tuning runs are shown as black dots (from 5 architectures, 28 tasks, and 5 runs, resulting in 700 scores per pretraining method). The diamond and error bars show the DRAGON 2025 test score (average of the score from each run) and its 95% confidence interval. The blue shading represents the density estimation of individual scores in a violin plot.
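
The arithmetic behind the 700 scores per pretraining method, and their aggregation into a single score with an interval, can be sketched as follows. The scores below are random placeholders, not benchmark results, and the percentile bootstrap is an assumed way of obtaining a 95% confidence interval; the caption does not state how the interval was computed.

```python
import random
import statistics

N_ARCHITECTURES, N_TASKS, N_RUNS = 5, 28, 5  # counts from the Fig. 4 caption


def aggregate(scores, n_boot=1000, seed=0):
    """Return the mean score and a percentile-bootstrap 95% interval.

    Assumption: a plain bootstrap over individual run scores; the
    published interval may be derived differently.
    """
    rng = random.Random(seed)
    mean = statistics.fmean(scores)
    boot = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lower = boot[int(0.025 * n_boot)]
    upper = boot[int(0.975 * n_boot) - 1]
    return mean, (lower, upper)


# Placeholder scores standing in for one pretraining method.
rng = random.Random(42)
scores = [rng.uniform(0.5, 1.0) for _ in range(N_ARCHITECTURES * N_TASKS * N_RUNS)]

mean, (lower, upper) = aggregate(scores)
print(len(scores), round(mean, 3), round(lower, 3), round(upper, 3))
```

The product 5 × 28 × 5 gives the 700 individual scores shown as dots in the figure; the mean and interval correspond to the diamond and error bars.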
