The DRAGON benchmark for clinical NLP
- PMID: 40379835
- PMCID: PMC12084576
- DOI: 10.1038/s41746-025-01626-x
Abstract
Artificial Intelligence can mitigate the global shortage of medical diagnostic personnel but requires large-scale annotated datasets to train clinical algorithms. Natural Language Processing (NLP), including Large Language Models (LLMs), shows great potential for annotating clinical data to facilitate algorithm development but remains underexplored due to a lack of public benchmarks. This study introduces the DRAGON challenge, a benchmark for clinical NLP with 28 tasks and 28,824 annotated medical reports from five Dutch care centers. It facilitates automated, large-scale, cost-effective data annotation. Foundational LLMs were pretrained using four million clinical reports from a sixth Dutch care center. Evaluations showed the superiority of domain-specific pretraining (DRAGON 2025 test score of 0.770) and mixed-domain pretraining (0.756), compared to general-domain pretraining (0.734, p < 0.005). While strong performance was achieved on 18/28 tasks, performance was subpar on 10/28 tasks, uncovering where innovations are needed. Benchmark, code, and foundational LLMs are publicly available.
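To illustrate the kind of comparison the abstract reports (an aggregate benchmark score per pretraining strategy, plus a paired significance test across tasks), here is a minimal sketch in pure Python. The per-task scores, the plain-mean aggregation, and the permutation test are all illustrative assumptions, not the DRAGON challenge's actual scoring or statistical procedure.

```python
import random
import statistics

def benchmark_score(per_task_scores):
    """Aggregate per-task scores into one benchmark score.
    Here: the plain mean (the actual DRAGON aggregation may differ)."""
    return statistics.mean(per_task_scores)

def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on per-task score differences:
    randomly flip the sign of each difference and count how often the
    resampled total is at least as extreme as the observed total."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_resamples):
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) >= observed:
            extreme += 1
    return extreme / n_resamples

# Hypothetical per-task scores for two pretraining strategies
domain_specific = [0.82, 0.71, 0.65, 0.90, 0.77, 0.80, 0.74, 0.69]
general_domain = [0.78, 0.70, 0.60, 0.88, 0.72, 0.75, 0.73, 0.66]

print(benchmark_score(domain_specific))
print(paired_permutation_test(domain_specific, general_domain))
```

With these made-up scores, the domain-specific strategy wins on every task, so the permutation p-value comes out well below 0.05, mirroring the shape (not the values) of the comparison reported above.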
© 2025. The Author(s).
Conflict of interest statement
Competing interests: A.S. has received a lecture honorarium from Guerbet. F.C. has been Chair of the Scientific and Medical Advisory Board of TRIBVN Healthcare, has received advisory board fees from TRIBVN Healthcare, and is a shareholder in Aiosyn BV.