The DRAGON benchmark for clinical NLP

Joeran S Bosma et al. NPJ Digit Med. 2025 May 17;8(1):289. doi: 10.1038/s41746-025-01626-x.

Abstract

Artificial Intelligence can mitigate the global shortage of medical diagnostic personnel but requires large-scale annotated datasets to train clinical algorithms. Natural Language Processing (NLP), including Large Language Models (LLMs), shows great potential for annotating clinical data to facilitate algorithm development but remains underexplored due to a lack of public benchmarks. This study introduces the DRAGON challenge, a benchmark for clinical NLP with 28 tasks and 28,824 annotated medical reports from five Dutch care centers. It facilitates automated, large-scale, cost-effective data annotation. Foundational LLMs were pretrained using four million clinical reports from a sixth Dutch care center. Evaluations showed the superiority of domain-specific pretraining (DRAGON 2025 test score of 0.770) and mixed-domain pretraining (0.756), compared to general-domain pretraining (0.734, p < 0.005). While strong performance was achieved on 18/28 tasks, performance was subpar on 10/28 tasks, uncovering where innovations are needed. Benchmark, code, and foundational LLMs are publicly available.


Conflict of interest statement

Competing interests: A.S. has received a lecture honorarium from Guerbet. F.C. has been Chair of the Scientific and Medical Advisory Board of TRIBVN Healthcare, has received advisory board fees from TRIBVN Healthcare, and is a shareholder in Aiosyn BV.

Figures

Fig. 1. Overview of the tasks in the DRAGON benchmark.
Tasks are grouped by task type. For each task, key statistics are shown: the number of development cases (blue), the median report length (orange), and the maximum report length (green). Report length is expressed as the number of tokens produced by the xlm-roberta-base tokenizer.
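As an illustration of how such token counts can be obtained, the sketch below uses the Hugging Face transformers library with the xlm-roberta-base tokenizer; the example report text and the choice to exclude special tokens are assumptions, not details taken from the paper.

    # Minimal sketch: report length in xlm-roberta-base tokens (Hugging Face transformers).
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    report = "MRI prostaat: PI-RADS 4 laesie in de perifere zone links."  # hypothetical example report
    # Excluding special tokens here is an assumption; the paper does not specify this detail.
    n_tokens = len(tokenizer(report, add_special_tokens=False)["input_ids"])
    print(f"report length: {n_tokens} tokens")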
Fig. 2. Workflow for the DRAGON benchmark.
Challenge participants must provide all resources necessary to process the reports and generate predictions for the test set. Processing of reports is performed on the Grand Challenge platform, without any interaction with the participant.
Fig. 3. Experimental setup to compare pretraining strategies.
Several LLM architectures are pretrained using either general-domain, domain-specific, or mixed-domain pretraining (general-domain followed by domain-specific pretraining). Each of the resulting pretrained foundational models is evaluated on the DRAGON benchmark by task-specific fine-tuning followed by performance evaluation on the test set. To assess fine-tuning stability, the training and validation datasets rotate with five-fold cross-validation, resulting in five performance assessments for each of the 28 tasks per pretrained model.
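The rotation of training and validation folds can be sketched as follows. This is a toy illustration only: the synthetic data, the TF-IDF plus logistic-regression pipeline, and the AUC metric are stand-ins for the actual LLM fine-tuning and task-specific metrics used in the benchmark.

    # Sketch of the five-fold rotation of training/validation data for one task.
    # A TF-IDF + logistic regression pipeline stands in for LLM fine-tuning;
    # the data and metric are placeholders, not the benchmark's own.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    reports = np.array([f"toy report number {i}" for i in range(100)])
    labels = rng.integers(0, 2, size=100)

    scores = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(reports):
        vec = TfidfVectorizer().fit(reports[train_idx])
        clf = LogisticRegression(max_iter=1000).fit(vec.transform(reports[train_idx]), labels[train_idx])
        # In the benchmark, the fine-tuned model is evaluated on a hidden test set;
        # here the held-out fold stands in for that evaluation step.
        probs = clf.predict_proba(vec.transform(reports[val_idx]))[:, 1]
        scores.append(roc_auc_score(labels[val_idx], probs))

    print(f"five run scores: {np.round(scores, 3)}, mean: {np.mean(scores):.3f}")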
Fig. 4. Benchmark results.
Performance across all architectures, tasks, and training runs in the DRAGON benchmark for the three pretraining strategies: (1) general-domain pretraining, (2) mixed-domain pretraining, and (3) domain-specific pretraining. Performance metrics from individual fine-tuning runs are shown as black dots (5 architectures × 28 tasks × 5 runs, giving 700 scores per pretraining method). The diamond and error bars show the DRAGON 2025 test score (the average of all individual run scores) and its 95% confidence interval. The blue shading is a violin plot showing the estimated density of the individual scores.
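One way to reproduce the aggregation described in this caption is sketched below. The per-run scores are synthetic placeholders, and the percentile bootstrap is only one plausible choice of confidence interval, since the caption does not state how the interval is computed.

    # Sketch: aggregate 5 architectures x 28 tasks x 5 runs = 700 run scores into a
    # single benchmark score with a 95% interval. Scores are synthetic placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    run_scores = rng.uniform(0.5, 0.9, size=(5, 28, 5))  # architectures x tasks x runs

    dragon_score = run_scores.mean()  # average over all individual run scores

    # Percentile bootstrap over the 700 individual scores; an assumption, since the
    # caption does not specify the CI procedure.
    flat = run_scores.ravel()
    boot_means = np.array([rng.choice(flat, size=flat.size, replace=True).mean()
                           for _ in range(10_000)])
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
    print(f"DRAGON 2025 test score: {dragon_score:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")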


