Benchmarking vision-language models for diagnostics in emergency and critical care settings

Christoph F Kurz et al. NPJ Digit Med. 2025 Jul 10;8(1):423. doi: 10.1038/s41746-025-01837-2.

Abstract

The applicability of vision-language models (VLMs) for acute care in emergency and intensive care units remains underexplored. Using a multimodal dataset of diagnostic questions involving medical images and clinical context, we benchmarked several small open-source VLMs against GPT-4o. While open models demonstrated limited diagnostic accuracy (up to 40.4%), GPT-4o significantly outperformed them (68.1%). Findings highlight the need for specialized training and optimization to improve open-source VLMs for acute care applications.


Conflict of interest statement

Competing interests: C.F.K. and B.G. report employment by, and stock ownership in, Novartis Pharma. B.G. is a Strategic Advisory Board Member at Fraunhofer IZI-BB. B.G. serves as an Associate Editor of npj Digital Medicine and was not involved in the review or decision-making process for this manuscript. J.N.K. declares consulting services for Bioptimus, France; Panakeia, UK; AstraZeneca, UK; and MultiplexDx, Slovakia. Furthermore, J.N.K. holds shares in StratifAI, Germany; Synagen, Germany; and Ignition Lab, Germany. J.N.K. has received an institutional research grant from GSK and has received honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer, and Fresenius. T.M. and B.M.E. have nothing to declare.

Figures

Fig. 1
Fig. 1. Overview of the benchmarking process for evaluating VLMs on the NEJM Image Challenge.
VLMs analyze medical images and descriptions to select the correct diagnosis from multiple-choice answers, compared against expert readers’ consensus. Medical images shown are sourced from Wikimedia Commons.
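
As a concrete illustration of this setup, a minimal multiple-choice evaluation loop could look like the sketch below. The Question record, the prompt format, and the query_vlm helper are hypothetical placeholders for illustration, not the pipeline used in the study.

```python
# Hypothetical sketch of a multiple-choice VLM benchmark loop.
# The Question layout, prompt wording, and `query_vlm` helper are assumptions.
from dataclasses import dataclass

@dataclass
class Question:
    image_path: str          # medical image from the challenge
    clinical_context: str    # accompanying case description
    options: list[str]       # five candidate diagnoses (A-E)
    correct_index: int       # index of the expert-consensus answer

def build_prompt(q: Question) -> str:
    letters = "ABCDE"
    choices = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(q.options))
    return (
        f"{q.clinical_context}\n\n"
        f"Which diagnosis best fits the image and description?\n{choices}\n"
        "Answer with a single letter."
    )

def evaluate(questions: list[Question], query_vlm) -> float:
    """Return overall accuracy; `query_vlm(image_path, prompt) -> str` wraps any model."""
    correct = 0
    for q in questions:
        answer = query_vlm(q.image_path, build_prompt(q)).strip().upper()[:1]
        if answer == "ABCDE"[q.correct_index]:
            correct += 1
    return correct / len(questions)
```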
Fig. 2
Fig. 2. Percentage of correct answers for each model by difficulty level.
Each bar is segmented to show results for easy (green), medium (purple), and hard (orange) questions. The total height of each bar represents the overall percentage of correct answers for that model. The horizontal dashed line indicates the random-guessing threshold; because each challenge offered five multiple-choice answers, random guessing corresponds to 20% accuracy. The horizontal dotted line represents the average performance of human responders to the challenge. DeepSeek VL2 Tiny (1B), InternVL 2.5 (1B), and Smol (500M) performed worse than random guessing, with correct answer percentages below 20%. Granite Vision 3.2, InternVL 2.5 (2B), and Smol (2B) performed slightly above random guessing, with accuracies of 23.0%, 22.6%, and 25.4%, respectively. The InternVL 2.5 and Qwen 2.5 VL model families exhibited a consistent improvement in accuracy with increased model complexity; for instance, InternVL 2.5 (4B) answered correctly in 32.6% of cases, and InternVL 2.5 (8B) showed a slight improvement with an accuracy of 35.0%. The Qwen 2.5 VL (3B) model achieved 35.6% accuracy, while Qwen 2.5 VL (7B) improved further to 40.4%. The Phi4 Multi (5B) and Gemma 3 (4B) models reached accuracies of 33.4% and 35.5%, respectively. Notably, none of the open VLMs could compete with GPT-4o, which correctly answered more than two-thirds (68.1%) of challenge questions.
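
The per-model and per-difficulty accuracies and the 20% chance baseline can be tallied from a table of per-question correctness. The sketch below assumes a pandas DataFrame with hypothetical columns "model", "difficulty", and "correct"; it is illustrative rather than the authors' analysis code.

```python
# Illustrative aggregation of per-question results into the quantities shown in Fig. 2.
# The column names ("model", "difficulty", "correct") are assumed for this sketch.
import pandas as pd

RANDOM_GUESS = 1 / 5  # five answer options per challenge -> 20% chance accuracy

def summarize(results: pd.DataFrame) -> pd.DataFrame:
    """Overall and per-difficulty accuracy for each model, plus an above-chance flag."""
    overall = results.groupby("model")["correct"].mean().rename("overall_accuracy")
    by_difficulty = results.pivot_table(
        index="model", columns="difficulty", values="correct", aggfunc="mean"
    )
    summary = by_difficulty.join(overall)
    summary["above_chance"] = summary["overall_accuracy"] > RANDOM_GUESS
    return summary
```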
Fig. 3
Fig. 3. Heatmap of model correctness for each challenge question.
The colored bars along the right-hand side classify each question’s difficulty based on the NEJM human responder accuracy, ranging from ‘hard’ (orange, ≤44%) at the top, to ‘medium’ (purple, 45–55%), and ‘easy’ (green, ≥56%) at the bottom. Each row corresponds to a single question, and each column corresponds to one of the evaluated VLMs. A dark cell indicates that the model selected the correct multiple-choice answer; a blue cell indicates that the model’s final answer was incorrect. Humans were categorized as giving the correct answer if more than 50% of NEJM readers answered correctly. The distribution of correct answers across easy, medium, and hard categories remained relatively stable, indicating that the models’ capabilities were consistent irrespective of question difficulty as perceived from a human point of view. Despite the accuracy improvements within certain model families, we noted inconsistencies. For example, the smaller InternVL 2.5 (1B) answered some challenges correctly that the larger InternVL 2.5 (2B) did not. Additionally, DeepSeek VL2 Tiny (1B) and InternVL 2.5 (1B) performed comparably and below random guessing, yet their answering patterns showed little overlap.
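
The difficulty labels and the human-correctness criterion follow directly from the thresholds stated in this legend; the function names in the short sketch below are illustrative.

```python
# Difficulty labels and the human-correctness criterion from the figure legend.
# Function names are illustrative; thresholds are the ones stated above.
def difficulty_label(human_accuracy_pct: float) -> str:
    """Classify a question by the percentage of NEJM readers who answered it correctly."""
    if human_accuracy_pct <= 44:
        return "hard"      # orange, <=44%
    if human_accuracy_pct <= 55:
        return "medium"    # purple, 45-55%
    return "easy"          # green, >=56%

def human_answered_correctly(human_accuracy_pct: float) -> bool:
    """Humans count as correct when more than 50% of readers chose the right answer."""
    return human_accuracy_pct > 50
```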
Fig. 4
Fig. 4. Correlation matrix of model response patterns across VLMs.
The heatmap displays Phi coefficients (ranging from –1 to 1) quantifying the similarity in correctness patterns among vision-language models, with higher values indicating greater similarity. The strongest correlations (up to 0.5) occur within model families such as InternVL (1B–8B) and Qwen (3B, 7B), reflecting consistent intra-family performance. Smol models also correlate well internally but diverge from others. Granite Vision 3.2 (2B), Phi4 Multi (5B), and Gemma 3 (4B) show moderate correlations with the larger InternVL and Qwen models. DeepSeek VL2 Tiny (1B) exhibits low correlation with the similarly performing Smol (500M), suggesting a distinctive response pattern.
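
For binary correctness indicators, the Phi coefficient equals the Pearson correlation of the 0/1 vectors, so a matrix like the one shown can be computed as sketched below; the DataFrame layout (one 0/1 column per model, one row per question) is an assumption.

```python
# Phi coefficients between models' per-question correctness patterns (0/1 vectors).
# For binary variables the Phi coefficient equals the Pearson correlation, so a
# correlation matrix over 0/1 columns reproduces the quantity plotted in Fig. 4.
# The DataFrame layout (one column per model, one row per question) is assumed.
import pandas as pd

def phi_matrix(correctness: pd.DataFrame) -> pd.DataFrame:
    """correctness: rows = questions, columns = models, values in {0, 1}."""
    return correctness.astype(float).corr(method="pearson")
```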
Fig. 5
Fig. 5. Distribution of correct answer percentages for the NEJM Image Challenge by the readers.
The histogram shows the number of questions in each of three difficulty categories: “hard” (orange, 0–44%), “medium” (purple, 45–55%), and “easy” (green, 56–100%).
