Benchmarking vision-language models for diagnostics in emergency and critical care settings

Christoph F Kurz et al. NPJ Digit Med. 2025 Jul 10;8(1):423. doi: 10.1038/s41746-025-01837-2.

Abstract

The applicability of vision-language models (VLMs) for acute care in emergency and intensive care units remains underexplored. Using a multimodal dataset of diagnostic questions involving medical images and clinical context, we benchmarked several small open-source VLMs against GPT-4o. While open models demonstrated limited diagnostic accuracy (up to 40.4%), GPT-4o significantly outperformed them (68.1%). Findings highlight the need for specialized training and optimization to improve open-source VLMs for acute care applications.


Conflict of interest statement

Competing interests: C.F.K. and B.G. report employment by, and stock ownership in, Novartis Pharma. B.G. is a Strategic Advisory Board Member at Fraunhofer IZI-BB. B.G. serves as an Associate Editor of npj Digital Medicine and was not involved in the review or decision-making process for this manuscript. J.N.K. declares consulting services for Bioptimus, France; Panakeia, UK; AstraZeneca, UK; and MultiplexDx, Slovakia. Furthermore, J.N.K. holds shares in StratifAI, Germany; Synagen, Germany; and Ignition Lab, Germany. J.N.K. has received an institutional research grant from GSK and has received honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer, and Fresenius. T.M. and B.M.E. have nothing to declare.

Figures

Fig. 1
Fig. 1. Overview of the benchmarking process for evaluating VLMs on the NEJM Image Challenge.
VLMs analyze medical images and descriptions to select the correct diagnosis from multiple-choice answers, compared against expert readers’ consensus. Medical images shown are sourced from Wikimedia Commons.
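
As a concrete illustration of this setup, a minimal multiple-choice evaluation loop could look like the sketch below. The Question record, the prompt format, and the query_vlm helper are hypothetical placeholders for illustration, not the pipeline used in the study.

```python
# Hypothetical sketch of a multiple-choice VLM benchmark loop.
# The Question layout, prompt wording, and `query_vlm` helper are assumptions.
from dataclasses import dataclass

@dataclass
class Question:
    image_path: str          # medical image from the challenge
    clinical_context: str    # accompanying case description
    options: list[str]       # five candidate diagnoses (A-E)
    correct_index: int       # index of the expert-consensus answer

def build_prompt(q: Question) -> str:
    letters = "ABCDE"
    choices = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(q.options))
    return (
        f"{q.clinical_context}\n\n"
        f"Which diagnosis best fits the image and description?\n{choices}\n"
        "Answer with a single letter."
    )

def evaluate(questions: list[Question], query_vlm) -> float:
    """Return overall accuracy; `query_vlm(image_path, prompt) -> str` wraps any model."""
    correct = 0
    for q in questions:
        answer = query_vlm(q.image_path, build_prompt(q)).strip().upper()[:1]
        if answer == "ABCDE"[q.correct_index]:
            correct += 1
    return correct / len(questions)
```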
Fig. 2
Fig. 2. Percentage of correct answers for each model by difficulty level.
Each bar is segmented to show results for easy (green), medium (purple), and hard (orange) questions. The total height of each bar represents the overall percentage of correct answers for that model. The horizontal dashed line indicates the random-guessing threshold; because each challenge offered five multiple-choice answers, random guessing corresponds to 20% accuracy. The horizontal dotted line represents the average performance of human responders to the challenge. DeepSeek VL2 Tiny (1B), InternVL 2.5 (1B), and Smol (500M) performed worse than random guessing, with correct answer percentages below 20%. Granite Vision 3.2, InternVL 2.5 (2B), and Smol (2B) performed slightly above random guessing, with accuracies of 23.0%, 22.6%, and 25.4%, respectively. The InternVL 2.5 and Qwen 2.5 VL model families exhibited a consistent improvement in accuracy with increased model complexity; for instance, InternVL 2.5 (4B) answered correctly in 32.6% of cases, and InternVL 2.5 (8B) showed a slight improvement with an accuracy of 35.0%. The Qwen 2.5 VL (3B) model achieved 35.6% accuracy, while Qwen 2.5 VL (7B) improved further to 40.4%. The Phi4 Multi (5B) and Gemma 3 (4B) models reached accuracies of 33.4% and 35.5%, respectively. Notably, none of the open VLMs could compete with GPT-4o, which correctly answered more than two-thirds (68.1%) of challenge questions.
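
The per-model and per-difficulty accuracies and the 20% chance baseline can be tallied from a table of per-question correctness. The sketch below assumes a pandas DataFrame with hypothetical columns "model", "difficulty", and "correct"; it is illustrative rather than the authors' analysis code.

```python
# Illustrative aggregation of per-question results into the quantities shown in Fig. 2.
# The column names ("model", "difficulty", "correct") are assumed for this sketch.
import pandas as pd

RANDOM_GUESS = 1 / 5  # five answer options per challenge -> 20% chance accuracy

def summarize(results: pd.DataFrame) -> pd.DataFrame:
    """Overall and per-difficulty accuracy for each model, plus an above-chance flag."""
    overall = results.groupby("model")["correct"].mean().rename("overall_accuracy")
    by_difficulty = results.pivot_table(
        index="model", columns="difficulty", values="correct", aggfunc="mean"
    )
    summary = by_difficulty.join(overall)
    summary["above_chance"] = summary["overall_accuracy"] > RANDOM_GUESS
    return summary
```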
Fig. 3
Fig. 3. Heatmap of model correctness for each challenge question.
The colored bars along the right-hand side classify each question’s difficulty based on the NEJM human responder accuracy, ranging from ‘hard’ (orange, ≤44%) at the top, to ‘medium’ (purple, 45–55%), and ‘easy’ (green, ≥56%) at the bottom. Each row corresponds to a single question, and each column corresponds to one of the evaluated VLMs. A dark cell indicates that the model selected the correct multiple-choice answer; a blue cell indicates that the model’s final answer was incorrect. Humans were categorized as giving the correct answer if more than 50% of NEJM readers answered correctly. The distribution of correct answers across easy, medium, and hard categories remained relatively stable, indicating that the models’ capabilities were consistent irrespective of question difficulty as perceived from a human point of view. Despite the accuracy improvements within certain model families, we noted inconsistencies. For example, the smaller InternVL 2.5 (1B) answered some challenges correctly that the larger InternVL 2.5 (2B) did not. Additionally, DeepSeek VL2 Tiny (1B) and InternVL 2.5 (1B) performed comparably and below random guessing, yet their answering patterns showed little overlap.
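
The difficulty labels and the human-correctness criterion follow directly from the thresholds stated in this legend; the function names in the short sketch below are illustrative.

```python
# Difficulty labels and the human-correctness criterion from the figure legend.
# Function names are illustrative; thresholds are the ones stated above.
def difficulty_label(human_accuracy_pct: float) -> str:
    """Classify a question by the percentage of NEJM readers who answered it correctly."""
    if human_accuracy_pct <= 44:
        return "hard"      # orange, <=44%
    if human_accuracy_pct <= 55:
        return "medium"    # purple, 45-55%
    return "easy"          # green, >=56%

def human_answered_correctly(human_accuracy_pct: float) -> bool:
    """Humans count as correct when more than 50% of readers chose the right answer."""
    return human_accuracy_pct > 50
```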
Fig. 4
Fig. 4. Correlation matrix of model response patterns across VLMs.
The heatmap displays Phi coefficients (ranging from –1 to 1) quantifying the similarity in correctness patterns among vision-language models, with higher values indicating greater similarity. The strongest correlations (up to 0.5) occur within model families such as InternVL (1B–8B) and Qwen (3B, 7B), reflecting consistent intra-family performance. Smol models also correlate well internally but diverge from others. Granite Vision 3.2 (2B), Phi4 Multi (5B), and Gemma 3 (4B) show moderate correlations with the larger InternVL and Qwen models. DeepSeek VL2 Tiny (1B) exhibits low correlation with the similarly performing Smol (500M), suggesting a distinctive response pattern.
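
For binary correctness indicators, the Phi coefficient equals the Pearson correlation of the 0/1 vectors, so a matrix like the one shown can be computed as sketched below; the DataFrame layout (one 0/1 column per model, one row per question) is an assumption.

```python
# Phi coefficients between models' per-question correctness patterns (0/1 vectors).
# For binary variables the Phi coefficient equals the Pearson correlation, so a
# correlation matrix over 0/1 columns reproduces the quantity plotted in Fig. 4.
# The DataFrame layout (one column per model, one row per question) is assumed.
import pandas as pd

def phi_matrix(correctness: pd.DataFrame) -> pd.DataFrame:
    """correctness: rows = questions, columns = models, values in {0, 1}."""
    return correctness.astype(float).corr(method="pearson")
```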
Fig. 5
Fig. 5. Distribution of correct answer percentages for the NEJM Image Challenge by the readers.
The histogram shows the number of questions in each of three difficulty categories: “hard” (orange, 0–44%), “medium” (purple, 45–55%), and “easy” (green, 56–100%).
