Benchmarking vision-language models for diagnostics in emergency and critical care settings
- PMID: 40640347
- PMCID: PMC12246445
- DOI: 10.1038/s41746-025-01837-2
Abstract
The applicability of vision-language models (VLMs) for acute care in emergency and intensive care units remains underexplored. Using a multimodal dataset of diagnostic questions involving medical images and clinical context, we benchmarked several small open-source VLMs against GPT-4o. While open models demonstrated limited diagnostic accuracy (up to 40.4%), GPT-4o significantly outperformed them (68.1%). Findings highlight the need for specialized training and optimization to improve open-source VLMs for acute care applications.
© 2025. The Author(s).
Conflict of interest statement
Competing interests: C.F.K. and B.G. report employment by and stock ownership in Novartis Pharma. B.G. is a Strategic Advisory Board Member at Fraunhofer IZI-BB. B.G. serves as an Associate Editor of npj Digital Medicine and was not involved in the review or decision-making process for this manuscript. J.N.K. declares consulting services for Bioptimus, France; Panakeia, UK; AstraZeneca, UK; and MultiplexDx, Slovakia. Furthermore, J.N.K. holds shares in StratifAI, Germany; Synagen, Germany; and Ignition Lab, Germany; has received an institutional research grant from GSK; and has received honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer, and Fresenius. T.M. and B.M.E. have nothing to declare.