A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians
- PMID: 40121370
- PMCID: PMC11929846
- DOI: 10.1038/s41746-025-01543-z
Abstract
While generative artificial intelligence (AI) has shown potential in medical diagnostics, comprehensive evaluation of its diagnostic performance and comparison with physicians has not been extensively explored. We conducted a systematic review and meta-analysis of studies validating generative AI models for diagnostic tasks published between June 2018 and June 2024. Analysis of 83 studies revealed an overall diagnostic accuracy of 52.1%. No significant performance difference was found between AI models and physicians overall (p = 0.10) or non-expert physicians (p = 0.93). However, AI models performed significantly worse than expert physicians (p = 0.007). Several models demonstrated slightly higher performance compared to non-experts, although the differences were not significant. Generative AI demonstrates promising diagnostic capabilities with accuracy varying by model. Although it has not yet achieved expert-level reliability, these findings suggest potential for enhancing healthcare delivery and medical education when implemented with appropriate understanding of its limitations.
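The paper does not include its pooling code, but as a rough illustration of how a single pooled figure like the 52.1% overall accuracy is typically derived from many studies, here is a minimal DerSimonian–Laird random-effects sketch over logit-transformed per-study accuracies. The function name and the `(correct, total)` data layout are assumptions for illustration, not the authors' actual method.

```python
import math

def pool_accuracy(studies):
    """DerSimonian-Laird random-effects pooling of logit-transformed accuracies.

    studies: list of (correct, total) tuples, one per study.
    Returns the pooled accuracy as a proportion in (0, 1).
    """
    # Logit-transform each study's accuracy; a 0.5 continuity
    # correction avoids infinities at 0% or 100% accuracy.
    y, v = [], []
    for k, n in studies:
        p = (k + 0.5) / (n + 1.0)
        y.append(math.log(p / (1 - p)))
        v.append(1.0 / (k + 0.5) + 1.0 / (n - k + 0.5))  # variance of the logit

    w = [1.0 / vi for vi in v]                 # fixed-effect (inverse-variance) weights
    sw = sum(w)
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw

    # Cochran's Q and the DL estimate of between-study variance tau^2.
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(y) - 1)) / c)

    # Random-effects weights incorporate tau^2, then back-transform.
    wr = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(wr, y)) / sum(wr)
    return 1.0 / (1.0 + math.exp(-pooled))
```

For example, `pool_accuracy([(52, 100), (48, 100), (55, 100)])` returns a proportion near 0.52; identical studies pool to their common accuracy, and heterogeneous studies are down-weighted toward equal weights as tau^2 grows.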
© 2025. The Author(s).
Conflict of interest statement
Competing interests: The authors declare no competing interests.