BMC Med Inform Decis Mak. 2025 Mar 7;25(1):117.
doi: 10.1186/s12911-025-02954-4.

A systematic review of large language model (LLM) evaluations in clinical medicine

Sina Shool et al. BMC Med Inform Decis Mak.

Abstract

Background: Large Language Models (LLMs), advanced AI tools based on transformer architectures, demonstrate significant potential in clinical medicine by enhancing decision support, diagnostics, and medical education. However, their integration into clinical workflows requires rigorous evaluation to ensure reliability, safety, and ethical alignment.

Objective: This systematic review examines the evaluation parameters and methodologies applied to LLMs in clinical medicine, highlighting their capabilities, limitations, and application trends.

Methods: A comprehensive review of the literature was conducted across PubMed, Scopus, Web of Science, IEEE Xplore, and arXiv databases, encompassing both peer-reviewed and preprint studies. Studies were screened against predefined inclusion and exclusion criteria to identify original research evaluating LLM performance in medical contexts.

Results: A total of 761 studies met the inclusion criteria, reflecting growing interest in applying LLMs in clinical settings. General-domain LLMs, particularly ChatGPT and GPT-4, dominated the evaluations (93.55%), whereas medical-domain LLMs accounted for only 6.45%. Accuracy was the most commonly assessed parameter (21.78%). The included studies nonetheless show limitations and biases, which call for careful interpretation and robust evaluation frameworks.

Conclusions: The exponential growth in LLM research underscores the transformative potential of LLMs in healthcare. Realizing that potential will require addressing ethical risks, variability in evaluation methods, and the underrepresentation of critical specialties. Future efforts should prioritize standardized frameworks to ensure safe, effective, and equitable LLM integration in clinical practice.

Keywords: Artificial intelligence in medicine; Clinical medicine; Deep learning in healthcare; LLM evaluation; Large language models; Natural language processing; Systematic review.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1. PRISMA flow diagram for systematic reviews which included searches of databases
Fig. 2. Distribution of evaluation parameters in total and across groups
