Front Med (Lausanne). 2025 Nov 19;12:1670824.
doi: 10.3389/fmed.2025.1670824. eCollection 2025.

Clinical applications of large language models in knee osteoarthritis: a systematic review

Zebing Ma et al. Front Med (Lausanne).

Abstract

Background and aims: Knee osteoarthritis (KOA) is a common chronic degenerative disease that significantly impacts patients' quality of life. With the rapid advancement of artificial intelligence, large language models (LLMs) have demonstrated potential in supporting medical information extraction, clinical decision-making, and patient education through their natural language processing capabilities. However, the current landscape of LLM applications in the KOA domain, along with their methodological quality, has yet to be systematically reviewed. Therefore, this systematic review aims to comprehensively summarize existing clinical studies on LLMs in KOA, evaluate their performance and methodological rigor, and identify current challenges and future research directions.

Methods: Following the PRISMA guidelines, a systematic search of PubMed, the Cochrane Library, Embase, and Web of Science was conducted for literature published up to June 2025. The protocol was preregistered on the OSF platform. Studies were screened using standardized inclusion and exclusion criteria. Key study characteristics and performance evaluation metrics were extracted. Methodological quality was assessed using tools such as Cochrane RoB, STROBE, STARD, and DISCERN. Additionally, the CLEAR-LLM and CliMA-10 frameworks were applied to provide complementary evaluations of quality and performance.

Results: A total of 16 studies were included, covering various LLMs such as ChatGPT, Gemini, and Claude. Application scenarios encompassed text generation, imaging diagnostics, and patient education. Most studies were observational in nature, and overall methodological quality ranged from moderate to high. Based on CliMA-10 scores, LLMs exhibited upper-moderate performance in KOA-related tasks. The ChatGPT-4 series consistently outperformed other models, especially in structured output generation, interpretation of clinical terminology, and content accuracy. Key limitations included insufficient sample representativeness, inconsistent control over hallucinated content, and the lack of standardized evaluation tools.

Conclusion: Large language models show notable potential in the KOA field, but their clinical application is still exploratory and limited by issues such as sample bias and methodological heterogeneity. Model performance varies across tasks, underscoring the need for improved prompt design and standardized evaluation frameworks. With real-world data and ethical oversight, LLMs may contribute more significantly to personalized KOA management.

Systematic review registration: https://osf.io/jy4kz, identifier 10.17605/OSF.IO/479R8.

Keywords: ChatGPT; artificial intelligence; clinical decision support; knee osteoarthritis; large language models; systematic review.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Flowchart of the study selection process.
FIGURE 2
Risk assessment of CLEAR-LLM. R1–16, research 1–16; A, clarity of research objectives; B, control design; C, data sources and transparency; D, model description; E, prompt design; F, role of human evaluators; G, output evaluation and quantification; H, patient-relevance indicators; I, sample size and representativeness; J, bias control; K, ethical considerations; L, discussion of limitations.
FIGURE 3
Performance heatmap of LLMs in KOA in current research. Higher values indicate better performance. R1–16, research 1–16; a, accuracy of medical content; b, contextual coherence; c, interpretability of medical terminology; d, clinical usefulness; e, hallucination control; f, safety and ethical compliance; g, structured output; h–j, flexible dimensions (detailed in Supplementary Materials); k, overall composite score. ChatGPT-4*, LLM fine-tuned based on KOA knowledge.
