Assessing the adherence of large language models to clinical practice guidelines in Chinese medicine: a content analysis

Weilong Zhao et al. Front Pharmacol. 2025 Jul 25;16:1649041. doi: 10.3389/fphar.2025.1649041. eCollection 2025.

Abstract

Objective: Whether large language models (LLMs) can effectively facilitate Chinese medicine (CM) knowledge acquisition remains uncertain. This study aims to assess the adherence of LLMs to clinical practice guidelines (CPGs) in CM.

Methods: This cross-sectional study randomly selected ten CPGs in CM and constructed 150 questions across three categories: medication based on differential diagnosis (MDD), specific prescription consultation (SPC), and CM theory analysis (CTA). Eight LLMs (GPT-4o, Claude-3.5 Sonnet, Moonshot-v1, ChatGLM-4, DeepSeek-v3, DeepSeek-r1, Claude-4 Sonnet, and Claude-4 Sonnet Thinking) were evaluated using both English and Chinese queries. The main evaluation metrics were accuracy, readability, and the use of safety disclaimers.
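
To make the design concrete, a minimal sketch of the evaluation loop is shown below. This is not the authors' code: query_model stands in for each vendor's chat API and rate_against_guideline for the expert scoring of an answer against its source CPG; both are hypothetical placeholders, and the 1-5 rating scale is assumed.

from statistics import median

MODELS = ["GPT-4o", "Claude-3.5 Sonnet", "Moonshot-v1", "ChatGLM-4",
          "DeepSeek-v3", "DeepSeek-r1", "Claude-4 Sonnet", "Claude-4 Sonnet Thinking"]

def query_model(model, question, language):
    # Hypothetical placeholder for a call to the given LLM's chat API.
    return "model answer"

def rate_against_guideline(answer, question):
    # Hypothetical placeholder for expert scoring against the source CPG (assumed 1-5 scale).
    return 5.0

def evaluate(questions):
    # questions: dicts with "text", "category" ("MDD"/"SPC"/"CTA"), and "language" ("en"/"zh").
    scores = {}
    for model in MODELS:
        for q in questions:
            answer = query_model(model, q["text"], q["language"])
            scores.setdefault((model, q["language"]), []).append(
                rate_against_guideline(answer, q))
    # Summarize each model-language cell by its median accuracy rating.
    return {key: median(vals) for key, vals in scores.items()}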

Results: Overall, DeepSeek-v3 and DeepSeek-r1 demonstrated superior performance in both English (median 5.00, interquartile range (IQR) 4.00-5.00 vs. median 5.00, IQR 3.70-5.00) and Chinese (both median 5.00, IQR 4.30-5.00), significantly outperforming all other models. All models achieved significantly higher accuracy for Chinese than for English responses (all p < 0.05). Accuracy also varied significantly across question categories, with MDD and SPC questions proving more challenging than CTA questions. English responses were less readable (mean Flesch Reading Ease score 32.7) than Chinese responses. Moonshot-v1 provided safety disclaimers at the highest rate (98.7% of English and 100% of Chinese responses).
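
For context, the Flesch Reading Ease (FRE) score referenced above is conventionally computed with the standard formula below (not restated in the abstract); higher values indicate easier text, and a mean of 32.7 falls in the band usually labeled "difficult" (roughly college-level reading).

FRE = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words)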

Conclusion: LLMs showed varying degrees of potential for supporting CM knowledge acquisition, and the performance of DeepSeek-v3 and DeepSeek-r1 was satisfactory. Optimizing LLMs to become effective tools for disseminating CM information is an important direction for future development.

Keywords: Chinese medicine; clinical practice guideline; comparison; knowledge acquisition; large language model.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1. Flow diagram of the study process. CM, Chinese Medicine; LLMs, large language models.

FIGURE 2. Mean accuracy scores for large language models in English and Chinese responses. Bars represent mean scores, and error bars indicate standard deviation.

FIGURE 3. Comparison of the scores for the different categories of questions in English (a) and Chinese (b). MDD, Medication based on Differential Diagnosis; SPC, Specific Prescription Consultation; CTA, CM Theory Analysis. Each axis represents the mean accuracy score for a specific question category. The area covered by each LLM’s polygon indicates its overall performance across categories.

FIGURE 4. Number of safety disclaimers included in the LLMs’ responses. LLMs, large language models.
