BMC Oral Health. 2025 Jul 26;25(1):1242.
doi: 10.1186/s12903-025-06648-1.

Assessing and enhancing the reliability of Chinese large language models in dental implantology


Guohui Zhu et al. BMC Oral Health.

Abstract

Background: This study aimed to evaluate the reliability of five representative Chinese large language models (LLMs) in dental implantology. It also explored effective strategies for model enhancement.

Methods: A dataset of 100 dental implant-related questions (50 multiple-choice and 50 open-ended) was developed, covering medical knowledge, complex reasoning, and safety and ethics. Standard answers were validated by experts. Five LLMs (A: BaiXiaoYing, B: ChatGLM-4, C: ERNIE Bot 3.5, D: Qwen 2.5, E: Kimi.ai) were tested on two metrics: recall and hallucination rate. Two enhancement techniques, chain-of-thought (CoT) reasoning and long-text modeling (LTM), were then applied, and their effectiveness was assessed by comparing the metrics before and after application. Data were analyzed in SPSS: one-way ANOVA with Tukey HSD tests compared recall and hallucination rates across models, and paired t-tests evaluated changes before and after each enhancement strategy.
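The statistical pipeline described above (one-way ANOVA with Tukey HSD across model groups, then a paired t-test for before/after an enhancement strategy) can be sketched with SciPy. The per-question scores below are simulated stand-ins, not the study's data, and the group means are illustrative only:

```python
# Sketch of the described analysis: ANOVA + Tukey HSD across five model
# groups, and a paired t-test for before/after an enhancement strategy.
# All scores here are simulated placeholders, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-question recall for five groups (A-E), 50 questions each
groups = [rng.normal(loc=m, scale=0.05, size=50)
          for m in (0.80, 0.82, 0.85, 0.90, 0.78)]

# One-way ANOVA: do mean recalls differ across the five models?
f_stat, p_anova = stats.f_oneway(*groups)

# Tukey HSD post-hoc test: which specific pairs of models differ?
tukey = stats.tukey_hsd(*groups)

# Paired t-test: the same questions before vs. after an enhancement
baseline = groups[3]                                   # e.g. Group D baseline
after_cot = baseline + rng.normal(0.06, 0.15, size=50) # hypothetical CoT gain
t_stat, p_paired = stats.ttest_rel(after_cot, baseline)

print(f"ANOVA p={p_anova:.4g}, paired t-test p={p_paired:.4g}")
```

Note that `stats.tukey_hsd` (SciPy ≥ 1.8) returns a matrix of pairwise p-values, which is the natural shape for the across-model comparison the Methods describe.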

Results: For multiple-choice questions, Group D (Qwen 2.5) achieved the highest recall at 0.9060 ± 0.0087, while Group C (ERNIE Bot 3.5) had the lowest hallucination rate at 0.1245 ± 0.0022. For open-ended questions, Group D again had the highest recall at 0.7938 ± 0.0216, and Group C again had the lowest hallucination rate at 0.2390 ± 0.0029. Among the enhancement strategies, CoT reasoning improved Group D’s recall by 0.0621 ± 0.1474 (P < 0.05) but produced a non-significant increase in hallucination rate (0.0390 ± 0.1639, P > 0.05). LTM significantly improved recall by 0.1119 ± 0.2000 (P < 0.05) and reduced the hallucination rate by 0.2985 ± 0.4220 (P < 0.05).

Conclusions: Qwen 2.5 and ERNIE Bot 3.5 demonstrated exceptional reliability in dental implantology, excelling in answer accuracy and minimizing misinformation across question types. Open-ended queries carried a higher risk of hallucination than structured multiple-choice tasks, highlighting the need for targeted validation in free-text scenarios. CoT reasoning modestly improved accuracy but at the cost of potential increases in hallucination, while LTM significantly enhanced both accuracy and reliability. These findings underscore LTM’s utility in optimizing large language models for specialized dental applications, balancing depth of reasoning with factual grounding to support clinical decision-making and educational training.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12903-025-06648-1.

Keywords: Chain-of-thought reasoning; Dental implantology; Hallucination rate; Large language models; Long-text modeling; Recall.


Conflict of interest statement

Declarations

Ethics approval and consent to participate: This study did not involve human subjects and did not require institutional review board (IRB) approval. All data used were de-identified and anonymized to protect patient privacy. The use of LLMs complied with the respective providers’ terms of service, including API access agreements and data security protocols. Specifically, no personal health information was input into the models, and all interactions were logged for audit purposes only.

Consent for publication: Not applicable.

Competing interests: The authors declare no competing interests.

Clinical trial number: Not applicable.

Figures

Fig. 1
Flowchart comparing the operational logic of chain-of-thought (CoT) and long-text modeling (LTM) strategies
Fig. 2
Experimental workflow. First, the baseline performance of all five models (Group A: BaiXiaoYing, Group B: ChatGLM-4, Group C: ERNIE Bot 3.5, Group D: Qwen 2.5, Group E: Kimi.ai) was evaluated on the 100 questions. Group D was identified as the top performer and underwent the CoT and LTM interventions. For each strategy, the questions were retested five times in reset dialogue environments to eliminate contextual bias, and the results were compared against baseline metrics using paired t-tests
Fig. 3
Bar chart comparing recall and hallucination rate of the five groups on multiple-choice questions
Fig. 4
Bar chart comparing recall and hallucination rate of the five groups on open-ended questions
