BMC Oral Health. 2025 Jul 26;25(1):1242.
doi: 10.1186/s12903-025-06648-1.

Assessing and enhancing the reliability of Chinese large language models in dental implantology


Guohui Zhu et al. BMC Oral Health.

Abstract

Background: This study aimed to evaluate the reliability of five representative Chinese large language models (LLMs) in dental implantology. It also explored effective strategies for model enhancement.

Methods: A dataset of 100 dental implant-related questions (50 multiple-choice and 50 open-ended) was developed, covering medical knowledge, complex reasoning, and safety and ethics. Standard answers were validated by experts. Five LLMs (A: BaiXiaoYing, B: ChatGLM-4, C: ERNIE Bot 3.5, D: Qwen 2.5, E: Kimi.ai) were tested on two metrics: recall and hallucination rate. Two enhancement techniques, chain-of-thought (CoT) reasoning and long-text modeling (LTM), were then applied, and their effectiveness was assessed by comparing the metrics before and after application. Data were analyzed in SPSS: one-way ANOVA with Tukey HSD tests compared recall and hallucination rates across models, and paired t-tests evaluated changes before and after each enhancement strategy.
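The statistical pipeline described above (one-way ANOVA with Tukey HSD across model groups, then a paired t-test for before/after an enhancement strategy) can be sketched with SciPy. The per-question scores below are simulated stand-ins, not the study's data, and the group means are illustrative only:

```python
# Sketch of the described analysis: ANOVA + Tukey HSD across five model
# groups, and a paired t-test for before/after an enhancement strategy.
# All scores here are simulated placeholders, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-question recall for five groups (A-E), 50 questions each
groups = [rng.normal(loc=m, scale=0.05, size=50)
          for m in (0.80, 0.82, 0.85, 0.90, 0.78)]

# One-way ANOVA: do mean recalls differ across the five models?
f_stat, p_anova = stats.f_oneway(*groups)

# Tukey HSD post-hoc test: which specific pairs of models differ?
tukey = stats.tukey_hsd(*groups)

# Paired t-test: the same questions before vs. after an enhancement
baseline = groups[3]                                   # e.g. Group D baseline
after_cot = baseline + rng.normal(0.06, 0.15, size=50) # hypothetical CoT gain
t_stat, p_paired = stats.ttest_rel(after_cot, baseline)

print(f"ANOVA p={p_anova:.4g}, paired t-test p={p_paired:.4g}")
```

Note that `stats.tukey_hsd` (SciPy ≥ 1.8) returns a matrix of pairwise p-values, which is the natural shape for the across-model comparison the Methods describe.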

Results: For multiple-choice questions, Group D (Qwen 2.5) achieved the highest recall at 0.9060 ± 0.0087, while Group C (ERNIE Bot 3.5) had the lowest hallucination rate at 0.1245 ± 0.0022. For open-ended questions, Group D again had the highest recall at 0.7938 ± 0.0216, and Group C again had the lowest hallucination rate at 0.2390 ± 0.0029. Among the enhancement strategies, CoT reasoning improved Group D’s recall by 0.0621 ± 0.1474 (P < 0.05) but produced a non-significant increase in hallucination rate (0.0390 ± 0.1639, P > 0.05). LTM significantly improved recall by 0.1119 ± 0.2000 (P < 0.05) and reduced the hallucination rate by 0.2985 ± 0.4220 (P < 0.05).

Conclusions: Qwen 2.5 and ERNIE Bot 3.5 demonstrated exceptional reliability in dental implantology, excelling in answer accuracy and minimizing misinformation across question types. Open-ended queries carried a higher risk of hallucination than structured multiple-choice tasks, highlighting the need for targeted validation in free-text scenarios. CoT reasoning modestly improved accuracy but at the cost of potential increases in hallucination, while LTM significantly enhanced both accuracy and reliability. These findings underscore LTM’s utility in optimizing large language models for specialized dental applications, balancing depth of reasoning with factual grounding to support clinical decision-making and educational training.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12903-025-06648-1.

Keywords: Chain-of-thought reasoning; Dental implantology; Hallucination rate; Large language models; Long-text modeling; Recall.


Conflict of interest statement

Declarations

Ethics approval and consent to participate: This study did not involve human subjects and did not require institutional review board (IRB) approval. All data used were de-identified and anonymized to protect patient privacy. The use of LLMs complied with the respective providers’ terms of service, including API access agreements and data security protocols. Specifically, no personal health information was input into the models, and all interactions were logged for audit purposes only.

Consent for publication: Not applicable.

Competing interests: The authors declare no competing interests.

Clinical trial number: Not applicable.

Figures

Fig. 1
Flowchart comparing the operational logic of chain-of-thought (CoT) and long-text modeling (LTM) strategies
Fig. 2
Experimental workflow. First, the baseline performance of all five models (Group A: BaiXiaoYing, Group B: ChatGLM-4, Group C: ERNIE Bot 3.5, Group D: Qwen 2.5, Group E: Kimi.ai) was evaluated on the 100 questions. Group D was identified as the top performer and underwent the CoT and LTM interventions. For each strategy, the questions were retested five times in reset dialogue environments to eliminate contextual bias, and the results were compared against baseline metrics using paired t-tests
Fig. 3
Bar chart comparing recall and hallucination rate of the five groups on multiple-choice questions
Fig. 4
Bar chart comparing recall and hallucination rate of the five groups on open-ended questions
