Sci Rep. 2025 May 30;15(1):19028. doi: 10.1038/s41598-025-04309-5.

Evaluating performance of large language models for atrial fibrillation management using different prompting strategies and languages


Zexi Li et al. Sci Rep.

Abstract

This study evaluated large language models (LLMs) using 30 questions, each derived from a recommendation in the 2024 European Society of Cardiology (ESC) guidelines for atrial fibrillation (AF) management. These recommendations were stratified by class of recommendation and level of evidence. The primary objective was to assess the reliability and consistency of LLM-generated classifications compared to those in the ESC guidelines. Additionally, the study assessed the impact of different prompting strategies and working languages on LLM performance. Three prompting strategies were tested: Input-output (IO), 0-shot-Chain of thought (0-COT) and Performed-Chain of thought (P-COT) prompting. Each question, presented in both English and Chinese, was input into three LLMs: ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The reliability of the different LLM-prompt combinations showed moderate to substantial agreement (Fleiss kappa ranged from 0.449 to 0.763). Claude 3.5 with P-COT prompting had the highest recommendation classification consistency (60.3%). No significant differences were observed between English and Chinese across most LLM-prompt combinations. Bias analysis of inconsistent outcomes revealed a propensity towards more recommended treatments and stronger evidence levels across most LLM-prompt combinations. The characteristics of clinical questions potentially influence LLM performance. This study highlights the limitations in the accuracy of LLM responses to AF-related questions. To gather more comprehensive insights, conducting repeated queries is advisable. Future efforts should focus on expanding the use of diverse prompting strategies, conducting ongoing model evaluation and refinement, and establishing a comprehensive, objective benchmarking system.
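The abstract reports inter-rater reliability as Fleiss' kappa (0.449 to 0.763), which measures agreement among repeated LLM queries beyond what chance would produce. The paper does not publish its analysis code; the following is a minimal, self-contained sketch of the standard Fleiss' kappa computation, where `ratings[i][j]` counts how many of the repeated runs assigned question `i` to guideline category `j` (the function name and data layout are illustrative assumptions, not taken from the study):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    raters (here, repeated LLM runs) assigning subject i to category j.
    Assumes an equal number of raters per subject."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    # Mean observed per-subject agreement, P-bar
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_subjects
    # Expected chance agreement, P_e, from the category marginals
    n_categories = len(ratings[0])
    totals = [sum(row[j] for row in ratings) for j in range(n_categories)]
    grand_total = n_subjects * n_raters
    p_e = sum((t / grand_total) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)


# Example: 2 questions, 5 repeated runs each, 2 possible categories.
# Perfect agreement yields kappa = 1.0; mixed answers pull it toward 0.
print(fleiss_kappa([[5, 0], [0, 5]]))  # 1.0
```

On the conventional Landis and Koch scale, the reported range of 0.449 to 0.763 spans "moderate" (0.41 to 0.60) to "substantial" (0.61 to 0.80) agreement, matching the abstract's wording.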

Keywords: Artificial intelligence; Atrial fibrillation; ChatGPT; Large language models; Prompt engineering.


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1: Recommendation classification consistency across different prompts in different models.

Fig. 2: Evidence level and dual-dimension consistency across different prompts in different models.

Fig. 3: Bias analysis in inconsistent outcomes. (A) Analysis of recommendation classifications; (B) Analysis of evidence levels.

Fig. 4: Heatmap of recommendation classification consistency of different LLMs addressing questions at different levels/classes.

