J Invasive Cardiol. 2025 Nov 21.
doi: 10.25270/jic/25.00104. Online ahead of print.

Performance of large language models in interventional cardiology: the ILLUMINATE blinded model-comparison study

Free article

Attilio Lauretti et al. J Invasive Cardiol.

Abstract

Objectives: Large language models (LLMs) have the potential to assist in complex decision making for interventional cardiology (IC). However, their comparative performance in providing clinical recommendations remains uncertain. In this blinded model‑comparison study, the authors evaluated and compared the quality of recommendations produced by 6 LLMs for complex IC cases.

Methods: Twenty detailed and complex clinical cases focusing on coronary artery disease (n=10) and structural heart disease (n=10) were developed. Six LLMs were tested: default ChatGPT (ChatGPTd), ChatGPT with European Society of Cardiology guidelines (ChatGPT-gl), ChatGPT with internet search enabled (ChatGPTi), Gemini (Google), Mistral 7B (Mistral AI), and Perplexity AI (Perplexity AI, Inc.). To ensure blinding, the outputs were anonymized and only their ordering was randomized. Five expert interventional cardiologists independently assessed the anonymized and randomized responses on a 0 to 10 scale for appropriateness, accuracy, relevance, clarity, and clinical utility, from which a composite score was generated. Statistical analysis was performed using a mixed linear model.
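As an illustration of the scoring and blinding steps described in the Methods, the following is a minimal sketch only. The abstract does not state how the composite score was calculated or how responses were labeled; the unweighted mean across the 5 dimensions, the "Response A" style labels, and the function names `composite_score` and `blinded_order` are all assumptions made for this example.

```python
import random
from statistics import mean

# The five dimensions named in the Methods, each rated 0 to 10.
DIMENSIONS = ("appropriateness", "accuracy", "relevance", "clarity", "clinical_utility")


def composite_score(ratings):
    """Combine one rater's five dimension ratings into a single score.

    ASSUMPTION: an unweighted mean; the abstract does not give the formula.
    """
    for dim in DIMENSIONS:
        if dim not in ratings:
            raise ValueError(f"missing rating for {dim}")
        if not 0 <= ratings[dim] <= 10:
            raise ValueError(f"rating for {dim} is outside the 0 to 10 scale")
    return mean(ratings[dim] for dim in DIMENSIONS)


def blinded_order(model_names, seed):
    """Shuffle the models and hide them behind neutral labels.

    ASSUMPTION: hypothetical labels "Response A", "Response B", ...;
    returns a label -> model key that would be kept from the raters.
    """
    rng = random.Random(seed)
    shuffled = list(model_names)
    rng.shuffle(shuffled)
    return {f"Response {chr(ord('A') + i)}": name for i, name in enumerate(shuffled)}
```

For example, `composite_score` applied to ratings of 7, 8, 6, 9, and 5 returns 7 under the unweighted-mean assumption.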

Results: Six hundred blinded evaluations (20 cases × 6 models × 5 raters) were analyzed, yielding an overall composite score of 7.1 (95% CI, 7.0-7.2). Performance varied significantly across LLMs (P < .001), with ChatGPTi (7.8 [7.5-8.0]) and ChatGPT-gl (7.7 [7.4-7.9]) outperforming the others. ChatGPTd (6.9 [6.6-7.3]), Mistral 7B (7.0 [6.7-7.3]), and Perplexity AI (7.0 [6.7-7.3]) performed moderately, while Gemini had the lowest score (6.3 [6.0-6.7]). These differences were consistent across all scoring dimensions (P < .001). Case type did not affect LLM performance (P = .900).
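The evaluation count reported in the Results follows from a fully crossed design, in which every rater scores every model's response to every case. The short sketch below enumerates that grid; the integer case and rater identifiers are placeholders chosen for the example, not the study's own identifiers.

```python
from itertools import product

cases = range(1, 21)   # 20 clinical cases (placeholder IDs)
raters = range(1, 6)   # 5 expert raters (placeholder IDs)
models = ["ChatGPTd", "ChatGPT-gl", "ChatGPTi", "Gemini", "Mistral 7B", "Perplexity AI"]

# One evaluation per (case, model, rater) combination.
evaluations = list(product(cases, models, raters))
total_evaluations = len(evaluations)

# Each model is scored once per case per rater.
per_model = sum(1 for _, model, _ in evaluations if model == "Gemini")
```

Under this design, `total_evaluations` is 600 and each model receives 100 evaluations (20 cases × 5 raters).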

Conclusions: LLMs show promise in IC decision making, but their performance remains suboptimal. Maximizing their potential requires systematic integration of web search capabilities and guideline-based knowledge retrieval.

Keywords: artificial intelligence; coronary artery disease; large language models; percutaneous coronary intervention.
