Performance evaluation of large language models in pediatric nephrology clinical decision support: a comprehensive assessment
- PMID: 40461786
- DOI: 10.1007/s00467-025-06819-w
Abstract
Background: Large language models (LLMs) have emerged as potential tools in health care following advancements in artificial intelligence. Despite promising applications across multiple medical specialties, limited research exists regarding LLM implementation in pediatric nephrology. This study evaluates the performance of contemporary LLMs in supporting clinical decision-making processes for practicing pediatric nephrologists.
Methods: Ten comprehensive clinical cases covering various aspects of pediatric nephrology were designed and validated by experts on the basis of international guidelines. Each case comprised questions addressing diagnosis, laboratory and imaging investigations, treatment, and clinical reasoning. Ten LLMs were assessed, including generalist models (Claude, ChatGPT, Gemini, DeepSeek, Mistral, Copilot, Perplexity, Phi 4) and a specialized model (Phi 4 Nomic) fine-tuned with retrieval-augmented generation using validated pediatric nephrology materials. Performance was evaluated on accuracy, personalization, internal contradictions, hallucinations, and potentially dangerous decisions.
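For illustration only, the sketch below shows one way the case-level expert grades on the stated criteria (accuracy, personalization, hallucinations, contradictions, dangerous decisions) could be aggregated into per-model metrics. The paper does not publish its scoring code; all class names, fields, and numbers here are hypothetical assumptions, not the authors' method.

```python
# Minimal, illustrative aggregation of expert case grades per model.
# All structures and example values are assumptions for exposition only.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class CaseEvaluation:
    """Expert grading of one model's answers to one clinical case."""
    correct_answers: int      # questions answered correctly
    total_questions: int      # questions in the case
    personalized: bool        # answer tailored to the specific patient
    hallucinations: int       # fabricated facts or references
    contradictions: int       # internal inconsistencies
    dangerous_decisions: int  # potentially life-threatening recommendations

@dataclass
class ModelReport:
    name: str
    cases: list = field(default_factory=list)

    def accuracy(self) -> float:
        return 100 * sum(c.correct_answers for c in self.cases) / sum(c.total_questions for c in self.cases)

    def personalization(self) -> float:
        return 100 * mean(1.0 if c.personalized else 0.0 for c in self.cases)

    def hallucination_count(self) -> int:
        return sum(c.hallucinations for c in self.cases)

    def dangerous_count(self) -> int:
        return sum(c.dangerous_decisions for c in self.cases)

# Hypothetical usage with two made-up cases for one model:
report = ModelReport("example-llm")
report.cases.append(CaseEvaluation(5, 6, True, 1, 0, 0))
report.cases.append(CaseEvaluation(4, 6, False, 0, 1, 1))
print(f"accuracy={report.accuracy():.1f}%  personalization={report.personalization():.0f}%")
print(f"hallucinations={report.hallucination_count()}  dangerous={report.dangerous_count()}")
```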
Results: Overall accuracy ranged from 50.8% (Gemini) to 86.9% (Claude), with a mean of 66.24%. Claude significantly outperformed the other models (p = 0.01). Personalization scores varied between 50% (ChatGPT) and 85% (Claude). All models exhibited hallucinations (2-8 occurrences) and potentially life-threatening decisions (0-2 occurrences). Domain-specific fine-tuning improved performance across all clinical criteria without enhancing reasoning capabilities. Performance variability was minimal, with higher-performing models demonstrating greater consistency.
Conclusions: While certain LLMs demonstrate promising accuracy in pediatric nephrology applications, persistent challenges including hallucinations and potentially dangerous recommendations preclude autonomous clinical implementation. LLMs may currently serve supportive roles in repetitive tasks, but they should be used under strict supervision in clinical practice. Future advancements addressing hallucination mitigation and interpretability are necessary before broader clinical integration.
Keywords: Artificial intelligence; ChatGPT; Large language model; Machine learning; Nephrology; Pediatrics.
© 2025. The Author(s), under exclusive licence to International Pediatric Nephrology Association.
Conflict of interest statement
Declarations. Ethics approval: Not applicable; no human participants were involved. Informed consent: Not applicable; no human participants were involved. Consent for publication: The authors agree to the publication of this manuscript in Pediatric Nephrology. Competing interests: The authors declare no competing interests.