Pediatr Nephrol. 2025 Oct;40(10):3211-3218.
doi: 10.1007/s00467-025-06819-w. Epub 2025 Jun 3.

Performance evaluation of large language models in pediatric nephrology clinical decision support: a comprehensive assessment

Olivier Niel et al. Pediatr Nephrol. 2025 Oct.

Abstract

Background: Large language models (LLMs) have emerged as potential tools in health care following advancements in artificial intelligence. Despite promising applications across multiple medical specialties, limited research exists regarding LLM implementation in pediatric nephrology. This study evaluates the performance of contemporary LLMs in supporting clinical decision-making processes for practicing pediatric nephrologists.

Methods: Ten comprehensive clinical cases covering various aspects of pediatric nephrology were designed and validated by experts based on international guidelines. Each case comprised questions addressing diagnosis, biological/imaging explorations, treatments, and logic. Ten LLMs were assessed, including generalist models (Claude, ChatGPT, Gemini, DeepSeek, Mistral, Copilot, Perplexity, Phi 4) and a specialized model (Phi 4 Nomic) fine-tuned using retrieval-augmented generation with validated pediatric nephrology materials. Performance was evaluated based on accuracy, personalization, internal contradictions, hallucinations, and potentially dangerous decisions.

Results: Overall accuracy ranged from 50.8% (Gemini) to 86.9% (Claude), with a mean of 66.24%. Claude significantly outperformed other models (p = 0.01). Personalization scores varied between 50% (ChatGPT) and 85% (Claude). All models exhibited hallucinations (2-8 occurrences) and potentially life-threatening decisions (0-2 occurrences). Domain-specific fine-tuning improved performance across all clinical criteria without enhancing reasoning capabilities. Performance variability was minimal, with higher performing models demonstrating greater consistency.

Conclusions: While certain LLMs demonstrate promising accuracy in pediatric nephrology applications, persistent challenges including hallucinations and potentially dangerous recommendations preclude autonomous clinical implementation. LLMs may currently serve supportive roles in repetitive tasks, but they should be used under strict supervision in clinical practice. Future advancements addressing hallucination mitigation and interpretability are necessary before broader clinical integration.

Keywords: Artificial intelligence; ChatGPT; Large language model; Machine learning; Nephrology; Pediatrics.

Conflict of interest statement

Declarations. Ethics approval: Not applicable, no human participant. Informed consent: Not applicable, no human participant. Consent for publication: The authors agree with the publication of this manuscript in Pediatric Nephrology. Competing interests: The authors declare no competing interests.
