Performance evaluation of large language models in pediatric nephrology clinical decision support: a comprehensive assessment
- PMID: 40461786
- DOI: 10.1007/s00467-025-06819-w
Abstract
Background: Large language models (LLMs) have emerged as potential tools in health care following advancements in artificial intelligence. Despite promising applications across multiple medical specialties, limited research exists regarding LLM implementation in pediatric nephrology. This study evaluates the performance of contemporary LLMs in supporting clinical decision-making processes for practicing pediatric nephrologists.
Methods: Ten comprehensive clinical cases covering various aspects of pediatric nephrology were designed and validated by experts on the basis of international guidelines. Each case comprised questions addressing diagnosis, laboratory and imaging investigations, treatment, and clinical reasoning. Ten LLMs were assessed, including generalist models (Claude, ChatGPT, Gemini, DeepSeek, Mistral, Copilot, Perplexity, Phi 4) and a specialized model (Phi 4 Nomic) fine-tuned with retrieval-augmented generation using validated pediatric nephrology materials. Performance was evaluated on accuracy, personalization, internal contradictions, hallucinations, and potentially dangerous decisions.
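For illustration only, the sketch below shows one way the case-level expert grades on the stated criteria (accuracy, personalization, hallucinations, contradictions, dangerous decisions) could be aggregated into per-model metrics. The paper does not publish its scoring code; all class names, fields, and numbers here are hypothetical assumptions, not the authors' method.

```python
# Minimal, illustrative aggregation of expert case grades per model.
# All structures and example values are assumptions for exposition only.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class CaseEvaluation:
    """Expert grading of one model's answers to one clinical case."""
    correct_answers: int      # questions answered correctly
    total_questions: int      # questions in the case
    personalized: bool        # answer tailored to the specific patient
    hallucinations: int       # fabricated facts or references
    contradictions: int       # internal inconsistencies
    dangerous_decisions: int  # potentially life-threatening recommendations

@dataclass
class ModelReport:
    name: str
    cases: list = field(default_factory=list)

    def accuracy(self) -> float:
        return 100 * sum(c.correct_answers for c in self.cases) / sum(c.total_questions for c in self.cases)

    def personalization(self) -> float:
        return 100 * mean(1.0 if c.personalized else 0.0 for c in self.cases)

    def hallucination_count(self) -> int:
        return sum(c.hallucinations for c in self.cases)

    def dangerous_count(self) -> int:
        return sum(c.dangerous_decisions for c in self.cases)

# Hypothetical usage with two made-up cases for one model:
report = ModelReport("example-llm")
report.cases.append(CaseEvaluation(5, 6, True, 1, 0, 0))
report.cases.append(CaseEvaluation(4, 6, False, 0, 1, 1))
print(f"accuracy={report.accuracy():.1f}%  personalization={report.personalization():.0f}%")
print(f"hallucinations={report.hallucination_count()}  dangerous={report.dangerous_count()}")
```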
Results: Overall accuracy ranged from 50.8% (Gemini) to 86.9% (Claude), with a mean of 66.24%. Claude significantly outperformed the other models (p = 0.01). Personalization scores varied between 50% (ChatGPT) and 85% (Claude). All models exhibited hallucinations (2-8 occurrences) and potentially life-threatening decisions (0-2 occurrences). Domain-specific fine-tuning improved performance across all clinical criteria without enhancing reasoning capabilities. Performance variability was minimal, with higher-performing models demonstrating greater consistency.
Conclusions: While certain LLMs demonstrate promising accuracy in pediatric nephrology applications, persistent challenges including hallucinations and potentially dangerous recommendations preclude autonomous clinical implementation. LLMs may currently serve supportive roles in repetitive tasks, but they should be used under strict supervision in clinical practice. Future advancements addressing hallucination mitigation and interpretability are necessary before broader clinical integration.
Keywords: Artificial intelligence; ChatGPT; Large language model; Machine learning; Nephrology; Pediatrics.
© 2025. The Author(s), under exclusive licence to International Pediatric Nephrology Association.
Conflict of interest statement
Declarations. Ethics approval: Not applicable; no human participants were involved. Informed consent: Not applicable; no human participants were involved. Consent for publication: The authors agree to the publication of this manuscript in Pediatric Nephrology. Competing interests: The authors declare no competing interests.