Analyses of different prescriptions for health using artificial intelligence: a critical approach based on the international guidelines of health institutions
- PMID: 40832454
- PMCID: PMC12358344
- DOI: 10.1007/s13755-025-00368-0
Abstract
Purpose: Large language models (LLMs) are increasingly used for health advice, but their alignment with evidence-based guidelines and sensitivity to question phrasing remain unclear.
Methods: In May 2025, we evaluated ChatGPT 4.0, ChatGPT 4.5, and DeepSeek V3 using four clinical vignettes: major depression with polysubstance use, irritable bowel syndrome flare, new-onset hypertension requiring exercise counseling, and chronic low back pain. Each scenario was tested with clinician- and patient-style prompts, generating 24 responses. Outputs were benchmarked against 89 guideline-derived recommendations from three authoritative sources per domain. Two blinded reviewers scored concordance (1 = actionable detail, 0.5 = generic mention, 0 = absent), with adjudication by a third reviewer. Inter-rater reliability was measured using Cronbach's α.
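Illustration of the scoring described above: the abstract specifies the per-recommendation rubric (1 = actionable detail, 0.5 = generic mention, 0 = absent) and reports inter-rater reliability as Cronbach's α, but not the computational details. Below is a minimal sketch, assuming concordance is expressed as the percent of the maximum attainable score and α is computed with the classical formula treating each reviewer as an item; the function names and toy scores are illustrative and not taken from the study.

import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    # ratings: (n_recommendations x n_raters) matrix of rubric scores.
    # Classical formula: alpha = k/(k-1) * (1 - sum(item variances) / variance of totals),
    # with each rater treated as one "item".
    k = ratings.shape[1]
    item_var = ratings.var(axis=0, ddof=1)
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var.sum() / total_var)

def concordance_pct(scores: np.ndarray) -> float:
    # scores: one value per guideline recommendation
    # (1.0 = actionable detail, 0.5 = generic mention, 0.0 = absent).
    # Returns percent of the maximum possible score captured by a response.
    return 100 * scores.sum() / len(scores)

# Toy example: two reviewers scoring 10 recommendations for one chatbot response.
reviewer_1 = np.array([1, 1, 0.5, 0, 1, 0.5, 0, 1, 1, 0.5])
reviewer_2 = np.array([1, 1, 0.5, 0, 1, 0.5, 0.5, 1, 1, 0])
consensus = (reviewer_1 + reviewer_2) / 2  # stand-in for third-reviewer adjudication

print(f"Concordance: {concordance_pct(consensus):.1f}%")
print(f"Cronbach's alpha: {cronbach_alpha(np.column_stack([reviewer_1, reviewer_2])):.2f}")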
Results: ChatGPT 4.5 achieved the highest guideline concordance (61.9%), followed by DeepSeek V3 (60.7%) and ChatGPT 4.0 (53.7%). Performance varied by domain, exceeding 67% in mental health but dropping below 45% in nutrition. Prompt phrasing influenced capture rates, with clinician-style prompts improving scores in exercise and pain domains, while patient-style prompts outperformed in nutrition. Reviewer agreement was high (α = 0.97 for chatbot scoring; 0.80 for matrix coding).
Conclusion: LLMs can rapidly generate draft care plans that reflect clinical guidelines, though they favor generic over individualized advice. By introducing a unique, domain-agnostic scoring rubric that aligns AI-generated 30-day care plans with gold-standard guidelines, and by applying it in parallel to mental health, nutrition, exercise, and physical therapy scenarios, our study delivers the first prompt-sensitive audit showing where current LLMs exceed, match, or fall short of multidisciplinary best practices.
Supplementary information: The online version contains supplementary material available at 10.1007/s13755-025-00368-0.
Keywords: Artificial intelligence; Chatbots in healthcare; Digital health; Machine learning; Personalized medicine.
© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2025. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Conflict of interest statement
The authors declare no conflicts of interest.