PLoS One. 2025 Jun 13;20(6):e0325982.
doi: 10.1371/journal.pone.0325982. eCollection 2025.

The sports nutrition knowledge of large language model (LLM) artificial intelligence (AI) chatbots: An assessment of accuracy, completeness, clarity, quality of evidence, and test-retest reliability

Thomas P J Solomon et al. PLoS One.

Abstract

Background: Generative artificial intelligence (AI) chatbots are increasingly used in various domains, including sports nutrition. Despite their growing popularity, there is limited evidence on the accuracy, completeness, clarity, evidence quality, and test-retest reliability of AI-generated sports nutrition advice. This study evaluates the performance of the basic and advanced models of ChatGPT, Gemini, and Claude across these metrics to determine their utility in providing sports nutrition information.

Materials and methods: Two experiments were conducted. In Experiment 1, chatbots were tested with simple and detailed prompts in two domains: Sports nutrition for training and Sports nutrition for racing. The intraclass correlation coefficient (ICC) was used to assess interrater agreement, and chatbot performance was evaluated by measuring accuracy, completeness, clarity, evidence quality, and test-retest reliability. In Experiment 2, chatbot performance was evaluated by measuring the accuracy and test-retest reliability of chatbots' answers to multiple-choice questions based on a sports nutrition certification exam. ANOVAs and logistic mixed models were used to analyse chatbot performance.
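The abstract reports interrater agreement as an ICC but does not state which ICC form was used. As an illustration only, the sketch below implements ICC(2,1) — a two-way random-effects, absolute-agreement, single-rater model that is a common choice for two raters; the function name and the choice of ICC form are assumptions, not the authors' analysis code.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects x k_raters) array of scores.
    Illustrative only; the ICC form used in the paper is not stated
    in the abstract.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)  # per-subject means
    col_means = x.mean(axis=0)  # per-rater means

    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between-subject
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between-rater
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Perfectly agreeing raters yield an ICC of 1.0; systematic or random disagreement pulls the value toward 0.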

Results: In Experiment 1, interrater agreement was good (ICC = 0.893) and accuracy ranged from 74% (Gemini1.5pro) to 31% (ClaudePro). Detailed prompts improved Claude's accuracy but had little impact on ChatGPT or Gemini. Completeness scores were highest for ChatGPT-4o; the other chatbots scored low to moderate. The quality of cited evidence was low for all chatbots when simple prompts were used but improved with detailed prompts. In Experiment 2, accuracy ranged from 89% (Claude3.5Sonnet) to 61% (ClaudePro). Test-retest reliability was acceptable across all metrics in both experiments.

Conclusions: While generative AI chatbots demonstrate potential in providing sports nutrition guidance, their accuracy is moderate at best and inconsistent between models. Until significant advancements are made, athletes and coaches should consult registered dietitians for tailored nutrition advice.

Conflict of interest statement

TPJS has given invited talks at societal conferences and university/pharmaceutical symposia for which the organisers paid for travel and accommodation; he has also received research money from publicly funded national research councils, medical charities, and private companies, including the Novo Nordisk Foundation, AstraZeneca, Amylin, the AP Møller Foundation, and the Augustinus Foundation; he has consulted for Boost Treadmills, GU Energy, and Examine.com; and he owns a consulting business, Blazon Scientific, and an endurance-athlete education business, Veohtu. These companies have had no control over the research design, data analysis, or publication outcomes of this work. MJL has given invited talks at societal conferences and university symposia and meetings for which the organisers paid for travel and accommodation; he has received research money from the Augustinus Foundation, the American College of Sports Medicine, and national research institutions; he has consulted for Zepp Health, Levels Health, GU Energy, and EAB labs; and he has coached for Sharman Ultra Coaching. These companies have had no control over the research design, data analysis, or publication outcomes of this work. My Sports Dietitian provided a set of multiple-choice questions designed to resemble the Certified Specialist in Sports Dietetics (CSSD) board exam. Neither TPJS nor MJL has any financial relationship with My Sports Dietitian. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Figures

Fig 1
Fig 1. Accuracy scores among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall accuracy scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the accuracy scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the accuracy scores in the Racing domain. Bars represent the mean of accuracy scores for each criterion and error bars represent the standard deviation (SD). ANOVA revealed a significant main effect of Chatbot ID where accuracy scores for ChatGPT-4omini (comparison “a”: p = 0.008, d = 0.549), ChatGPT-4o (“b”: p < 0.001, d = 0.796), and Gemini1.5pro (“c”: p < 0.001, d = 0.752) were greater than ClaudePro, and the accuracy scores for ChatGPT-4o (“d”: p = 0.008, d = 0.546) and Gemini1.5pro (“e”: p = 0.02, d = 0.502) were greater than Claude3.5Sonnet.
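The effect sizes quoted in this caption (d = 0.549, 0.796, etc.) are Cohen's d values. For readers unfamiliar with the metric, here is a minimal sketch of the standard pooled-SD formula for two independent groups; this is an illustration only, not the authors' analysis code, and the function name is hypothetical.

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled SD.

    Illustrative implementation of the standard formula:
    d = (mean_a - mean_b) / pooled_sd.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Sample variances (Bessel-corrected)
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (mean_a - mean_b) / pooled_sd
```

By a common rule of thumb, d around 0.5 is a medium effect and d around 0.8 a large one, which matches the caption's description of the between-chatbot differences as meaningful.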
Fig 2
Fig 2. Accuracy criteria in the Training domain among different chatbots on the two test days in Experiment 1.
The accuracy scores for all criteria measured in the Sports Nutrition for Training domain. Panels [A] and [B] show accuracy for the Energy availability criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Daily carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Daily protein intake criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Post-session carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the Post-session protein intake criterion for the Simple and Detailed prompts, respectively. Panels [K] and [L] show accuracy for the Hydration criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be made.
Fig 3
Fig 3. Accuracy criteria in the Training domain among different chatbots on the two test days in Experiment 1 (continued).
The accuracy scores for all criteria measured in the Sports Nutrition for Training domain. Panels [A] and [B] show accuracy for the Supplements criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Individualisation criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Disclaimer criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be made.
Fig 4
Fig 4. Accuracy criteria in the Racing domain among different chatbots on the two test days in Experiment 1.
The accuracy scores for all criteria measured in the Sports Nutrition for Racing domain. Panels [A] and [B] show accuracy for the Daily carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Daily food examples criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Pre-race carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Pre-race food examples criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the During-race carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [K] and [L] show accuracy for the During-race food examples criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be made.
Fig 5
Fig 5. Accuracy criteria in the Racing domain among different chatbots on the two test days in Experiment 1 (continued).
The accuracy scores for all criteria measured in the Sports Nutrition for Racing domain. Panels [A] and [B] show accuracy for the Pre-race hydration criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the During-race hydration criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Supplements criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Individualisation criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the Disclaimer criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be made.
Fig 6
Fig 6. Completeness among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall completeness scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the completeness scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the completeness scores in the Racing domain. Completeness in each domain was rated on a Likert scale of 1–3; therefore, overall completeness had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 7
Fig 7. Clarity among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall clarity scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the clarity scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the clarity scores in the Racing domain. Clarity in each domain was rated on a Likert scale of 1–3; therefore, overall clarity had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 8
Fig 8. The quality of cited evidence among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall quality of cited evidence scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the quality of cited evidence scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the quality of cited evidence scores in the Racing domain. The quality of cited evidence in each domain was rated on a Likert scale of 1–3; therefore, overall evidence quality had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 9
Fig 9. The quality of additional information among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall quality of additional information scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the quality of additional information scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the quality of additional information scores in the Racing domain. The quality of additional information in each domain was rated on a Likert scale of 1–5; therefore, overall additional information quality had a maximum Likert score of 10. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 10
Fig 10. The proportion of correct answers to exam questions among different chatbots on the two test days in Experiment 2.
Bars represent the proportion of correct answers and error bars represent the standard error (SE). Claude3.5Sonnet scored a higher proportion of correct answers to the exam than ChatGPT-4omini (comparison “a”: p = 0.0001; r = −0.976, 95%CI −0.997 to −0.792), ChatGPT-4o (“b”: p = 0.04; r = −0.948, 95%CI −0.994 to −0.589), ClaudePro (“c”: p < 0.0001; r = −0.983, 95%CI −0.998 to −0.846), and Gemini1.5flash (“d”: p < 0.0001; r = −0.978, 95%CI −0.998 to −0.807). Gemini1.5pro also scored a higher proportion of correct answers than ChatGPT-4omini (“e”: p = 0.004; r = −0.964, 95%CI −0.700 to 0.952), ClaudePro (“f”: p = 0.0001; r = −0.976, 95%CI −0.998 to −0.793), and Gemini1.5flash (“g”: p = 0.002; r = −0.967, 95%CI −0.997 to −0.726). The proportion of correct answers was not different between test days.
