PLoS One. 2025 Jun 13;20(6):e0325982.
doi: 10.1371/journal.pone.0325982. eCollection 2025.

The sports nutrition knowledge of large language model (LLM) artificial intelligence (AI) chatbots: An assessment of accuracy, completeness, clarity, quality of evidence, and test-retest reliability

Thomas P J Solomon et al. PLoS One.

Abstract

Background: Generative artificial intelligence (AI) chatbots are increasingly used in various domains, including sports nutrition. Despite their growing popularity, there is limited evidence on the accuracy, completeness, clarity, evidence quality, and test-retest reliability of AI-generated sports nutrition advice. This study evaluates the performance of the basic and advanced models of ChatGPT, Gemini, and Claude across these metrics to determine their utility in providing sports nutrition information.

Materials and methods: Two experiments were conducted. In Experiment 1, chatbots were tested with simple and detailed prompts in two domains: Sports nutrition for training and Sports nutrition for racing. The intraclass correlation coefficient (ICC) was used to assess interrater agreement, and chatbot performance was evaluated by measuring accuracy, completeness, clarity, evidence quality, and test-retest reliability. In Experiment 2, chatbot performance was evaluated by measuring the accuracy and test-retest reliability of chatbots' answers to multiple-choice questions based on a sports nutrition certification exam. ANOVAs and logistic mixed models were used to analyse chatbot performance.
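The abstract reports interrater agreement as an ICC but does not state which ICC form was used. As an illustration only, the sketch below implements ICC(2,1) — a two-way random-effects, absolute-agreement, single-rater model that is a common choice for two raters; the function name and the choice of ICC form are assumptions, not the authors' analysis code.

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_subjects x k_raters) array of scores.
    Illustrative only; the ICC form used in the paper is not stated
    in the abstract.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)  # per-subject means
    col_means = x.mean(axis=0)  # per-rater means

    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()  # between-subject
    ss_cols = n * ((col_means - grand) ** 2).sum()  # between-rater
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Perfectly agreeing raters yield an ICC of 1.0; systematic or random disagreement pulls the value toward 0.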

Results: In Experiment 1, interrater agreement was good (ICC = 0.893) and accuracy ranged from 74% (Gemini1.5pro) to 31% (ClaudePro). Detailed prompts improved Claude's accuracy but had little impact on ChatGPT or Gemini. Completeness scores were highest for ChatGPT-4o; the other chatbots scored low to moderate. The quality of cited evidence was low for all chatbots when simple prompts were used but improved with detailed prompts. In Experiment 2, accuracy ranged from 89% (Claude3.5Sonnet) to 61% (ClaudePro). Test-retest reliability was acceptable across all metrics in both experiments.

Conclusions: While generative AI chatbots demonstrate potential in providing sports nutrition guidance, their accuracy is moderate at best and inconsistent between models. Until significant advancements are made, athletes and coaches should consult registered dietitians for tailored nutrition advice.

Conflict of interest statement

TPJS has given invited talks at societal conferences and university/pharmaceutical symposia for which the organisers paid for travel and accommodation; he has also received research money from publicly funded national research councils, medical charities, and private companies, including the Novo Nordisk Foundation, AstraZeneca, Amylin, the AP Møller Foundation, and the Augustinus Foundation; he has consulted for Boost Treadmills, GU Energy, and Examine.com; and he owns a consulting business, Blazon Scientific, and an endurance-athlete education business, Veohtu. These companies have had no control over the research design, data analysis, or publication outcomes of this work. MJL has given invited talks at societal conferences and university symposia and meetings for which the organisers paid for travel and accommodation; he has received research money from the Augustinus Foundation, the American College of Sports Medicine, and national research institutions; he has consulted for Zepp Health, Levels Health, GU Energy, and EAB labs; and he has coached for Sharman Ultra Coaching. These companies have had no control over the research design, data analysis, or publication outcomes of this work. My Sports Dietitian provided a set of multiple-choice questions designed to resemble the Certified Specialist in Sports Dietetics (CSSD) board exam. Neither TPJS nor MJL has any financial relationship with My Sports Dietitian. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Figures

Fig 1
Fig 1. Accuracy scores among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall accuracy scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the accuracy scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the accuracy scores in the Racing domain. Bars represent the mean of accuracy scores for each criterion and error bars represent the standard deviation (SD). ANOVA revealed a significant main effect of Chatbot ID where accuracy scores for ChatGPT-4omini (comparison “a”: p = 0.008, d = 0.549), ChatGPT-4o (“b”: p < 0.001, d = 0.796), and Gemini1.5pro (“c”: p < 0.001, d = 0.752) were greater than ClaudePro, and the accuracy scores for ChatGPT-4o (“d”: p = 0.008, d = 0.546) and Gemini1.5pro (“e”: p = 0.02, d = 0.502) were greater than Claude3.5Sonnet.
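The effect sizes quoted in this caption (d = 0.549, 0.796, etc.) are Cohen's d values. For readers unfamiliar with the metric, here is a minimal sketch of the standard pooled-SD formula for two independent groups; this is an illustration only, not the authors' analysis code, and the function name is hypothetical.

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled SD.

    Illustrative implementation of the standard formula:
    d = (mean_a - mean_b) / pooled_sd.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Sample variances (Bessel-corrected)
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (mean_a - mean_b) / pooled_sd
```

By a common rule of thumb, d around 0.5 is a medium effect and d around 0.8 a large one, which matches the caption's description of the between-chatbot differences as meaningful.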
Fig 2
Fig 2. Accuracy criteria in the Training domain among different chatbots on the two test days in Experiment 1.
The accuracy scores for all criteria measured in the Sports Nutrition for Training domain. Panels [A] and [B] show accuracy for the Energy availability criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Daily carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Daily protein intake criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Post-session carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the Post-session protein intake criterion for the Simple and Detailed prompts, respectively. Panels [K] and [L] show accuracy for the Hydration criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be made.
Fig 3
Fig 3. Accuracy criteria in the Training domain among different chatbots on the two test days in Experiment 1 (continued).
The accuracy scores for all criteria measured in the Sports Nutrition for Training domain. Panels [A] and [B] show accuracy for the Supplements criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Individualisation criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Disclaimer criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be made.
Fig 4
Fig 4. Accuracy criteria in the Racing domain among different chatbots on the two test days in Experiment 1.
The accuracy scores for all criteria measured in the Sports Nutrition for Racing domain. Panels [A] and [B] show accuracy for the Daily carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the Daily food examples criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Pre-race carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Pre-race food examples criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the During-race carbohydrate intake criterion for the Simple and Detailed prompts, respectively. Panels [K] and [L] show accuracy for the During-race food examples criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be made.
Fig 5
Fig 5. Accuracy criteria in the Racing domain among different chatbots on the two test days in Experiment 1 (continued).
The accuracy scores for all criteria measured in the Sports Nutrition for Racing domain. Panels [A] and [B] show accuracy for the Pre-race hydration criterion for the Simple and Detailed prompts, respectively. Panels [C] and [D] show accuracy for the During-race hydration criterion for the Simple and Detailed prompts, respectively. Panels [E] and [F] show accuracy for the Supplements criterion for the Simple and Detailed prompts, respectively. Panels [G] and [H] show accuracy for the Individualisation criterion for the Simple and Detailed prompts, respectively. Panels [I] and [J] show accuracy for the Disclaimer criterion for the Simple and Detailed prompts, respectively. Each criterion score is the median ± interquartile range (IQR) of the two raters’ scores; therefore, no statistical analyses could be made.
Fig 6
Fig 6. Completeness among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall completeness scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the completeness scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the completeness scores in the Racing domain. Completeness in each domain was rated on a Likert scale of 1–3; therefore, overall completeness had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 7
Fig 7. Clarity among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall clarity scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the clarity scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the clarity scores in the Racing domain. Clarity in each domain was rated on a Likert scale of 1–3; therefore, overall clarity had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 8
Fig 8. The quality of cited evidence among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall quality of cited evidence scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the quality of cited evidence scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the quality of cited evidence scores in the Racing domain. The quality of cited evidence in each domain was rated on a Likert scale of 1–3; therefore, overall evidence quality had a maximum Likert score of 6. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 9
Fig 9. The quality of additional information among different chatbots on the two test days in Experiment 1.
Panels [A] and [B] display the overall quality of additional information scores across both domains (Sports Nutrition for Training and Sports Nutrition for Racing) for the Simple and Detailed prompts, respectively. Panels [C] and [D] display the quality of additional information scores in the Training domain for the Simple and Detailed prompts, respectively, while panels [E] and [F] show the quality of additional information scores in the Racing domain. The quality of additional information in each domain was rated on a Likert scale of 1–5; therefore, overall additional information quality had a maximum Likert score of 10. Data represent the median ± interquartile range (IQR) of the two raters’ scores.
Fig 10
Fig 10. The proportion of correct answers to exam questions among different chatbots on the two test days in Experiment 2.
Bars represent the proportion of correct answers and error bars represent the standard error (SE). Claude3.5Sonnet scored a higher proportion of correct answers to the exam than ChatGPT-4omini (comparison “a”: p = 0.0001; r = −0.976, 95%CI −0.997 to −0.792), ChatGPT-4o (“b”: p = 0.04; r = −0.948, 95%CI −0.994 to −0.589), ClaudePro (“c”: p < 0.0001; r = −0.983, 95%CI −0.998 to −0.846), and Gemini1.5flash (“d”: p < 0.0001; r = −0.978, 95%CI −0.998 to −0.807). Gemini1.5pro also scored a higher proportion of correct answers than ChatGPT-4omini (“e”: p = 0.004; r = −0.964, 95%CI −0.700 to 0.952), ClaudePro (“f”: p = 0.0001; r = −0.976, 95%CI −0.998 to −0.793), and Gemini1.5flash (“g”: p = 0.002; r = −0.967, 95%CI −0.997 to −0.726). The proportion of correct answers was not different between test days.
