J Clin Med. 2025 Jun 29;14(13):4599. doi: 10.3390/jcm14134599.

Assessing ChatGPT-v4 for Guideline-Concordant Inflammatory Bowel Disease: Accuracy, Completeness, and Temporal Drift

Oguz Ozturk et al. J Clin Med.

Abstract

Background/Objectives: Chat Generative Pretrained Transformer (ChatGPT) is a useful resource for individuals working in healthcare. This study assessed how accurately ChatGPT-4 answers questions on the diagnosis and treatment of ulcerative colitis (UC) and Crohn's disease (CD) when judged against the guidelines of the European Crohn's and Colitis Organization (ECCO). Methods: A 102-item survey was developed to assess the accuracy and consistency of ChatGPT-4's responses regarding UC and CD. The questionnaire comprised true/false and multiple-choice questions designed to simulate real-life scenarios and to adhere to the ECCO guidelines. Responses were rated on Likert scales. The questions were put to ChatGPT-4 on Day 1, Day 15, and Day 180. Results: The 51 true/false items were stable over the six-month period, with an accuracy of 92.8% at baseline, 92.8% on Day 15, and 98.0% on Day 180, suggesting a negligible effect size. Accuracy on the multiple-choice questions was 90.2% on Day 1, peaked at 92.2% on Day 15, and fell to 84.3% on Day 180; reliability was suboptimal and the effect was likewise negligible. A modest, transient improvement at Day 15 had dissipated by Day 180. Conclusions: ChatGPT-4 demonstrates potential as a clinical decision support system for UC and CD, but its performance varies over time and across tasks.
Essential steps before involving artificial intelligence (AI) technology in IBD trials are routine revalidation, multi-rater comparisons, prompt standardization, and the cultivation of a thorough understanding of the model's limitations.

Keywords: ChatGPT; artificial intelligence; clinical decision support; inflammatory bowel diseases.


Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
Flowchart illustrating ChatGPT-4's answers to 51 binary questions across three time points. Day 0 began with 47 correct (92.2%) and 4 incorrect (7.8%) responses. By Day 15, 45 items remained correct, 2 correct answers became incorrect, 1 incorrect answer became correct, and 3 errors persisted. At Day 180, 50 answers were correct (98.0%) and 1 remained incorrect (2.0%), reflecting an overall net gain despite interim fluctuations.
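Purely illustrative: the transition counts reported in this caption can be checked with a short script (the variable names are mine, not the paper's):

```python
# Reproduce the Figure 1 response-transition accounting from the reported counts.
day0 = {"correct": 47, "incorrect": 4}            # 92.2% / 7.8% of 51 items

# Day 0 -> Day 15 transitions as stated in the caption:
stayed_correct, correct_to_wrong = 45, 2
wrong_to_correct, stayed_wrong = 1, 3

day15_correct = stayed_correct + wrong_to_correct   # 45 + 1 = 46
day15_incorrect = correct_to_wrong + stayed_wrong   # 2 + 3 = 5
assert day15_correct + day15_incorrect == 51        # all 51 items accounted for

day180 = {"correct": 50, "incorrect": 1}            # 98.0% / 2.0%
net_gain = day180["correct"] - day0["correct"]      # +3 despite the interim dip
print(day15_correct, round(100 * day180["correct"] / 51, 1))  # 46 98.0
```

The check confirms the caption's internal consistency: the Day 15 transitions sum back to all 51 items.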
Figure 2
Grouped bar chart of median Likert ratings for "completeness" (3-point scale) and "accuracy" (6-point scale) for ChatGPT-4 on Day 0, Day 15, and Day 180. Yellow, orange, and pink bars indicate the three consecutive evaluations, respectively. The two groups of bars on the right show these scales for the 51 multiple-choice questions; the two groups on the left show the 51 binary (true/false) questions, first completeness and then accuracy. The median scores were stable over time; the only discernible change was an increase in the completeness score of the true/false questions on Day 180. MC: multiple choice.
Figure 3
Flowchart of ChatGPT-4's performance on the 51 multiple-choice questions over time. The initial test showed 46 of 51 correct answers (90.2% accuracy) and five errors (9.8%). On Day 15, the proportion of correct answers peaked at 47 (92.2%). By Day 180, correct responses had fallen to 43 (84.3%) and incorrect answers had risen to 8 (15.7%), indicating that the transient improvement observed on Day 15 had resolved.
Figure 4
Analysis of response instability by subgroup revealed significant differences between question types. Diagnosis questions had the highest deterioration rate at 15.0% (6 of 40 questions), followed by treatment questions at 11.5% (3 of 26). Imaging questions were the most stable, with a deterioration rate of 2.9% (1 of 34).
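As an illustrative aside, the subgroup deterioration rates in this caption follow directly from the stated counts (the dictionary keys below are my own labels):

```python
# Deterioration rate per question subgroup: unstable items / total items.
subgroups = {"diagnosis": (6, 40), "treatment": (3, 26), "imaging": (1, 34)}
rates = {name: round(100 * unstable / total, 1)
         for name, (unstable, total) in subgroups.items()}
print(rates)  # {'diagnosis': 15.0, 'treatment': 11.5, 'imaging': 2.9}
```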
