PLoS One. 2024 Apr 16;19(4):e0301854. doi: 10.1371/journal.pone.0301854. eCollection 2024.

ChatGPT provides inconsistent risk-stratification of patients with atraumatic chest pain

Thomas F Heston et al.

Abstract

Background: ChatGPT-4 is a large language model with promising healthcare applications, but its ability to analyze complex clinical data and provide consistent results is poorly characterized. This study evaluated ChatGPT-4's risk stratification of simulated patients with acute nontraumatic chest pain against validated risk-stratification tools.

Methods: Three datasets of simulated case studies were created: one based on the TIMI score variables, one based on the HEART score variables, and a third comprising 44 randomized variables related to nontraumatic chest pain presentations. ChatGPT-4 independently scored each dataset five times. Its risk scores were compared to the calculated TIMI and HEART scores, and the consistency of its five scoring runs was evaluated on the 44-variable dataset.
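For reference, a minimal sketch of the two benchmark calculations against which the ChatGPT-4 estimates were compared is shown below. The dictionary keys are hypothetical placeholders rather than the variable names used in the study's datasets; the point assignments follow the published TIMI (UA/NSTEMI) and HEART definitions.

```python
# Illustrative only (not the study's code): the TIMI (UA/NSTEMI) and HEART
# benchmark scores computed from one simulated case. Dictionary keys are
# hypothetical placeholders for the simulated clinical variables.

def timi_score(case: dict) -> int:
    """TIMI risk score: one point per criterion, range 0-7."""
    return sum([
        case["age"] >= 65,
        case["cad_risk_factors"] >= 3,          # HTN, lipids, diabetes, smoking, family history
        case["known_cad_stenosis_ge_50pct"],    # prior stenosis >= 50%
        case["aspirin_use_past_7_days"],
        case["angina_episodes_24h"] >= 2,
        case["st_deviation_ge_0_5mm"],
        case["elevated_cardiac_markers"],
    ])

def heart_score(case: dict) -> int:
    """HEART score: five components each scored 0-2, range 0-10."""
    age_pts = 0 if case["age"] < 45 else (1 if case["age"] < 65 else 2)
    return (case["history_points"]        # 0 slightly, 1 moderately, 2 highly suspicious
            + case["ecg_points"]          # 0 normal, 1 nonspecific, 2 significant ST deviation
            + age_pts
            + case["risk_factor_points"]  # 0 none, 1 one or two, 2 three+ or known atherosclerosis
            + case["troponin_points"])    # 0 <= normal limit, 1 1-3x limit, 2 > 3x limit
```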

Results: ChatGPT-4 showed a high correlation with TIMI and HEART scores (r = 0.898 and 0.928, respectively), but the distribution of its individual risk assessments was broad: for a fixed TIMI or HEART score, ChatGPT-4 assigned a different risk 45-48% of the time. On the 44-variable dataset, a majority of the five ChatGPT-4 models agreed on a diagnosis category only 56% of the time, and their risk scores were poorly correlated (r = 0.605).
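A minimal sketch of the consistency measures described here is shown below, assuming scores and diagnosis labels collected from five independent runs and "majority" taken as at least three of five; the function names and data layout are illustrative, not the authors' analysis code.

```python
# Illustrative only: Pearson correlation between two score vectors and the
# rate at which a majority (>= 3 of 5) of independent runs agree on a
# diagnosis category. The data layout is a hypothetical stand-in.
import numpy as np
from collections import Counter

def pearson_r(x, y):
    """Pearson correlation coefficient between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def majority_agreement_rate(diagnoses_per_run):
    """Fraction of cases for which at least 3 of the 5 runs assign the
    same diagnosis category (a simple majority)."""
    runs = np.asarray(diagnoses_per_run)          # shape: (5 runs, n cases)
    agreements = 0
    for case_labels in runs.T:                    # iterate over cases
        top_count = Counter(case_labels).most_common(1)[0][1]
        agreements += top_count >= 3
    return agreements / runs.shape[1]
```

In this framing, the reported r = 0.605 would correspond to pearson_r applied to paired run scores, and the 56% figure to majority_agreement_rate over the five diagnosis vectors.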

Conclusion: While ChatGPT-4's mean scores correlate closely with established risk-stratification tools, its inconsistency when presented with identical patient data on separate occasions raises concerns about its reliability. The findings suggest that while large language models like ChatGPT-4 hold promise for healthcare applications, further refinement and customization are necessary, particularly for the clinical risk assessment of patients with atraumatic chest pain.


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Histogram of TIMI and ChatGPT-4 scores. Both the TIMI and ChatGPT-4 scores demonstrated a normal distribution; however, the distribution of ChatGPT-4 scores was broader than that of the TIMI scores.

Fig 2. Comparison of TIMI with ChatGPT-4. The correlation between TIMI risk scores and ChatGPT-4 risk estimates over 5 simulated runs. While the overall correlation was high (R-squared = 0.806), ChatGPT-4's scores showed broad variability across the distribution relative to the TIMI benchmark standard.

Fig 3. Histogram of HEART and ChatGPT-4 scores. Both the HEART and ChatGPT-4 scores exhibited a normal distribution; however, the distribution of ChatGPT-4 scores was broader than that of the HEART scores.

Fig 4. Comparison of HEART with ChatGPT-4. The correlation between HEART risk scores and ChatGPT-4 scores over 5 simulated runs. While the overall correlation was high (R-squared = 0.861), ChatGPT-4 showed broad score variability across the distribution relative to the HEART benchmark standard.

Fig 5. Individual model scores compared to average scores for the history and physical-only dataset. Individual model scores correlated poorly with the average ChatGPT-4 score, consistent with wide variation between the models.

Fig 6. Model agreement on the most likely diagnosis category. At least two of the five models nearly always reached a consensus on the diagnostic category, but agreement among all five models was rare.
