Evaluation of the quality and quantity of artificial intelligence-generated responses about anesthesia and surgery: using ChatGPT 3.5 and 4.0
- PMID: 39055693
- PMCID: PMC11269144
- DOI: 10.3389/fmed.2024.1400153
Abstract
Introduction: The large-scale artificial intelligence (AI) language model chatbot, Chat Generative Pre-Trained Transformer (ChatGPT), is renowned for its ability to provide data quickly and efficiently. This study aimed to assess the medical responses of ChatGPT regarding anesthetic procedures.
Methods: Two anesthesiologist authors selected 30 questions representing inquiries patients might have about surgery and anesthesia. These questions were entered into two versions of ChatGPT in English. A total of 31 anesthesiologists then evaluated each response for quality, quantity, and overall assessment, using 5-point Likert scales. Descriptive statistics summarized the scores, and a paired-sample t-test compared ChatGPT 3.5 and 4.0.
Results: Regarding quality, "appropriate" was the most common rating for both ChatGPT 3.5 and 4.0 (40% and 48%, respectively). For quantity, responses were deemed "insufficient" in 59% of cases for 3.5 and "adequate" in 69% for 4.0. In the overall assessment, 3 points was the most common score for 3.5 (36%), while 4 points predominated for 4.0 (42%). Mean quality scores were 3.40 and 3.73, and mean quantity scores were -0.31 (between insufficient and adequate) and 0.03 (between adequate and excessive), respectively. The mean overall score was 3.21 for 3.5 and 3.67 for 4.0. Responses from 4.0 showed statistically significant improvement in all three areas.
Conclusion: ChatGPT generated responses mostly ranging from appropriate to slightly insufficient, providing an overall average amount of information. Version 4.0 outperformed 3.5, and further research is warranted to investigate the potential utility of AI chatbots in assisting patients with medical information.
Keywords: AI chatbot; ChatGPT; artificial intelligence; quality; quantity.
Copyright © 2024 Choi, Oh, Park, Kang, Yoo, Lee and Yang.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
