Ophthalmol Sci. 2024 Feb 6;4(4):100485. doi: 10.1016/j.xops.2024.100485. eCollection 2024 Jul-Aug.

A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Large Language Models Alone


Prashant D Tailor et al.

Abstract

Objective: To assess the quality, empathy, and safety of expert-edited large language model (LLM), human expert-created, and LLM-generated responses to common retina patient questions.

Design: Randomized, masked, multicenter study.

Participants: Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.

Methods: Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to the same question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question, along with the anonymized and randomized Expert + AI, Expert, and LLM responses, was evaluated by the experts who had not written an expert response to that question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).

Main outcome measures: Mean quality and empathy scores, and the proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content, for each response type.

Results: A total of 4008 grades were collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (both P < 0.001) among the LLM, Expert, and Expert + AI groups. For quality, Expert + AI (3.86 ± 0.85) performed best overall, while GPT-3.5 (3.75 ± 0.79) was the top-performing LLM. For empathy, GPT-3.5 (3.75 ± 0.69) had the highest mean score, followed by Expert + AI (3.73 ± 0.63). By mean score, Expert ranked fourth of 7 for quality and sixth of 7 for empathy. For both quality (P < 0.001) and empathy (P < 0.001), expert-edited LLM responses outperformed expert-created responses. Editing an LLM response also took less time than creating an expert response from scratch (P = 0.02). ChatGPT-4 performed similarly to Expert for inappropriate content (P = 0.35), missing content (P = 0.001), extent of possible harm (P = 0.356), and likelihood of possible harm (P = 0.129).

Conclusions: In this randomized, masked, multicenter study, LLM responses were comparable with expert-created responses in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.

Financial disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.

Keywords: Artificial intelligence; ChatGPT; Chatbot; Large language model; Retina.


Figures

Figure 1. Kernel density plots of empathy scores by response type. The figure shows the distribution of empathy ratings given by human evaluators to each response type: Generative Pre-trained Transformer (GPT)-3.5, Expert + AI, Bard, GPT-4, Claude, and Expert. Ratings range from 1 (very poor) to 5 (very good), and density represents the frequency of each rating.

Figure 2. Kernel density plots of quality scores by response type. The figure shows the distribution of quality ratings given by human evaluators to each response type: Generative Pre-trained Transformer (GPT)-3.5, Expert + AI, Bard, GPT-4, Claude, and Expert. Ratings range from 1 (very poor) to 5 (very good), and density represents the frequency of each rating.

