JAMA Ophthalmol. 2024 Sep 1;142(9):798-805. doi: 10.1001/jamaophthalmol.2024.2513.

Development and Evaluation of a Retrieval-Augmented Large Language Model Framework for Ophthalmology

Ming-Jie Luo et al. JAMA Ophthalmol.

Abstract

Importance: Although augmenting large language models (LLMs) with knowledge bases may improve medical domain-specific performance, practical methods are needed for local implementation of LLMs that address privacy concerns and enhance accessibility for health care professionals.

Objective: To develop an accurate, cost-effective local implementation of an LLM to mitigate privacy concerns and support its practical deployment in health care settings.

Design, setting, and participants: ChatZOC (Sun Yat-Sen University Zhongshan Ophthalmology Center), a retrieval-augmented LLM framework, was developed by enhancing a baseline LLM with a comprehensive ophthalmic dataset and evaluation framework (CODE), which includes over 30 000 pieces of ophthalmic knowledge. This LLM was benchmarked against 10 representative LLMs, including GPT-4 and GPT-3.5 Turbo (OpenAI), across 300 clinical questions in ophthalmology. The evaluation, involving a panel of medical experts and biomedical researchers, focused on accuracy, utility, and safety. A double-masked approach was used to minimize assessment bias across all models. The study used a comprehensive knowledge base derived from ophthalmic clinical practice, without directly involving clinical patients.
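The retrieval-augmented design described above, in which a baseline LLM is grounded in a curated ophthalmic knowledge base before answering, follows the general retrieve-then-prompt pattern. The sketch below is a minimal, hypothetical illustration of that pattern using bag-of-words cosine similarity for retrieval; it is not the authors' implementation, and all names (`KNOWLEDGE_BASE`, `build_prompt`, the sample snippets) are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a curated ophthalmic knowledge base
# (the paper's actual base contains over 30 000 entries).
KNOWLEDGE_BASE = [
    "Primary open-angle glaucoma is managed by lowering intraocular pressure.",
    "Diabetic retinopathy screening uses dilated fundus examination.",
    "Cataract surgery replaces the clouded lens with an intraocular lens.",
]

def bow_vector(text):
    """Bag-of-words term counts over lowercase word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, k=1):
    """Return the k knowledge snippets most similar to the question."""
    q = bow_vector(question)
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda s: cosine(q, bow_vector(s)),
                    reverse=True)
    return ranked[:k]

def build_prompt(question):
    """Prepend retrieved context so the LLM answers from curated knowledge."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

prompt = build_prompt("How is intraocular pressure lowered in glaucoma?")
```

In practice, a framework of this kind would replace the bag-of-words retriever with dense embeddings and pass the assembled prompt to the locally hosted baseline model, which is what allows the system to run on-premises and address the privacy concerns noted above.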

Exposures: LLM response to clinical questions.

Main outcomes and measures: Accuracy, utility, and safety of LLMs in responding to clinical questions.

Results: The baseline model achieved a human ranking score of 0.48, whereas the retrieval-augmented LLM scored 0.60, a difference of 0.12 from baseline (95% CI, 0.02-0.22; P = .02) and not significantly different from GPT-4, which scored 0.61 (difference, 0.01; 95% CI, -0.11 to 0.13; P = .89). For scientific consensus, the retrieval-augmented LLM reached 84.0% compared with 46.5% for the baseline model (difference, 37.5%; 95% CI, 29.0%-46.0%; P < .001) and did not differ significantly from GPT-4 at 79.2% (difference, 4.8%; 95% CI, -0.3% to 10.0%; P = .06).
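As a rough sanity check on the scientific-consensus comparison, the difference of two proportions and a normal-approximation (Wald) 95% CI can be computed as below. The sample size of 300 questions per model is taken from the study design; the paper's exact statistical method is not stated here, so this interval is only an approximation and will not exactly reproduce the reported 29.0%-46.0% bounds.

```python
import math

def prop_diff_ci(p1, p2, n1, n2, z=1.96):
    """Wald 95% CI for the difference of two independent proportions."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - z * se, diff + z * se)

# Retrieval-augmented LLM (84.0%) vs baseline (46.5%), 300 questions each.
diff, (lo, hi) = prop_diff_ci(0.840, 0.465, 300, 300)
```

The point estimate reproduces the reported 37.5% difference exactly, while the Wald interval (roughly 30.5% to 44.5%) is narrower than the published one, consistent with the study having used a different interval method or variance structure.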

Conclusions and relevance: Results of this quality improvement study suggest that the integration of high-quality knowledge bases improved the LLM's performance in medical domains. This study highlights the transformative potential of augmented LLMs in clinical practice by providing reliable, safe, and practical clinical information. Further research is needed to explore the broader application of such frameworks in the real world.


Conflict of interest statement

Conflict of Interest Disclosures: None reported.

Figures

Figure 1. Human Evaluation Results of Responses Generated by Large Language Models (LLMs) in Terms of Accuracy
A total of 300 randomly selected question-answer pairs generated by 11 LLMs were manually validated. Accuracy was subdivided into 3 subcategories: scientific consensus, missing content, and possible bias. The orange bars represent our ophthalmic LLM in comparison with the baseline pretrained model; longer dark bars signify better model performance. The LLMs featured in this figure are: Baichuan-7B (Baichuan-Inc), Baichuan-13B (Baichuan-Inc), Baichuan-13B+COD (ZOC), ChatGLM-6B (THUDM), ChatGLM2-6B (THUDM), Chatyuan (ClueAI), GPT-4 (OpenAI), GPT-3.5 Turbo (OpenAI), Llama2-Chat-70B (Meta), Llama2-Chinese-Chat-13B (FlagAlpha), and StableVicuna-13B (CarperAI).
Figure 2. Human Evaluation Results of Responses Generated by Large Language Models (LLMs) in Terms of Utility
A total of 300 randomly selected question-answer pairs generated by 11 LLMs were manually validated. Utility was subdivided into 3 subcategories: correct understanding, correct retrieval, and correct reasoning. The orange bars represent our ophthalmic LLM in comparison with the baseline pretrained model; longer dark bars signify better model performance. The LLMs featured in this figure are: Baichuan-7B (Baichuan-Inc), Baichuan-13B (Baichuan-Inc), Baichuan-13B+COD (ZOC), ChatGLM-6B (THUDM), ChatGLM2-6B (THUDM), Chatyuan (ClueAI), GPT-4 (OpenAI), GPT-3.5 Turbo (OpenAI), Llama2-Chat-70B (Meta), Llama2-Chinese-Chat-13B (FlagAlpha), and StableVicuna-13B (CarperAI).
Figure 3. Human Evaluation Results of Responses Generated by Large Language Models (LLMs) in Terms of Safety
A total of 300 randomly selected question-answer pairs generated by 11 LLMs were manually validated. Safety was subdivided into 3 subcategories: inappropriate/wrong content, possible hazard, and hazard potential. The orange bars represent our ophthalmic LLM in comparison with the baseline pretrained model; longer dark bars signify better model performance. The LLMs featured in this figure are: Baichuan-7B (Baichuan-Inc), Baichuan-13B (Baichuan-Inc), Baichuan-13B+COD (ZOC), ChatGLM-6B (THUDM), ChatGLM2-6B (THUDM), Chatyuan (ClueAI), GPT-4 (OpenAI), GPT-3.5 Turbo (OpenAI), Llama2-Chat-70B (Meta), Llama2-Chinese-Chat-13B (FlagAlpha), and StableVicuna-13B (CarperAI).

