JAMA Ophthalmol. 2025 Sep 1;143(9):740-748. doi: 10.1001/jamaophthalmol.2025.2413.

Ophthalmological Question Answering and Reasoning Using OpenAI o1 vs Other Large Language Models

Sahana Srinivasan et al. JAMA Ophthalmol. 2025.

Abstract

Importance: OpenAI's recent large language model (LLM) o1 has dedicated reasoning capabilities, but it remains untested in specialized medical fields like ophthalmology. Evaluating o1 in ophthalmology is crucial to determine whether its general reasoning can meet specialized needs or if domain-specific LLMs are warranted.

Objective: To assess the performance and reasoning ability of OpenAI's o1 compared with other LLMs on ophthalmological questions.

Design, setting, and participants: In September through October 2024, the LLMs o1, GPT-4o (OpenAI), GPT-4 (OpenAI), GPT-3.5 (OpenAI), Llama 3-8B (Meta), and Gemini 1.5 Pro (Google) were evaluated on 6990 standardized ophthalmology questions from the Medical Multiple-Choice Question Answering (MedMCQA) dataset. The study did not analyze human participants.
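
As a rough, hypothetical sketch of this setup (not the authors' code), the snippet below pulls ophthalmology items from a public MedMCQA release and queries one OpenAI chat model; the dataset identifier, field names, subject filter, and prompt wording are assumptions for illustration.

```python
# Hypothetical evaluation-loop sketch; the dataset id and field names follow the
# public Hugging Face MedMCQA release and may differ from the study's pipeline.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Keep only ophthalmology items (subject label assumed).
medmcqa = load_dataset("openlifescienceai/medmcqa", split="train")
eye_items = medmcqa.filter(lambda row: row["subject_name"] == "Ophthalmology")

for row in eye_items.select(range(3)):  # small demo slice
    prompt = (
        f"Question: {row['question']}\n"
        f"A. {row['opa']}\nB. {row['opb']}\nC. {row['opc']}\nD. {row['opd']}\n"
        "Answer with the correct option letter and a short explanation."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # swap in "o1", "gpt-4", etc., as in the study
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply.choices[0].message.content)
```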

Main outcomes and measures: Models were evaluated on performance (accuracy and macro F1 score) and reasoning abilities (text-generation metrics: Recall-Oriented Understudy for Gisting Evaluation [ROUGE-L], BERTScore, BARTScore, AlignScore, and Metric for Evaluation of Translation With Explicit Ordering [METEOR]). Mean scores are reported for o1, while mean differences (Δ) from o1's scores are reported for other models. Expert qualitative evaluation of o1 and GPT-4o responses assessed usefulness, organization, and comprehensibility using 5-point Likert scales.
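
As a hedged illustration of how such scores can be computed (not the study's actual pipeline), the snippet below derives accuracy and macro F1 from predicted option letters, and ROUGE-L and METEOR from a generated explanation; the library choices (scikit-learn, rouge-score, NLTK) and example strings are assumptions.

```python
# Illustrative metric computation on toy data; BERTScore, BARTScore, and AlignScore
# would be computed analogously with their respective packages.
import nltk
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from sklearn.metrics import accuracy_score, f1_score

nltk.download("wordnet", quiet=True)  # METEOR relies on WordNet

# Performance metrics on the chosen option letters.
true_options = ["A", "C", "B"]
pred_options = ["A", "C", "D"]
accuracy = accuracy_score(true_options, pred_options)
macro_f1 = f1_score(true_options, pred_options, average="macro")

# Text-generation metrics comparing a generated explanation with a reference one.
reference = "Acute angle-closure glaucoma typically causes a mid-dilated, poorly reactive pupil."
generated = "A mid-dilated, poorly reactive pupil is typical of acute angle-closure glaucoma."

rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(
    reference, generated
)["rougeL"].fmeasure
meteor = meteor_score([reference.split()], generated.split())

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} "
      f"rougeL={rouge_l:.3f} meteor={meteor:.3f}")
```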

Results: The LLM o1 achieved the highest accuracy (mean, 0.877; 95% CI, 0.870 to 0.885) and macro F1 score (mean, 0.877; 95% CI, 0.869 to 0.884) (P < .001). In BERTScore, GPT-4o (Δ = 0.012; 95% CI, 0.012 to 0.013) and GPT-4 (Δ = 0.014; 95% CI, 0.014 to 0.015) outperformed o1 (P < .001). Similarly, in AlignScore, GPT-4o (Δ = 0.019; 95% CI, 0.016 to 0.021) and GPT-4 (Δ = 0.024; 95% CI, 0.021 to 0.026) again performed better (P < .001). In ROUGE-L, GPT-4o (Δ = 0.018; 95% CI, 0.017 to 0.019), GPT-4 (Δ = 0.026; 95% CI, 0.025 to 0.027), and GPT-3.5 (Δ = 0.008; 95% CI, 0.007 to 0.009) all outperformed o1 (P < .001). Conversely, o1 led in BARTScore (mean, -4.787; 95% CI, -4.813 to -4.762; P < .001) and METEOR (mean, 0.221; 95% CI, 0.218 to 0.223; P < .001 except GPT-4o). Also, o1 outperformed GPT-4o in usefulness (o1: mean, 4.81; 95% CI, 4.73 to 4.89; GPT-4o: mean, 4.53; 95% CI, 4.40 to 4.65; P < .001) and organization (o1: mean, 4.83; 95% CI, 4.75 to 4.90; GPT-4o: mean, 4.63; 95% CI, 4.51 to 4.74; P = .003).

Conclusions and relevance: This study found that o1 excelled in accuracy but showed inconsistencies in text-generation metrics, trailing GPT-4o and GPT-4; expert reviews found o1's responses to be more clinically useful and better organized than those of GPT-4o. While o1 demonstrated promise, its performance in addressing ophthalmology-specific challenges is not fully optimal, underscoring the potential need for domain-specialized LLMs and targeted evaluations.

Conflict of interest statement

Conflict of Interest Disclosures: None reported.

Figures

Figure 1. Standardized Prompt Format Used for Each Multiple-Choice Question Item
Each of the 6990 questions was formatted into a standardized prompt structure—prompt, question, options, and format specification—before being used to test the 6 models.
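
A minimal sketch of the four-part structure described in this caption (prompt, question, options, format specification) might look like the following; the exact wording used in the study is not reproduced here and is assumed.

```python
# Hypothetical reconstruction of the standardized prompt structure in Figure 1;
# the instruction text is illustrative, not the study's verbatim wording.
def build_prompt(question: str, options: dict[str, str]) -> str:
    option_block = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return (
        "You are answering an ophthalmology multiple-choice question.\n"   # prompt
        f"Question: {question}\n"                                          # question
        f"Options:\n{option_block}\n"                                      # options
        "Reply with the correct option letter, then a brief explanation."  # format specification
    )

print(build_prompt(
    "Which finding is most characteristic of acute angle-closure glaucoma?",
    {"A": "Mid-dilated pupil", "B": "Miotic pupil", "C": "Hypopyon", "D": "Cherry-red spot"},
))
```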
Figure 2. Comparison of the Accuracy, Macro F1 Scores, and Text-Generation Metrics Between OpenAI o1 and 5 Other Large Language Models
A, The mean accuracy and macro F1 scores of the 6 models. B, The text-generation metric scores of the 6 models were normalized on a scale from 0 to 1, where 1 represents the model with the highest score in that metric and 0 represents the model with the lowest score. METEOR indicates Metric for Evaluation of Translation With Explicit Ordering; ROUGE-L, Recall-Oriented Understudy for Gisting Evaluation.
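
The normalization described in panel B corresponds to a simple min-max rescaling of each metric across the 6 models; a sketch (not the authors' code):

```python
def min_max_normalize(scores: list[float]) -> list[float]:
    """Rescale one metric's per-model scores so the best model maps to 1 and the worst to 0."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical BARTScores for the 6 models (illustrative values only).
print(min_max_normalize([-4.79, -4.81, -4.85, -4.95, -5.10, -5.30]))
```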
