OphthaBERT: Automated Glaucoma Diagnosis from Clinical Notes

Rishi Shah et al. medRxiv [Preprint]. 2025 Jun 9. doi: 10.1101/2025.06.08.25329151.

Abstract

Glaucoma is a leading cause of irreversible blindness worldwide, and early intervention is often crucial. Research into the underpinnings of glaucoma often relies on electronic health records (EHRs) to identify patients with glaucoma and their subtypes. However, current methods for identifying glaucoma patients from EHRs, which rely on International Classification of Diseases (ICD) codes or manual chart review, are often inaccurate or infeasible at scale. To address this limitation, we introduce (1) OphthaBERT, a general clinical ophthalmology language model trained on over 2 million diverse clinical notes, and (2) a fine-tuned variant of OphthaBERT that automatically extracts binary and subtype glaucoma diagnoses from clinical notes. The base OphthaBERT model is a robust encoder, outperforming state-of-the-art clinical encoders in masked token prediction on out-of-distribution ophthalmology clinical notes and in binary glaucoma classification with limited data; the binary classification improvements in low-data regimes are significant (p < 0.001, Bonferroni corrected). The fine-tuned OphthaBERT also achieves superior classification performance for both binary and subtype diagnosis, outperforming even fine-tuned large decoder-only language models at a fraction of the computational cost. We demonstrate a 0.23-point increase in macro-F1 for subtype diagnosis over ICD codes and strong binary classification performance under external validation at the Wilmer Eye Institute. OphthaBERT provides an interpretable, equitable framework for general ophthalmology language modeling and automated glaucoma diagnosis.


Conflict of interest statement

Competing Interests: N.Z. receives consulting fees from Sanofi.

Figures

Figure 1: Model architecture and pretraining sample statistics
(a) General flowchart for utilizing OphthaBERT in ophthalmology language tasks. OphthaBERT can either be fine-tuned, or the pretrained text embeddings produced by the model can be fed directly into task-specific models for efficient domain-aware encoding. In this work, our task module is predicting glaucoma diagnoses from unstructured clinical notes. (b) Distribution of notes from Massachusetts Eye and Ear (MEE) Infirmary by race and subspecialty clinic for masked pretraining of OphthaBERT. (c) Distribution of pretraining notes by contact date. (d) Frequencies of the top 15 bigrams in labeled notes utilized for downstream glaucoma identification. (e) Visualization of the principal components of the embeddings of the [CLS] token for labeled case and control notes before pretraining and fine-tuning OphthaBERT. (f) Distribution of note lengths of case and control notes for supervised tuning.
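
As a minimal sketch of the frozen-embedding path in panel (a), the snippet below encodes a note with a BERT-style encoder and trains a lightweight classifier on the [CLS] vectors. The checkpoint name is a publicly available stand-in (the BioClinicalBERT baseline), not the OphthaBERT weights, and the logistic-regression task module and example notes are illustrative assumptions.

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression

    # Stand-in encoder; the paper's OphthaBERT checkpoint is not specified here.
    name = "emilyalsentzer/Bio_ClinicalBERT"
    tokenizer = AutoTokenizer.from_pretrained(name)
    encoder = AutoModel.from_pretrained(name).eval()

    def cls_embedding(note: str) -> torch.Tensor:
        """Return the [CLS] embedding for one clinical note (truncated to 512 tokens)."""
        inputs = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            outputs = encoder(**inputs)
        return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token vector

    # Illustrative task module: a simple classifier over the pooled embeddings.
    notes = ["Assessment: primary open angle glaucoma, stable IOP on latanoprost.",
             "Routine exam, healthy optic nerves, no visual field defects."]
    labels = [1, 0]
    X = torch.stack([cls_embedding(n) for n in notes]).numpy()
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

The fine-tuning path instead updates the encoder weights together with a task head, as sketched after the Figure 6 caption below.
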
Figure 2: Pretraining improves ophthalmology task performance
(a) Binary glaucoma classification macro-averaged F1 score across different fractions of labeled training data. We attach a simple classification head to the base OphthaBERT model to isolate the impact of pretraining rather than a specific architecture. Significance between models was evaluated with a paired McNemar's test, and p-values were Bonferroni-corrected to account for multiple hypothesis testing. The significance thresholds are: n.s. (−), p < 0.05 (*), p < 0.01 (**), p < 0.001 (***). (b) Top-1 and top-5 accuracy of OphthaBERT and BioClinicalBERT for masked token prediction on an out-of-distribution dataset. Top-1 accuracy is the model's accuracy in recovering the correct token with a single guess; top-5 accuracy is its accuracy in ranking the correct token among its top 5 predictions.
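
A rough sketch of the top-k masked-token metric in panel (b): hide one token, ask the model to fill it in, and check whether the true token is the single best guess (top-1) or among the five best (top-5). The checkpoint is again a public stand-in, and the paper's exact masking and scoring procedure may differ.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    name = "emilyalsentzer/Bio_ClinicalBERT"  # stand-in for OphthaBERT
    tokenizer = AutoTokenizer.from_pretrained(name)
    mlm = AutoModelForMaskedLM.from_pretrained(name).eval()

    def topk_hit(text: str, mask_position: int, k: int) -> bool:
        """Mask one token position and check whether its true id is in the model's top-k guesses."""
        enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        true_id = enc["input_ids"][0, mask_position].item()
        enc["input_ids"][0, mask_position] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = mlm(**enc).logits
        topk_ids = logits[0, mask_position].topk(k).indices.tolist()
        return true_id in topk_ids

    # Top-1 and top-5 accuracy are the fractions of masked positions where
    # topk_hit(..., k=1) and topk_hit(..., k=5) are True, respectively.
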
Figure 3: Binary and subtype glaucoma diagnosis performance
(a) Precision-recall curves for each racial group for note-level binary classification. (b) ROC curves (with AUC) for each racial group for note-level binary classification. (c) Predicted probabilities for glaucoma cases and controls when aggregating note-level labels to produce patient-level labels. Correct predictions (green) and incorrect predictions (blue) are jittered with Gaussian noise along the x-axis for clarity. (d) Prediction-normalized confusion matrix for note-level subtype classification; predicted classes are on the x-axis and true classes on the y-axis. (e) Prediction-normalized confusion matrix for patient-level subtype classification, with the same axis convention. (f) Note-level classification performance for each subtype when evaluated against all other classes. (g) Macro-averaged note-level classification performance for each racial group. (h) Binary note-level classification performance for notes within and beyond the 512-token BERT context window.
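
Panel (c) aggregates note-level predictions into patient-level labels. A minimal sketch of one such aggregation rule is shown below; averaging a patient's note probabilities and thresholding the mean is an assumption for illustration, and the paper's actual rule (e.g., majority vote or max) may differ.

    from collections import defaultdict

    def patient_predictions(note_probs, threshold=0.5):
        """note_probs: iterable of (patient_id, predicted_probability) pairs."""
        by_patient = defaultdict(list)
        for patient_id, prob in note_probs:
            by_patient[patient_id].append(prob)
        # Mean note probability per patient, thresholded into a binary label.
        return {pid: (sum(ps) / len(ps) >= threshold) for pid, ps in by_patient.items()}

    print(patient_predictions([("pt1", 0.91), ("pt1", 0.62), ("pt2", 0.08)]))
    # {'pt1': True, 'pt2': False}
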
Figure 4: Benchmarking and Interpretability
(a) Bootstrapped patient-level binary glaucoma classification performance benchmarked on identical splits. (b) Bootstrapped patient-level subtype glaucoma classification performance benchmarked on identical splits. (c) Example attributions for subtype classification of a ‘suspect’ note utilizing integrated gradients. (d) Words containing tokens with the highest average attributions for notes of primary open angle glaucoma (POAG) subtype labels and glaucoma binary labels. (e) Patient-level subtype glaucoma classification performance benchmarked against ICD codes. (f) Comparison of patient-level subtype classification between OphthaBERT and ICD codes averaged over all subtypes.
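
Panels (c) and (d) rely on integrated-gradients attributions over input tokens. The sketch below shows one common way to compute them for a BERT-style classifier using Captum's LayerIntegratedGradients on the word-embedding layer; the checkpoint, number of classes, and target class index are placeholders rather than the paper's configuration.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from captum.attr import LayerIntegratedGradients

    name = "emilyalsentzer/Bio_ClinicalBERT"  # stand-in for the fine-tuned OphthaBERT classifier
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=4).eval()

    def forward(input_ids, attention_mask):
        return model(input_ids=input_ids, attention_mask=attention_mask).logits

    enc = tokenizer("Assessment: glaucoma suspect, borderline cup-to-disc ratio.",
                    return_tensors="pt", truncation=True, max_length=512)
    # Baseline of all [PAD] tokens; a common refinement keeps [CLS]/[SEP] intact.
    baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

    lig = LayerIntegratedGradients(forward, model.bert.embeddings.word_embeddings)
    attributions = lig.attribute(inputs=enc["input_ids"],
                                 baselines=baseline,
                                 additional_forward_args=(enc["attention_mask"],),
                                 target=2)  # hypothetical index of the 'suspect' class

    # Sum over the embedding dimension to get one attribution score per token.
    token_scores = attributions.sum(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
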
Figure 5: Hopkins External Validation
(a) Binary note-level F1 score at various probability cutoff thresholds for a set of 70 notes from the Johns Hopkins University (JHU) Department of Ophthalmology. We examine model performance when including and excluding notes with 'possible glaucoma' labels across note types: assessment and plan notes concatenated with overview notes (A&P + Overview), overview notes alone, and progress notes. (b) Performance metrics for OphthaBERT's binary glaucoma diagnosis on the JHU dataset.
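
The threshold sweep in panel (a) can be reproduced with a few lines: compute the binary F1 score at a range of probability cutoffs and read off the operating point. The probabilities and labels below are toy values for illustration only.

    import numpy as np
    from sklearn.metrics import f1_score

    y_true = np.array([1, 0, 1, 1, 0, 0, 1])            # toy ground-truth labels
    y_prob = np.array([0.92, 0.40, 0.61, 0.35, 0.10, 0.55, 0.80])  # toy model probabilities

    for cutoff in np.arange(0.1, 1.0, 0.1):
        y_pred = (y_prob >= cutoff).astype(int)
        print(f"cutoff={cutoff:.1f}  F1={f1_score(y_true, y_pred):.3f}")
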
Figure 6: OphthaBERT Architecture
Architecture of the OphthaBERT model for masked pretraining and glaucoma diagnosis. Masked pretraining predicts missing tokens after 15% of the tokens are hidden from the model. The glaucoma model is built on top of the pretrained OphthaBERT encoder, with a binary classification module for binary diagnosis and a multiclass classification module for subtype identification. The model is trained with a joint loss function that averages binary cross-entropy loss and cross-entropy loss, so it learns shared representations for both tasks; these shared representations are fed into both classification heads, allowing for efficient and effective classification.
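
A minimal sketch of the dual-head design described above, assuming a shared encoder whose [CLS] representation feeds a binary head and a subtype head trained under a joint loss that averages binary cross-entropy and cross-entropy. The encoder checkpoint, head sizes, and number of subtype classes are illustrative assumptions.

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class GlaucomaHeads(nn.Module):
        def __init__(self, encoder_name="emilyalsentzer/Bio_ClinicalBERT", n_subtypes=4):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)  # stand-in for OphthaBERT
            hidden = self.encoder.config.hidden_size
            self.binary_head = nn.Linear(hidden, 1)            # glaucoma vs. no glaucoma
            self.subtype_head = nn.Linear(hidden, n_subtypes)  # subtype classification

        def forward(self, input_ids, attention_mask):
            # Shared [CLS] representation feeds both classification heads.
            cls = self.encoder(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state[:, 0, :]
            return self.binary_head(cls).squeeze(-1), self.subtype_head(cls)

    def joint_loss(binary_logit, subtype_logits, binary_label, subtype_label):
        # Average of binary cross-entropy and cross-entropy, as the caption describes.
        bce = nn.functional.binary_cross_entropy_with_logits(binary_logit, binary_label.float())
        ce = nn.functional.cross_entropy(subtype_logits, subtype_label)
        return 0.5 * (bce + ce)
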

