Multicenter Study

Federated Learning for Multicenter Collaboration in Ophthalmology: Improving Classification Performance in Retinopathy of Prematurity

Charles Lu et al. Ophthalmol Retina. 2022 Aug;6(8):657-663. doi: 10.1016/j.oret.2022.02.015. Epub 2022 Mar 14.

Abstract

Objective: To compare the performance of deep learning classifiers for the diagnosis of plus disease in retinopathy of prematurity (ROP) trained using 2 methods for developing models on multi-institutional data sets: centralizing data versus federated learning (FL), in which no data leave any institution.

Design: Evaluation of a diagnostic test or technology.

Subjects: Deep learning models were trained, validated, and tested on 5255 wide-angle retinal images in the neonatal intensive care units of 7 institutions as part of the Imaging and Informatics in ROP study. All images were labeled for the presence of plus, preplus, or no plus disease with a clinical label and a reference standard diagnosis (RSD) determined by 3 image-based ROP graders and the clinical diagnosis.

Methods: We compared the area under the receiver operating characteristic curve (AUROC) for models developed on multi-institutional data, first using a centralized approach and then using FL, and compared locally trained models with both approaches. We compared the model performance (κ) with the label agreement (between clinical and RSD labels), data set size, and number of plus disease cases in each training cohort using the Spearman correlation coefficient (CC).

Main outcome measures: Model performance using AUROC and linearly weighted κ.
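
For concreteness, both outcome measures, along with the Spearman correlation used in the Methods, can be computed with standard Python libraries. The following is a minimal sketch with hypothetical labels, probabilities, and per-site statistics; the class coding and all numbers are illustrative assumptions, not values from the study.

    # Minimal sketch of the outcome measures with hypothetical data
    # (not values from the study).
    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import roc_auc_score, cohen_kappa_score

    # Hypothetical 3-class coding: 0 = no plus, 1 = preplus, 2 = plus.
    y_true = np.array([0, 0, 1, 2, 2, 1, 0, 2])
    y_pred = np.array([0, 1, 1, 2, 1, 1, 0, 2])

    # Linearly weighted kappa: preplus-vs-plus confusions are penalized
    # less than no-plus-vs-plus confusions.
    kappa = cohen_kappa_score(y_true, y_pred, weights="linear")

    # AUROC for the binary plus vs. not-plus decision, from hypothetical
    # predicted probabilities of plus disease.
    p_plus = np.array([0.05, 0.40, 0.30, 0.90, 0.55, 0.35, 0.10, 0.85])
    auroc = roc_auc_score((y_true == 2).astype(int), p_plus)

    # Spearman correlation, e.g. between per-site training set size and
    # that site's kappa (hypothetical values).
    sizes = [120, 300, 450, 800]
    kappas = [0.55, 0.62, 0.70, 0.78]
    rho, p = spearmanr(sizes, kappas)

    print(f"kappa={kappa:.3f}  AUROC={auroc:.3f}  rho={rho:.3f} (p={p:.3f})")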

Results: Four experimental settings were compared: FL trained on RSD versus centrally trained on RSD, FL trained on clinical labels versus centrally trained on clinical labels, FL trained on RSD versus centrally trained on clinical labels, and FL trained on clinical labels versus centrally trained on RSD (P = 0.046, P = 0.126, P = 0.224, and P = 0.0173, respectively). Four of the 7 (57%) models trained on local institutional data performed worse than the FL models. The performance of local models was positively correlated with the label agreement (between clinical and RSD labels; CC = 0.389, P = 0.387), the total number of plus cases (CC = 0.759, P = 0.047), and the overall training set size (CC = 0.924, P = 0.002).

Conclusions: We found that a trained FL model performs comparably to a centralized model, suggesting that FL may provide an effective and more feasible solution for interinstitutional learning. Smaller institutions benefited more from collaboration than larger ones, showing the potential of FL to address disparities in resource access.

Keywords: Deep learning; Epidemiology; Federated learning; Retinopathy of prematurity.


Conflict of interest statement

  1. Drs. Campbell, Chan, and Kalpathy-Cramer receive research support from Genentech (San Francisco, CA). Dr. Chiang previously received research support from Genentech.

  2. The i-ROP DL system has been licensed to Boston AI Lab (Boston, MA) by Oregon Health & Science University, Massachusetts General Hospital, Northeastern University, and the University of Illinois, Chicago, which may result in royalties to Drs. Chan, Campbell, Brown, and Kalpathy-Cramer in the future.

  3. Dr. Campbell was a consultant to Boston AI Lab (Boston, MA).

  4. Dr. Chan is on the Scientific Advisory Board for Phoenix Technology Group (Pleasanton, CA) and a consultant for Alcon (Fort Worth, TX).

  5. Dr. Chiang was previously a consultant for Novartis (Basel, Switzerland), and was previously an equity owner of InTeleretina, LLC (Honolulu, HI).

  6. Drs. Chan and Campbell are equity owners of Siloam Vision.

Figures

Figure 1. Federated learning training schema.
During each round of training, the global model is synced to all institutions locally (left). Then, each institution trains for a fixed number of epochs (center) before the local model weights are aggregated and averaged in the central server to update the global model (right).
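
As a rough illustration of this schema, the sketch below implements federated averaging in plain NumPy. The model (a linear regressor), the synthetic per-institution data, and the dataset-size weighting are illustrative assumptions, not the actual i-ROP DL system or its training protocol.

    # Minimal federated averaging (FedAvg) sketch with a hypothetical
    # linear model and synthetic institutional data.
    import numpy as np

    rng = np.random.default_rng(0)

    def local_train(weights, data, epochs=2, lr=0.1):
        # Local update: a few full-batch gradient steps on a least-squares
        # objective, standing in for local epochs of deep-network training.
        w = weights.copy()
        X, y = data
        for _ in range(epochs):
            grad = X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        return w

    # Synthetic "institutions" with differently sized local datasets.
    true_w = np.array([1.0, -2.0])
    institutions = []
    for n in (50, 200, 500):
        X = rng.normal(size=(n, 2))
        y = X @ true_w + 0.1 * rng.normal(size=n)
        institutions.append((X, y))

    global_w = np.zeros(2)
    for round_ in range(10):
        # 1) Sync: send the current global model to every institution.
        # 2) Local training: each site trains for a fixed number of
        #    epochs; raw data never leaves the institution.
        local_ws = [local_train(global_w, data) for data in institutions]
        # 3) Aggregate: average the local weights on the central server
        #    (weighted by local dataset size, as in standard FedAvg).
        sizes = np.array([len(y) for _, y in institutions], dtype=float)
        global_w = np.average(local_ws, axis=0, weights=sizes)

    print("global model after 10 rounds:", global_w)  # approaches true_w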
Figure 2. Performance of federated and centrally trained models by ground truth.
The AUROC was high for all four approaches, but models trained using a central approach slightly outperformed models trained using a federated learning approach, as did models trained using reference standard diagnosis (RSD) labels compared with clinical labels (when evaluated against RSD ground truth).
Figure 3. Comparative performance of single-institution versus multi-institutional models.
We compared the area under the receiver operating characteristic curve (AUROC) performance for locally trained models using clinical labels on the average of RSD test sets. Four of the 7 (57%) locally trained models (in blue) performed worse than both the central and federated learning models on data from their own institution labeled with a reference standard label.
Figure 4. Relationship between label agreement, training dataset size, disease prevalence, and performance.
There was a positive but nonsignificant correlation between clinical vs. reference standard diagnosis (RSD) label agreement and the average kappa performance of the model against RSD (Pearson coefficient 0.389, p = 0.387). Pearson's correlation between the number of plus cases in the training set and kappa performance was 0.759 (p = 0.047), and the correlation between overall training set size and kappa performance was 0.924 (p = 0.002).
