Nat Med. 2021 Oct;27(10):1735-1743.
doi: 10.1038/s41591-021-01506-3. Epub 2021 Sep 15.

Federated learning for predicting clinical outcomes in patients with COVID-19

Ittai Dayan #  1 Holger R Roth #  2 Aoxiao Zhong #  3   4 Ahmed Harouni  2 Amilcare Gentili  5 Anas Z Abidin  2 Andrew Liu  2 Anthony Beardsworth Costa  6 Bradford J Wood  7   8 Chien-Sung Tsai  9 Chih-Hung Wang  10   11 Chun-Nan Hsu  12 C K Lee  2 Peiying Ruan  2 Daguang Xu  2 Dufan Wu  3 Eddie Huang  2 Felipe Campos Kitamura  13 Griffin Lacey  2 Gustavo César de Antônio Corradi  13 Gustavo Nino  14 Hao-Hsin Shin  15 Hirofumi Obinata  16 Hui Ren  3 Jason C Crane  17 Jesse Tetreault  2 Jiahui Guan  2 John W Garrett  18 Joshua D Kaggie  19 Jung Gil Park  20 Keith Dreyer  1   21 Krishna Juluru  15 Kristopher Kersten  2 Marcio Aloisio Bezerra Cavalcanti Rockenbach  21 Marius George Linguraru  22   23 Masoom A Haider  24   25 Meena AbdelMaseeh  25 Nicola Rieke  2 Pablo F Damasceno  17 Pedro Mario Cruz E Silva  2 Pochuan Wang  26   27 Sheng Xu  7   8 Shuichi Kawano  16 Sira Sriswasdi  28   29 Soo Young Park  30 Thomas M Grist  31 Varun Buch  21 Watsamon Jantarabenjakul  32   33 Weichung Wang  26   27 Won Young Tak  30 Xiang Li  3 Xihong Lin  34 Young Joon Kwon  6 Abood Quraini  2 Andrew Feng  2 Andrew N Priest  35 Baris Turkbey  8   36 Benjamin Glicksberg  37 Bernardo Bizzo  21 Byung Seok Kim  38 Carlos Tor-Díez  22 Chia-Cheng Lee  39 Chia-Jung Hsu  39 Chin Lin  40   41   42 Chiu-Ling Lai  43 Christopher P Hess  17 Colin Compas  2 Deepeksha Bhatia  2 Eric K Oermann  44 Evan Leibovitz  21 Hisashi Sasaki  16 Hitoshi Mori  16 Isaac Yang  2 Jae Ho Sohn  17 Krishna Nand Keshava Murthy  15 Li-Chen Fu  45 Matheus Ribeiro Furtado de Mendonça  13 Mike Fralick  46 Min Kyu Kang  20 Mohammad Adil  2 Natalie Gangai  15 Peerapon Vateekul  47 Pierre Elnajjar  15 Sarah Hickman  19 Sharmila Majumdar  17 Shelley L McLeod  48   49 Sheridan Reed  7   8 Stefan Gräf  50 Stephanie Harmon  8   51 Tatsuya Kodama  16 Thanyawee Puthanakit  32   33 Tony Mazzulli  52   53   54 Vitor Lima de Lavor  13 Yothin Rakvongthai  55 Yu Rim Lee  30 Yuhong Wen  2 Fiona J Gilbert #  19 Mona G Flores #  56 
Quanzheng Li #  3

Ittai Dayan et al. Nat Med. 2021 Oct.

Abstract

Federated learning (FL) is a method used for training artificial intelligence models with data from multiple sources while maintaining data anonymity, thus removing many barriers to data sharing. Here we used data from 20 institutes across the globe to train an FL model, called EXAM (electronic medical record (EMR) chest X-ray AI model), that predicts the future oxygen requirements of symptomatic patients with COVID-19 using inputs of vital signs, laboratory data and chest X-rays. EXAM achieved an average area under the curve (AUC) >0.92 for predicting outcomes at 24 and 72 h from the time of initial presentation to the emergency room, and it provided a 16% improvement in average AUC measured across all participating sites and an average increase in generalizability of 38% when compared with models trained at a single site using that site’s data. For prediction of mechanical ventilation treatment or death at 24 h at the largest independent test site, EXAM achieved a sensitivity of 0.950 and a specificity of 0.882. In this study, FL facilitated rapid data science collaboration without data exchange and generated a model that generalized across heterogeneous, unharmonized datasets for prediction of clinical outcomes in patients with COVID-19, setting the stage for the broader use of FL in healthcare.
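The core FL loop described here — local training at each participating site followed by server-side aggregation of model updates — can be sketched as federated averaging (FedAvg) over a toy logistic-regression model. This is a minimal illustration under simplifying assumptions, not the actual EXAM implementation, which trained a deep multimodal network:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=1):
    """One site's local training step: plain full-batch logistic-regression
    gradient descent (an illustrative stand-in for local EXAM training)."""
    w = weights.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * (X.T @ (preds - y)) / len(y)
    return w

def federated_round(global_w, clients):
    """One FedAvg round: each site trains locally on its own data; the
    server averages the resulting weights, weighted by dataset size.
    No raw data ever leaves a site."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    return np.average(local_ws, axis=0, weights=sizes)

# Toy run: two "sites" with deliberately different data distributions
rng = np.random.default_rng(0)
clients = []
for shift in (0.0, 1.0):
    X = rng.normal(shift, 1.0, size=(100, 3))
    y = (X[:, 0] + X[:, 1] > shift).astype(float)
    clients.append((X, y))

w = np.zeros(3)
for _ in range(50):
    w = federated_round(w, clients)
```

The server never sees either site's `X` or `y`, only the locally trained weights, which is the property that removes the data-sharing barrier discussed above.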

Conflict of interest statement

Competing interests

Financial competing interests

This study was organized and coordinated by NVIDIA. Y.W., M.A., I.Y., A.Q., C.C., D.B., A.F., H.R., J.G., D.X., N.R., A.H., K.K., C.R., A.A., C.K.L., E.H., A.L., G.L., P.M.C.S., J.T. and M.G.F. are employees of NVIDIA and own stock as part of the standard compensation package.

J.G. declared ownership of NVIDIA stock.

I.D. is presently an officer and shareholder of Rhino HealthTech Inc., a company that provides systems for distributed computation which can, among other things, be used to run federated learning tasks. He was not employed by this company during the execution of the EXAM study.

The remaining authors declare no competing interests.

Non-financial competing interests

C.H. declared research travel (Siemens Healthineers AG), conference travel (EUROKONGRESS GmbH) and personal fees (consultant, GE Healthcare LLC; DSMB member, Focused Ultrasound Foundation).

F.J.G. declared research collaborations with Merantix, Screen-Point, Lunit, Volpara and GE Healthcare, and undertakes paid consultancy for Kheiron and Alphabet.

M.L. declared that he is the co-founder of PediaMetrix Inc. and is on the board of the SIPAIM Foundation.

S.E.H. declared research collaborations with Merantix, Screen-Point, Lunit and Volpara.

B.J.W. and S.X. declared that NIH and NVIDIA have a Cooperative Research and Development Agreement. This work was supported in part by the NIH Center for Interventional Oncology and the Intramural Research Program of the National Institutes of Health, via intramural NIH grants Z1A CL040015 and 1ZIDBC011242, and by the NIH Intramural Targeted Anti-COVID-19 (ITAC) Program, funded by the National Institute of Allergy and Infectious Diseases. NIH may have intellectual property in the field.

The remaining authors declare no competing interests.

Figures

Extended Data Fig. 1
Extended Data Fig. 1. Test performance of models predicting 72h oxygen treatment trained on local data only versus the performance of the best global model available on the server.
a, Test performance of models predicting 72-h oxygen treatment trained on local data only (Local) versus the performance of the best global model available on the server (FL (gl. best)). b, Generalizability (average performance on other sites’ test data) as a function of a site’s dataset size (no. of cases). The average performance improved by 18% (from 0.760 to 0.899, or 13.9 percentage points) compared to locally trained models alone, while average generalizability of the global model improved by 34% (from 0.669 to 0.899, or 23.0 percentage points).
Extended Data Fig. 2
Extended Data Fig. 2. Confusion Matrices at a site with unbalanced data and mostly mild cases.
a, Confusion matrices on the test data at site 16 predicting oxygen treatment at 72 h using the locally trained model. b, Confusion matrices on the test data at site 16 predicting oxygen treatment at 72 h using the best federated learning global model. Confusion matrices are shown for two different cut-off values t of the EXAM risk score.
Extended Data Fig. 3
Extended Data Fig. 3. Effect of data set size on model performance.
ROCs of the best global model in comparison to the mean ROCs of models trained on local datasets to predict 24/72-h oxygen treatment for COVID-positive/negative patients, respectively, using the test data of five large datasets from sites in the Boston area. The mean ROC is calculated from the five locally trained models, with the gray area showing the standard deviation of the ROCs. ROCs are shown for three different cut-off values t of the EXAM risk score.
Extended Data Fig. 4
Extended Data Fig. 4. Failure cases at an independent test site.
CXRs from two failure cases at CDH. The data shown are noisy: each available value has been anonymized by adding zero-mean Gaussian noise with a standard deviation equal to one fifth of the standard deviation of the cohort distribution.
Extended Data Fig. 5
Extended Data Fig. 5. Safety enhancing features used in EXAM.
Additional data-safety-enhancing features were assessed by sharing only a certain percentage of the weight updates with the largest magnitudes with the server after each round of learning. We show that, by using partial weight updates during FL, models can be trained that reach a performance comparable to training with full weight sharing. This differential privacy technique decreases the risk of model inversion or reconstruction of the training image data through gradient interception.
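A minimal sketch of this partial-weight-sharing idea, using a hypothetical `share_top_fraction` helper (not the code used in the study), which keeps only the largest-magnitude entries of an update before it leaves the client:

```python
import numpy as np

def share_top_fraction(weight_update, fraction=0.25):
    """Zero out all but the `fraction` of entries with the largest
    magnitudes, so only a partial weight update leaves the client.
    Sharing less of each update reduces what a gradient-interception
    attack could reconstruct from the transmitted values."""
    magnitudes = np.abs(weight_update).ravel()
    k = max(1, int(magnitudes.size * fraction))
    threshold = np.partition(magnitudes, -k)[-k]  # k-th largest magnitude
    return np.where(np.abs(weight_update) >= threshold, weight_update, 0.0)

rng = np.random.default_rng(1)
update = rng.normal(size=(4, 4))           # a client's full weight update
shared = share_top_fraction(update, 0.25)  # only 4 of 16 entries survive
```

The server still receives the update entries that matter most for convergence, which is why performance stays comparable to full sharing.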
Extended Data Fig. 6
Extended Data Fig. 6. Characteristics of EMR data used in EXAM.
Min. and max. values (asterisks) and mean and standard deviation (length of bars) for each EMR feature used as an input to the model. n specifies the number of sites that had this particular feature available. Missing values were imputed using the MissForest algorithm.
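For illustration, the iterative imputation idea behind MissForest can be sketched in plain NumPy, substituting least-squares regression for the per-feature random forests that MissForest actually fits:

```python
import numpy as np

def iterative_impute(X, n_iter=10):
    """Simplified MissForest-style imputation: start from column means,
    then repeatedly re-predict each column's missing entries from the
    other columns via least-squares regression (MissForest uses random
    forests as the per-column model instead)."""
    X = X.astype(float).copy()
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])  # initial fill
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            miss = mask[:, j]
            if not miss.any():
                continue
            others = np.delete(X, j, axis=1)
            # Fit column j from the other columns on observed rows
            A = np.column_stack([others[~miss], np.ones((~miss).sum())])
            coef, *_ = np.linalg.lstsq(A, X[~miss, j], rcond=None)
            # Re-predict the missing entries of column j
            B = np.column_stack([others[miss], np.ones(miss.sum())])
            X[miss, j] = B @ coef
    return X

# Toy EMR-like matrix: rows = patients, columns = features, NaN = missing
emr = np.array([
    [37.2, 98.0, 5.1],
    [38.5, np.nan, 7.3],
    [36.9, 94.0, np.nan],
    [39.1, 91.0, 9.0],
])
filled = iterative_impute(emr)
```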
Extended Data Fig. 7
Extended Data Fig. 7. Distribution of oxygen treatments between EXAM sites.
The boxplots show the minimum, the maximum, the sample median, and the first and third quartiles (excluding outliers) of the oxygen treatments applied at different sites at the time of emergency department admission and after 24- and 72-hour periods. The types of oxygen treatments administered are ‘room air’, ‘low-flow oxygen’, ‘high-flow oxygen (non-invasive)’, and ‘ventilator’.
Extended Data Fig. 8
Extended Data Fig. 8. Site variations in oxygen usage.
Normalized distributions of oxygen devices at different time points, comparing the site with the largest dataset (site 1) and a site with unbalanced data consisting mostly of mild cases (site 16).
Extended Data Fig. 9
Extended Data Fig. 9. Description of the EXAM Federated Learning study.
a, Previously developed model, CDS, to predict a risk score that corresponds to respiratory outcomes in patients with SARS-CoV-2. b, Histogram of CORISK results at MGB, with an illustration of how the score can be used for patient triage, in which ‘A’ is an example threshold for safe discharge that has 99.5% negative predictive value, and ‘B’ is an example threshold for intensive care unit (ICU) admission that has 50.3% positive predictive value. For the purpose of the NPV calculation (threshold A), we defined the model inference to be positive if it predicted oxygen need as LFO or above (COVID risk score ≥0.25) and negative if it predicted oxygen need as RA (<0.25). We defined the disease to be negative if the patient was discharged and not readmitted, and positive if the patient was readmitted for treatment. For the purpose of the PPV calculation (threshold B), we defined the model inference to be positive if it predicted oxygen need as MV or above (≥0.75) and negative if it predicted oxygen need as HFO or less (<0.75). We defined the disease to be positive if the patient required MV or died, and negative if the patient survived and did not require MV. The EXAM score can be used in the same way. c, Federated learning using a client-server setup.
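The two-threshold triage logic described above can be sketched as follows. This is a simplified illustration using a single binary outcome per patient, whereas the study defines the positive condition differently for the NPV and PPV calculations:

```python
import numpy as np

def npv_ppv(scores, outcomes, low_t=0.25, high_t=0.75):
    """NPV at a safe-discharge threshold and PPV at an escalation
    threshold (the 0.25/0.75 cut-offs mirror those described for the
    CORISK/EXAM risk score). `outcomes`: 1 = adverse event, 0 = none."""
    scores = np.asarray(scores)
    outcomes = np.asarray(outcomes)
    neg_pred = scores < low_t    # predicted safe to discharge
    pos_pred = scores >= high_t  # predicted to need escalation
    npv = float((outcomes[neg_pred] == 0).mean()) if neg_pred.any() else float("nan")
    ppv = float((outcomes[pos_pred] == 1).mean()) if pos_pred.any() else float("nan")
    return npv, ppv

# Toy example: risk scores and observed adverse outcomes
npv, ppv = npv_ppv(
    scores=[0.1, 0.2, 0.3, 0.8, 0.9, 0.6],
    outcomes=[0, 0, 1, 1, 1, 0],
)
```

Patients scoring between the two thresholds fall into neither group and would be triaged by other criteria.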
Extended Data Fig. 10
Extended Data Fig. 10. Calibration Plots for the MGB data and the new independent dataset, CDH, used for model validation.
Fig. 1 |
Data used in the EXAM FL study. a, World map indicating the 20 different client sites contributing to the EXAM study. b, Number of cases contributed by each institution or site (client #1 represents the site contributing the largest number of cases). c, Chest X-ray intensity distribution at each client site. d, Age of patients at each client site, showing the minimum and maximum ages (asterisks), mean age (triangle) and standard deviation (horizontal bar). The number of samples at each client site is shown in Supplemental Table 1.
Fig. 2 |
Performance of federated learning versus local models. a, Performance on each client’s test set for predicting 24-h oxygen treatment for models trained on local data only (Local) versus the performance of the best global model available on the server (FL (gl. best)). “Avg.” stands for the average test performance across all sites. b, Generalizability (average performance on other sites’ test data, as represented by average AUC) as a function of a client’s dataset size (no. of cases). The green horizontal line shows the generalizability performance of the best global model. The performance for 18 of 20 clients is shown, as client 12 had outcomes only for 72 h (see Extended Data Fig. 1) and client 14 had cases only with room air treatment, such that the evaluation metric (avg. AUC) was not applicable in either of these cases (see Methods). Data for client 14 were also excluded from the computation of the average generalizability of the local models.
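The generalizability metric used in panel b — a model’s average AUC over other sites’ test data — can be sketched as follows (a toy illustration; the function names here are ours, not from the study):

```python
import numpy as np

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney U) formulation;
    assumes untied continuous scores for simplicity."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def generalizability(scores_by_site, labels_by_site, home_site):
    """Average AUC of one site's model evaluated on every *other*
    site's test data, mirroring the metric in panel b."""
    aucs = [auc(scores_by_site[s], labels_by_site[s])
            for s in scores_by_site if s != home_site]
    return float(np.mean(aucs))

# Toy example: scores produced by site 1's model on sites 2 and 3
scores = {2: [0.1, 0.4, 0.35, 0.8], 3: [0.2, 0.9, 0.6, 0.1]}
labels = {2: [0, 0, 1, 1], 3: [0, 1, 1, 0]}
g = generalizability(scores, labels, home_site=1)
```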
Fig. 3 |
Comparison of federated learning-trained to locally trained models. a, ROC at a site with unbalanced data and mostly mild cases (client-site #16). b, ROC of the local model at client-site #12 (a small dataset), mean ROC of models trained on larger datasets corresponding to the 5 client-sites in the Boston area (#1, #4, #5, #6, #8), and ROC of the best global model to predict oxygen treatment at 72 h for different thresholds of the EXAM score (left, middle, right). The mean ROC is calculated from the 5 locally trained models, and the gray area shows the standard deviation of the ROCs. ROCs for three different cut-off values t of the EXAM risk score are shown. “Pos” and “Neg” stand for the number of positive and negative cases defined by this range of the EXAM score, respectively.
Fig. 4 |
Performance of the best global model on the largest independent data set. a, Performance (ROC) (top) and confusion matrices (bottom) of the EXAM FL model on the CDH dataset for predicting oxygen treatment at 24 h. b, Performance (ROC) (top) and confusion matrices (bottom) of the EXAM FL model on the CDH dataset for predicting oxygen treatment at 72 h. ROCs for three different cut-off values t of the EXAM risk score are shown. “Pos” and “Neg” stand for the number of positive and negative cases defined by this range of the EXAM score, respectively.

Update of

  • Federated Learning used for predicting outcomes in SARS-COV-2 patients.
    Flores M, Dayan I, Roth H, Zhong A, Harouni A, Gentili A, Abidin A, Liu A, Costa A, Wood B, Tsai CS, Wang CH, Hsu CN, Lee CK, Ruan C, Xu D, Wu D, Huang E, Kitamura F, Lacey G, César de Antônio Corradi G, Shin HH, Obinata H, Ren H, Crane J, Tetreault J, Guan J, Garrett J, Park JG, Dreyer K, Juluru K, Kersten K, Bezerra Cavalcanti Rockenbach MA, Linguraru M, Haider M, AbdelMaseeh M, Rieke N, Damasceno P, Cruz E Silva PM, Wang P, Xu S, Kawano S, Sriswasdi S, Park SY, Grist T, Buch V, Jantarabenjakul W, Wang W, Tak WY, Li X, Lin X, Kwon F, Gilbert F, Kaggie J, Li Q, Quraini A, Feng A, Priest A, Turkbey B, Glicksberg B, Bizzo B, Kim BS, Tor-Diez C, Lee CC, Hsu CJ, Lin C, Lai CL, Hess C, Compas C, Bhatia D, Oermann E, Leibovitz E, Sasaki H, Mori H, Yang I, Sohn JH, Keshava Murthy KN, Fu LC, Furtado de Mendonça MR, Fralick M, Kang MK, Adil M, Gangai N, Vateekul P, Elnajjar P, Hickman S, Majumdar S, McLeod S, Reed S, Graf S, Harmon S, Kodama T, Puthanakit T, Mazzulli T, de Lima Lavor V, Rakvongthai Y, Lee YR, Wen Y. Res Sq [Preprint]. 2021 Jan 8:rs.3.rs-126892. doi: 10.21203/rs.3.rs-126892/v1. Update in: Nat Med. 2021 Oct;27(10):1735-1743. doi: 10.1038/s41591-021-01506-3. PMID: 33442676. Free PMC article. Preprint.

