Multicenter Study
PLoS One. 2023 Jan 17;18(1):e0280192.
doi: 10.1371/journal.pone.0280192. eCollection 2023.

A privacy-preserving and computation-efficient federated algorithm for generalized linear mixed models to analyze correlated electronic health records data

Zhiyu Yan et al. PLoS One.

Abstract

Large collaborative research networks provide opportunities to jointly analyze multicenter electronic health record (EHR) data, which can improve the sample size, diversity of the study population, and generalizability of the results. However, there are challenges to analyzing multicenter EHR data including privacy protection, large-scale computation resource requirements, heterogeneity across sites, and correlated observations. In this paper, we propose a federated algorithm for generalized linear mixed models (Fed-GLMM), which can flexibly model multicenter longitudinal or correlated data while accounting for site-level heterogeneity. Fed-GLMM can be applied to both federated and centralized research networks to enable privacy-preserving data integration and improve computational efficiency. By communicating a limited amount of summary statistics, Fed-GLMM can achieve nearly identical results as the gold-standard method where the GLMM is directly fitted to the pooled dataset. We demonstrate the performance of Fed-GLMM in numerical experiments and an application to longitudinal EHR data from multiple healthcare facilities.

Conflict of interest statement

I have read the journal’s policy and the authors of this manuscript have the following competing interests: We confirm our competing interest statement for the grants from the National Institute of Neurological Disorders and Stroke and the Patient-Centered Outcomes Research Institute and fees from LifeImage reported by Dr. Lee Schwamm. The grants and fees are outside the submitted work and not funders of the study. We confirm that the competing interests do not alter our adherence to PLOS ONE policies on sharing data and materials.

Figures

Fig 1. Schematic overview of Fed-GLMM.
Fed-GLMM enables the joint implementation of a GLMM for EHRs from multiple sites without sharing individual-level data. In step 1, each site fits a GLMM locally to obtain the initial parameter estimates. In step 2, each site calculates intermediate summary statistics evaluated at the initial values and broadcasts them to the central analytics. For the k-th site, these summary statistics are denoted s_k and H_k, and they are functions of the local data D_k, the common parameter value β̄, and the site-specific parameter value δ̄_k. The local data D_k comprises the local design matrix for the common fixed effect X_k, the local design matrix for the site-specific fixed effect W_k, and the local outcome vector y_k. The site-specific parameter value δ̄_k comprises the value of the site-specific fixed effect ᾱ_k and the value of the site-specific variance parameter γ̄_k. In step 3, the central analytics combines all the local intermediate results to construct a surrogate global likelihood function that provides updated parameter estimates. Steps 2–3 can be repeated to keep updating the parameter estimates.
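
The summarize-then-combine loop in steps 2–3 can be illustrated with a deliberately simplified sketch. It is not the Fed-GLMM algorithm: it assumes a plain logistic regression with no random effects and no site-specific parameters, and it assumes the per-site summaries (written s_k and H_k here) are the gradient and negative Hessian of the local log-likelihood. The function names `local_summaries`, `central_update`, and `fed_fit` are hypothetical.

```python
import numpy as np


def local_summaries(X, y, beta):
    """Site side: summaries of the local logistic log-likelihood at beta.

    Assumption: s_k is the gradient and H_k the negative Hessian.
    Only these aggregates leave the site, never the rows of X or y.
    """
    p = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities
    s_k = X.T @ (y - p)                      # gradient (score)
    H_k = X.T @ (X * (p * (1.0 - p))[:, None])  # negative Hessian
    return s_k, H_k


def central_update(summaries, beta):
    """Central side: add up the site summaries and take one Newton step."""
    s = sum(s_k for s_k, _ in summaries)
    H = sum(H_k for _, H_k in summaries)
    return beta + np.linalg.solve(H, s)


def fed_fit(sites, beta0, n_iter=8):
    """Repeat the summarize/combine cycle a fixed number of times."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        summaries = [local_summaries(X, y, beta) for X, y in sites]
        beta = central_update(summaries, beta)
    return beta


if __name__ == "__main__":
    # Synthetic data split across three hypothetical sites.
    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(600), rng.normal(size=(600, 2))])
    true_beta = np.array([-0.5, 1.0, 0.5])
    y = (rng.random(600) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
    sites = [(X[i::3], y[i::3]) for i in range(3)]
    print(fed_fit(sites, np.zeros(3)))
```

In this simplified setting the summed summaries equal the pooled gradient and Hessian, so the federated iterates coincide with pooled Newton steps. With random effects and site-specific parameters, as in the GLMM described above, the surrogate likelihood construction is more involved.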
Fig 2. Accuracy of Fed-GLMM and meta-analysis estimates relative to the gold-standard pooled analysis.
We compared the accuracy of Fed-GLMM with that of the standard meta-analysis by calculating the median absolute relative difference from the gold-standard pooled estimate of the coefficient of a binary exposure variable. The underlying model has a binary outcome, a binary exposure, three additional covariates (with 8 site-specific fixed effect coefficients for the normally distributed covariate), and a patient-level random intercept. The model also includes 8 site-specific parameters for variance components. We considered 25 combinations of outcome and exposure prevalence, with 100 simulation replicates per combination, to assess model accuracy. Compared with the meta-analysis, which was highly biased in the presence of rare events, Fed-GLMM showed reduced relative bias after 1–2 iterations.
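
A minimal sketch of the accuracy metric named in this caption, assuming it is computed as the median over simulation replicates of |estimate − pooled| / |pooled|. The function name and the example numbers are hypothetical and are not results from the paper.

```python
import numpy as np


def median_abs_relative_difference(estimates, pooled_estimates):
    """Median over replicates of |estimate - pooled| / |pooled|.

    Both arguments hold one coefficient estimate per simulation replicate.
    """
    est = np.asarray(estimates, dtype=float)
    pooled = np.asarray(pooled_estimates, dtype=float)
    return float(np.median(np.abs(est - pooled) / np.abs(pooled)))


# Hypothetical replicate-level estimates of the exposure coefficient.
method = [1.1, 0.9, 2.0]
pooled = [1.0, 1.0, 1.0]
print(median_abs_relative_difference(method, pooled))
```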
Fig 3. Comparison of computation time and estimate accuracy for Fed-GLMM and meta-analysis relative to gold-standard pooled analysis with increasing computing nodes/EHR subsets.
We compared Fed-GLMM with meta-analysis using the ratio (in percentage) of each method's computation time to that of the pooled analysis. For each simulation replicate, we generated a single centralized EHR dataset. The underlying model has a binary outcome, a binary exposure, three additional covariates, and a patient-level random intercept. We divided the centralized EHR data into varying numbers of subsets to be computed in parallel. With more than 20 computing nodes, both Fed-GLMM and the meta-analysis took less than 5% of the computation time required by the pooled analysis. However, the relative bias of the meta-analysis for the exposure coefficient increased with the number of subsets, while Fed-GLMM retained its accuracy relative to the pooled analysis. The points and bars represent the median and interquartile range, respectively, of computation time and relative bias, both in percentage.
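
The summary plotted in this figure can be sketched as follows, assuming one timing per simulation replicate for the method and for the pooled analysis. The function name `time_ratio_summary` and the timings are hypothetical, not values from the paper.

```python
import numpy as np


def time_ratio_summary(method_seconds, pooled_seconds):
    """Computation time as a percentage of the pooled analysis.

    Returns (median, first quartile, third quartile) across replicates.
    """
    method = np.asarray(method_seconds, dtype=float)
    pooled = np.asarray(pooled_seconds, dtype=float)
    ratios = 100.0 * method / pooled
    q1, median, q3 = np.percentile(ratios, [25, 50, 75])
    return float(median), float(q1), float(q3)


# Hypothetical timings in seconds for three replicates.
print(time_ratio_summary([1.0, 2.0, 3.0], [100.0, 100.0, 100.0]))
```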
Fig 4. Adjusted odds ratios of virtual vs. in-person visit by patient and visit characteristics.
The forest plot shows the adjusted odds ratios obtained through Fed-GLMM for all facilities (federated setting, to demonstrate privacy preservation) and for a single facility (centralized setting, to demonstrate computation improvement). The points and bars represent the point estimates and 95% confidence intervals, respectively. Abbreviations: OR—Odds Ratio; NH—Non-Hispanic; LEP—Limited English Proficiency.
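
For readers unfamiliar with how such points and bars are usually derived from a logistic-type model, the sketch below converts a log-odds coefficient and its standard error into an odds ratio with a confidence interval. It assumes a Wald-type interval, which the caption does not confirm, and the function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(log_odds, std_error, level=0.95):
    """Odds ratio and Wald-type confidence interval from a log-odds coefficient.

    Assumption: the interval is exp(coefficient +/- z * standard error).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level=0.95
    return (
        math.exp(log_odds),
        math.exp(log_odds - z * std_error),
        math.exp(log_odds + z * std_error),
    )


# Hypothetical coefficient and standard error.
print(odds_ratio_ci(0.4, 0.1))
```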
