2024 Nov 14:12:e53622.
doi: 10.2196/53622.

Distributed Statistical Analyses: A Scoping Review and Examples of Operational Frameworks Adapted to Health Analytics

Félix Camirand Lemyre et al. JMIR Med Inform.

Abstract

Background: Data from multiple organizations are crucial for advancing learning health systems. However, ethical, legal, and social concerns may restrict the use of standard statistical methods that rely on pooling data. Although distributed algorithms offer alternatives, they may not always be suitable for health frameworks.

Objective: This study aims to support researchers and data custodians in three ways: (1) providing a concise overview of the literature on statistical inference methods for horizontally partitioned data, (2) describing the methods applicable to generalized linear models (GLMs) and assessing their underlying distributional assumptions, and (3) adapting existing methods to make them fully usable in health settings.

Methods: A scoping review methodology was used for the literature mapping, from which methods presenting a methodological framework for GLM analyses with horizontally partitioned data were identified and assessed from the perspective of applicability in health settings. Statistical theory was used to adapt methods and derive the properties of the resulting estimators.

Results: From the review, 41 articles were selected and 6 approaches were extracted to conduct standard GLM-based statistical analysis. However, these approaches assumed evenly and identically distributed data across nodes. Consequently, statistical procedures were derived to accommodate uneven node sample sizes and heterogeneous data distributions across nodes. Workflows and detailed algorithms were developed to highlight information sharing requirements and operational complexity.

Conclusions: This study contributes to the field of health analytics in three ways: it provides an overview of the methods that can be used with horizontally partitioned data, adapts these methods to the context of heterogeneous health data, and clarifies the workflows and quantities exchanged by the methods discussed. Further analysis of the confidentiality preserved by these methods is needed to fully understand the risk associated with the sharing of summary statistics.

Keywords: GLMs; algorithms; data custodians; data science; distributed algorithms; distributed analysis; federated analysis; generalized linear models; horizontally partitioned data; learning health systems; review methods; scoping; searches; statistics; synthesis.

Conflict of interest statement

Conflicts of Interest: None declared.

Figures

Figure 1. Article selection process for the scoping review. Detailed inclusion and exclusion criteria are described in the text and in the protocol.
Figure 2. Workflow I: each node calculates summary statistics from its own samples. Results are sent to the coordinating center, which combines the information provided by each node to produce the final estimates.
Figure 3. Workflow II: multiple communication rounds are allowed between the coordinating center and the data storage nodes.
Figure 4. Workflow III: multiple communication rounds are allowed between the coordinating center and the data storage nodes, with node 1 following a distinct communication pattern compared to the other nodes.
Figure 5. Workflow IV: multiple communication rounds are allowed between the coordinating center (CC) and the data storage nodes, with 2 distinct back-and-forth communication exchanges between each node and the CC at each iteration.
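
The one-shot pattern of Workflow I can be illustrated with a minimal sketch. It is not one of the approaches reviewed in the article; it assumes the simplest GLM (a Gaussian model with identity link, one covariate, and an intercept), for which sums and cross-products are sufficient statistics, so the coordinating center recovers the pooled least-squares fit without seeing individual records. The names `NodeSummary`, `summarize_node`, and `combine` are hypothetical, as are the toy data.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class NodeSummary:
    """Aggregates a node sends to the coordinating center (hypothetical layout).

    Only these five numbers leave the node, never individual-level rows.
    """
    n: int
    sum_x: float
    sum_y: float
    sum_xx: float
    sum_xy: float


def summarize_node(x: Sequence[float], y: Sequence[float]) -> NodeSummary:
    """Runs locally at a data storage node on its own samples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return NodeSummary(
        n=len(x),
        sum_x=sum(x),
        sum_y=sum(y),
        sum_xx=sum(xi * xi for xi in x),
        sum_xy=sum(xi * yi for xi, yi in zip(x, y)),
    )


def combine(summaries: List[NodeSummary]) -> Tuple[float, float]:
    """Runs at the coordinating center; returns (intercept, slope).

    Adding the node-level sums reproduces the sums of the pooled data,
    so the result equals ordinary least squares on the pooled records.
    """
    n = sum(s.n for s in summaries)
    sx = sum(s.sum_x for s in summaries)
    sy = sum(s.sum_y for s in summaries)
    sxx = sum(s.sum_xx for s in summaries)
    sxy = sum(s.sum_xy for s in summaries)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("covariate has no variation across the pooled data")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Toy data with deliberately uneven node sizes (3 records vs 1 record).
node_a = summarize_node([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
node_b = summarize_node([3.0], [7.0])
intercept, slope = combine([node_a, node_b])
```

Because the node contributions are summed rather than averaged, uneven node sizes need no special handling in this particular case. For non-Gaussian GLMs the sufficient-statistic shortcut generally does not apply, which is one reason iterative patterns such as Workflows II to IV exist.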
