Patterns (N Y). 2024 Mar 1;5(3):100945. doi: 10.1016/j.patter.2024.100945. eCollection 2024 Mar 8.

Federated learning for multi-omics: A performance evaluation in Parkinson's disease

Benjamin P Danek et al.

Abstract

While machine learning (ML) research has recently grown in popularity, its application in the omics domain is constrained by access to the sufficiently large, high-quality datasets needed to train ML models. Federated learning (FL) represents an opportunity to enable collaborative curation of such datasets among participating institutions. We compare the simulated performance of several models trained using FL against classically trained ML models on the task of multi-omics Parkinson's disease prediction. We find that FL model performance tracks that of centrally trained ML models, where the most performant FL model achieves an AUC-PR of 0.876 ± 0.009, 0.014 ± 0.003 less than its centrally trained counterpart. We also determine that the dispersion of samples within a federation plays a meaningful role in model performance. Our study implements several open-source FL frameworks and aims to highlight some of the challenges and opportunities when applying these collaborative methods in multi-omics studies.

Keywords: Parkinson’s disease diagnosis; federated learning; machine learning; omics data analysis.


Conflict of interest statement

B.P.D., A.D., D.V., M.A.N., and F.F. declare the following competing financial interests, as their participation in this project was part of a competitive contract awarded to Data Tecnica LLC by the National Institutes of Health to support open science research. M.A.N. also currently serves on the scientific advisory board for Character Bio and is an advisor to Neuron23 Inc. The study’s funders had no role in the study design, data collection, data analysis, data interpretation, or writing of the report. F.F. takes final responsibility for the decision to submit the paper for publication.

Figures

Figure 1
Experiment workflow diagram and data summary. The harmonized and joint-called PPMI and PDBP cohorts originate from the AMP-PD initiative. The PPMI cohort is split into K folds, where one fold is held out as an internal test set and the remaining folds are used for model fitting. The training folds are further split 80:20 into training and validation sets. The training split is distributed among n clients using one of the split strategies to simulate the cross-silo collaborative training setting. FL methods consist of a local learner and an aggregation method. Similarly, several central algorithms are used to fit the training data. The resulting global FL models and the centrally trained ML models are tested on the PPMI holdout fold (internal test) and the whole PDBP test set (external test).
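To make the partitioning in Figure 1 concrete, the sketch below simulates the nested split with scikit-learn utilities: K outer folds, an 80:20 training/validation split, and a label-stratified partition of the training split across clients. The synthetic feature matrix, n_clients = 2, and the helper name split_for_clients are illustrative assumptions, not the study's code.

    # Illustrative sketch of the Figure 1 splitting workflow (not the authors' code).
    # Synthetic data stand in for the harmonized PPMI features and labels.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold, train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 100))      # stand-in multi-omics feature matrix
    y = rng.integers(0, 2, size=600)     # stand-in case/control labels

    def split_for_clients(X_train, y_train, n_clients=2, seed=0):
        """Label-stratified random partition of the training split across clients."""
        skf = StratifiedKFold(n_splits=n_clients, shuffle=True, random_state=seed)
        return [(X_train[idx], y_train[idx]) for _, idx in skf.split(X_train, y_train)]

    outer = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
    for fit_idx, test_idx in outer.split(X, y):
        # One fold is the internal holdout; the remaining folds are used for fitting.
        X_fit, y_fit = X[fit_idx], y[fit_idx]
        X_tr, X_val, y_tr, y_val = train_test_split(
            X_fit, y_fit, test_size=0.2, stratify=y_fit, random_state=0)
        client_shards = split_for_clients(X_tr, y_tr, n_clients=2)
        # client_shards plays the role of the siloed client datasets in Figure 2.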
Figure 2
Federated architecture and training summary. The diagram shows the FL architecture used in the study and illustrates one round of FL training for the case of n = 3 clients. The aggregation server aggregates trained local learner parameters from the clients and computes a global model. Client sites hold their own siloed datasets, each with different samples. The trained client parameters are represented by the blue, orange, and green weights; the black weights represent the aggregated global model. Client model aggregation implemented by the FL strategy is denoted by f. Once the global weights are computed, a copy is sent to each client; the global model is used to initialize the local learner model weights in subsequent FL training rounds.
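As a rough illustration of the aggregation step denoted f in Figure 2, the snippet below performs sample-size-weighted federated averaging (a FedAvg-style f) over client parameter arrays; the parameter layout, client sizes, and variable names are assumptions made for illustration rather than the study's implementation.

    # Sketch of the aggregation f from Figure 2: weighted federated averaging.
    # The per-client parameter layout (a list of numpy arrays) is an assumption.
    import numpy as np

    def fedavg(client_weights, client_sizes):
        """Average each parameter array across clients, weighted by sample count."""
        total = float(sum(client_sizes))
        n_arrays = len(client_weights[0])
        return [
            sum(w[k] * (n / total) for w, n in zip(client_weights, client_sizes))
            for k in range(n_arrays)
        ]

    # One FL round with n = 3 clients (the blue, orange, and green weights):
    rng = np.random.default_rng(0)
    clients = [[rng.normal(size=(10, 4)), rng.normal(size=4)] for _ in range(3)]
    sizes = [120, 80, 100]                 # samples held at each silo
    global_model = fedavg(clients, sizes)  # the black (aggregated) weights
    # The global weights would then re-initialize each local learner in the next round.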
Figure 3
Federated learning models trained using publicly available and accessible frameworks follow central model performance. Area under the precision-recall curve (AUC-PR) comparing central algorithms against federated algorithms. We pair FL algorithms with central algorithms by the local learning algorithm applied at client sites. Federated algorithms receive the training dataset split across n = 2 clients using label-stratified random sampling. Presented data are the mean score and standard deviation resulting from cross-validation.
Figure 4
Sample dispersion among client sites negatively impacts global model performance. For a fixed training dataset, the AUC-PR of federated algorithms is shown as the number of client sites increases. Training data are split uniformly among the members of the federation using stratified random sampling. The PDBP and PPMI datasets are used for external and internal validation, respectively. Presented data are the mean score and standard deviation resulting from cross-validation.
Figure 5
Data heterogeneity at client sites does not deeply influence model performance. The AUC-PR for a federation of two clients is shown for several split methods. Uniform stratified sampling represents the most homogeneous data-distribution method, while uniform random and linear random represent increasingly heterogeneous client distributions. Presented data are the mean score and standard deviation resulting from cross-validation.
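For concreteness, the client split strategies compared in Figure 5 can be sketched as follows. Uniform stratified sampling corresponds to the stratified partition sketched after Figure 1; the caption does not specify the proportions behind the linear random split, so the linearly increasing shard sizes below are an assumption.

    # Illustrative sketch of the "uniform random" and "linear random" split methods.
    import numpy as np

    def uniform_random_split(indices, n_clients, rng):
        """Equal-sized shards drawn at random, labels ignored."""
        return np.array_split(rng.permutation(indices), n_clients)

    def linear_random_split(indices, n_clients, rng):
        """Randomly drawn shards whose sizes grow linearly across clients (assumed)."""
        shuffled = rng.permutation(indices)
        weights = np.arange(1, n_clients + 1, dtype=float)
        cut_points = (np.cumsum(weights / weights.sum())[:-1] * len(indices)).astype(int)
        return np.split(shuffled, cut_points)

    rng = np.random.default_rng(0)
    idx = np.arange(480)
    print([len(s) for s in uniform_random_split(idx, 2, rng)])  # [240, 240]
    print([len(s) for s in linear_random_split(idx, 2, rng)])   # [160, 320]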
Figure 6
The mean runtime to train FL models using the FedAvg and FedProx strategies. The mean total runtime in seconds to train FL models. FL models are trained on the PPMI training folds for five communication rounds. Algorithms are grouped by aggregation strategy. Results are presented as the mean and standard deviation over K = 6 folds.
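The FedAvg and FedProx strategies timed in Figure 6 share the aggregation step sketched after Figure 2; FedProx differs by adding a proximal term to each client's local objective so that local updates do not drift far from the current global model. A minimal sketch of that local objective is below; the squared-error data term and mu = 0.1 are arbitrary illustrative choices, not values from the study.

    # Sketch of a FedProx-style local objective: data-fit loss plus a proximal
    # penalty toward the current global weights. Loss choice and mu are assumptions.
    import numpy as np

    def fedprox_local_loss(w_client, w_global, X, y, mu=0.1):
        """Local loss = mean squared error + (mu / 2) * ||w_client - w_global||^2."""
        data_loss = np.mean((X @ w_client - y) ** 2)
        proximal = 0.5 * mu * np.sum((w_client - w_global) ** 2)
        return data_loss + proximal

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(50, 8)), rng.normal(size=50)
    w_global = np.zeros(8)                           # weights broadcast by the server
    w_client = w_global + 0.01 * rng.normal(size=8)  # locally updated client weights
    print(fedprox_local_loss(w_client, w_global, X, y))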

