dsMTL: a computational framework for privacy-preserving, distributed multi-task machine learning

Affiliations

¹ Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim 68158, Germany.
² Health Data Science Unit, Medical Faculty Heidelberg & BioQuant, Heidelberg 69120, Germany.
³ Chair of Computational Systems Biology, University of Hamburg, Hamburg 22607, Germany.
⁴ Computational Biomedicine Lab, Department of Mathematics and Computer Science, University of Southern Denmark, Odense 5230, Denmark.
⁵ Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne NE2 4AX, UK.
⁶ Department of Psychiatry and Psychotherapy, Section for Neurodiagnostic Applications, Ludwig-Maximilian University, Munich 80638, Germany.
⁷ Epigeny, St Ouen, France.

PMID: 36073911
PMCID: PMC9620828
DOI: 10.1093/bioinformatics/btac616

Han Cao et al. Bioinformatics. 2022 Oct 31;38(21):4919-4926. doi: 10.1093/bioinformatics/btac616.

Abstract

Motivation: In multi-cohort machine learning studies, it is critical to differentiate between effects that are reproducible across cohorts and those that are cohort-specific. Multi-task learning (MTL) is a machine learning approach that facilitates this differentiation through the simultaneous learning of prediction tasks across cohorts. Since multi-cohort data often cannot be combined into a single storage solution, an MTL application for geographically distributed data sources would be of substantial utility.

Results: Here, we describe the development of 'dsMTL', a computational framework for privacy-preserving, distributed multi-task machine learning that includes three supervised algorithms and one unsupervised algorithm. First, we derive the theoretical properties of these methods and the relevant machine learning workflows to ensure the validity of the software implementation. Second, we implement dsMTL as a library for the R programming language, building on the DataSHIELD platform that supports the federated analysis of sensitive individual-level data. Third, we demonstrate the applicability of dsMTL for comorbidity modeling in distributed data. We show that comorbidity modeling using dsMTL outperformed conventional, federated machine learning, as well as the aggregation of multiple models built on the distributed datasets individually. When applied to real, moderate-size (n < 500) expression data under actual network latency, dsMTL was computationally efficient and highly scalable.

Availability and implementation: dsMTL is freely available at https://github.com/transbioZI/dsMTLBase (server-side package) and https://github.com/transbioZI/dsMTLClient (client-side package).

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Schematic illustration of dsMTL using comorbidity modeling of schizophrenia and cardiovascular disease as an example. Multiple datasets stored at different institutions are used as a basis for federated MTL. dsMTL was developed in the DataSHIELD ecosystem, which provides functionality regarding data management, transmission and security. Data are analyzed behind a given institution’s firewall and only algorithm parameters that do not disclose personally identifiable information are exchanged across the network. dsMTL contains algorithms for supervised and unsupervised multi-task machine learning. The former aims at identifying shared, but potentially heterogeneous signatures across tasks (here, diagnostic classification for schizophrenia and cardiovascular disease). Unsupervised learning separates the original data into shared and cohort-specific components, and aims to reveal the corresponding outcome-associated biological profiles.

**Fig. 2.**
Analysis of ‘heterogeneous’ signatures of continuous outcomes in simulated data stored on three servers. The figure shows (a) the prediction accuracy, expressed as the mean squared error, and (b) the feature selection accuracy for different subject/feature number ratios. The respective values were averaged across the three servers and across 100 repetitions to account for the effect of sampling variability.

**Fig. 3.**
The gene identification accuracy for shared and specific signatures using simulated data. (a) The identification accuracy of cohort-specific genes for cohort 1. (b) The identification accuracy of cohort-specific genes for cohort 2. (c) The identification accuracy of genes comprised in the shared signature. Local-NMF1 and Local-NMF2 were the cohort-specific gene sets identified by local NMF, which were combined into ‘NMF-bagging’ for the shared gene set. dsMTL_iNMF-H was the predicted shared gene set using dsMTL_iNMF. dsMTL_iNMF-V1 and dsMTL_iNMF-V2 were the predicted cohort-specific gene sets identified using dsMTL_iNMF. The proportion of genes harbored by the shared signature varied from 20% to 80%, illustrating the impact of the severity of heterogeneity. The model was trained using rank = 4 as the model parameter. The results for a broader spectrum of rank choices can be found in Supplementary Figure S5, illustrating that the superior performance of dsMTL_iNMF was not due to the choice of ranks.

**Fig. 4.**
Scalability analysis for up to 20 servers. (a) The result of the dsMTL_L21 method. (b) The result of the dsLasso method. Both panels show the communication cost (e.g. the number of network accesses) with an increasing number of servers, and for different subject/feature number ratios.
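
The exchange pattern described in Fig. 1, where raw data stay on each server and only model-level quantities travel, can be illustrated with a minimal sketch of an L2,1-regularized multi-task regression (the kind of penalty suggested by the method name dsMTL_L21 in Fig. 4). This is not the dsMTL or DataSHIELD API: the function names `local_gradient`, `prox_l21`, `federated_l21` and `simulate_server` are hypothetical, the loss is assumed to be least squares, the "servers" are simulated in one Python process, and no disclosure checks are modeled. Each server returns only the gradient for its own task; the client takes a gradient step and then applies row-wise group soft-thresholding, which keeps or drops a feature jointly across all tasks while allowing task-specific coefficients.

```python
import math
import random


def local_gradient(X, y, w):
    """Server side (assumed): least-squares gradient for this server's task.

    Only this length-p vector leaves the server; X and y never do.
    """
    n, p = len(X), len(w)
    grad = [0.0] * p
    for row, target in zip(X, y):
        residual = sum(a * b for a, b in zip(row, w)) - target
        for j in range(p):
            grad[j] += row[j] * residual / n
    return grad


def prox_l21(W, threshold):
    """Client side: proximal operator of threshold * sum_j ||W[j]||_2.

    Each row W[j] holds one feature's coefficients across all tasks and is
    shrunk toward zero as a group (set exactly to zero if its norm is small).
    """
    out = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out


def federated_l21(servers, p, lam=0.1, step=0.3, iters=300):
    """Proximal gradient descent with one task (cohort) per simulated server."""
    n_tasks = len(servers)
    W = [[0.0] * n_tasks for _ in range(p)]  # W[j][t]: feature j, task t
    for _ in range(iters):
        # One "network round": every server reports the gradient for its task.
        grads = [
            local_gradient(X, y, [W[j][t] for j in range(p)])
            for t, (X, y) in enumerate(servers)
        ]
        W = [[W[j][t] - step * grads[t][j] for t in range(n_tasks)]
             for j in range(p)]
        W = prox_l21(W, step * lam)
    return W


def simulate_server(rng, n, true_w, noise=0.1):
    """Create one synthetic cohort: Gaussian features, linear outcome."""
    X = [[rng.gauss(0.0, 1.0) for _ in true_w] for _ in range(n)]
    y = [sum(a * b for a, b in zip(row, true_w)) + rng.gauss(0.0, noise)
         for row in X]
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    # Features 0 and 1 are relevant in every cohort, with cohort-specific
    # effect sizes and signs (a "heterogeneous" shared signature).
    truth = [[2.0, -1.5, 0, 0, 0, 0],
             [1.0, 1.0, 0, 0, 0, 0],
             [-1.0, 2.0, 0, 0, 0, 0]]
    servers = [simulate_server(rng, 200, w) for w in truth]
    W = federated_l21(servers, p=6)
    for j, row in enumerate(W):
        print(j, [round(v, 3) for v in row])
```

In this toy setting the number of rounds equals `iters`, and each round costs one request per server, which is one simple way to think about the communication cost plotted in Fig. 4; the actual accounting used in the paper may differ.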

Grants and funding

NCT00001260/Intramural Research Program of the NIMH
