Bioinformatics. 2022 Oct 31;38(21):4919-4926.
doi: 10.1093/bioinformatics/btac616.

dsMTL: a computational framework for privacy-preserving, distributed multi-task machine learning


Han Cao et al. Bioinformatics. 2022.

Abstract

Motivation: In multi-cohort machine learning studies, it is critical to differentiate between effects that are reproducible across cohorts and those that are cohort-specific. Multi-task learning (MTL) is a machine learning approach that facilitates this differentiation through the simultaneous learning of prediction tasks across cohorts. Since multi-cohort data often cannot be combined into a single storage solution, an MTL application for geographically distributed data sources would be of substantial utility.

Results: Here, we describe the development of 'dsMTL', a computational framework for privacy-preserving, distributed multi-task machine learning that includes three supervised algorithms and one unsupervised algorithm. First, we derive the theoretical properties of these methods and the relevant machine learning workflows to ensure the validity of the software implementation. Second, we implement dsMTL as a library for the R programming language, building on the DataSHIELD platform that supports the federated analysis of sensitive individual-level data. Third, we demonstrate the applicability of dsMTL for comorbidity modeling in distributed data. We show that comorbidity modeling using dsMTL outperformed conventional federated machine learning, as well as the aggregation of multiple models built individually on the distributed datasets. The application of dsMTL was computationally efficient and highly scalable when applied to moderate-size (n < 500), real expression data given the actual network latency.

Availability and implementation: dsMTL is freely available at https://github.com/transbioZI/dsMTLBase (server-side package) and https://github.com/transbioZI/dsMTLClient (client-side package).

Supplementary information: Supplementary data are available at Bioinformatics online.
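
Illustrative sketch (not part of the published abstract): the Results paragraph describes supervised multi-task learning in which individual-level data stay on their servers and only algorithm parameters are exchanged. The Python code below is a minimal sketch of that idea under stated assumptions. It assumes a least-squares loss per task, an L2,1 (row-wise group) penalty to couple the tasks, and a plain proximal gradient loop. The function names, the toy data and the hyperparameters are hypothetical and are not taken from dsMTL's R code or API.

```python
import math


def local_gradient(X, y, w):
    """Runs on one server: least-squares gradient for that server's task.

    Only this aggregate vector leaves the server; X and y stay local.
    """
    n = len(X)
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        resid = sum(xj * wj for xj, wj in zip(xi, w)) - yi
        for j, xj in enumerate(xi):
            grad[j] += xj * resid / n
    return grad


def prox_l21(W, threshold):
    """Client side: row-wise group soft-thresholding (prox of the L2,1 norm).

    W[j][k] is the weight of feature j in task k; a row is shrunk as a whole,
    so a feature is kept or dropped jointly across all tasks.
    """
    out = []
    for row in W:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out


def federated_mtl_l21(servers, n_features, lam, step, n_iter):
    """Proximal gradient loop; each element of `servers` is one task's (X, y)."""
    n_tasks = len(servers)
    W = [[0.0] * n_tasks for _ in range(n_features)]
    for _ in range(n_iter):
        # Each server receives only its own weight column and returns a gradient.
        grads = [
            local_gradient(X, y, [W[j][k] for j in range(n_features)])
            for k, (X, y) in enumerate(servers)
        ]
        W = [
            [W[j][k] - step * grads[k][j] for k in range(n_tasks)]
            for j in range(n_features)
        ]
        W = prox_l21(W, step * lam)
    return W


# Hypothetical toy data: feature 0 drives both outcomes, with different effects.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
server_a = (X, [2 * row[0] for row in X])   # task A: y = 2 * x0
server_b = (X, [-1 * row[0] for row in X])  # task B: y = -x0
W = federated_mtl_l21([server_a, server_b], n_features=3,
                      lam=0.1, step=1.0, n_iter=500)
```

In this toy setting the row of W for feature 0 stays non-zero with a different sign per task, while the rows for the two irrelevant features are shrunk to zero, which is one way to picture a shared but heterogeneous signature across tasks.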

Figures

Fig. 1.
Schematic illustration of dsMTL using comorbidity modeling of schizophrenia and cardiovascular disease as an example. Multiple datasets stored at different institutions are used as a basis for federated MTL. dsMTL was developed in the DataSHIELD ecosystem, which provides functionality regarding data management, transmission and security. Data are analyzed behind a given institution’s firewall and only algorithm parameters that do not disclose personally identifiable information are exchanged across the network. dsMTL contains algorithms for supervised and unsupervised multi-task machine learning. The former aims at identifying shared, but potentially heterogeneous signatures across tasks (here, diagnostic classification for schizophrenia and cardiovascular disease). Unsupervised learning separates the original data into shared and cohort-specific components, and aims to reveal the corresponding outcome-associated biological profiles.
Fig. 2.
Analysis of ‘heterogeneous’ signatures of continuous outcomes in simulated data stored on three servers. The figure shows (a) the prediction accuracy, expressed as the mean squared error, and (b) the feature selection accuracy for different subject/feature number ratios. The respective values were averaged across the three servers and across 100 repetitions to account for the effect of sampling variability.
Fig. 3.
The gene identification accuracy for shared and specific signatures using simulated data. (a) The identification accuracy of cohort-specific genes for cohort 1. (b) The identification accuracy of cohort-specific genes for cohort 2. (c) The identification accuracy of genes contained in the shared signature. Local-NMF1 and Local-NMF2 were the cohort-specific gene sets identified by local NMF, which were combined into ‘NMF-bagging’ for the shared gene set. dsMTL_iNMF-H was the predicted shared gene set using dsMTL_iNMF. dsMTL_iNMF-V1 and dsMTL_iNMF-V2 were the predicted cohort-specific gene sets identified using dsMTL_iNMF. The proportion of genes harbored by the shared signature varied from 20% to 80%, illustrating the impact of the severity of heterogeneity. The model was trained using rank = 4 as the model parameter. The results for a broader spectrum of rank choices can be found in Supplementary Figure S5, illustrating that the superior performance of dsMTL_iNMF was not due to the choice of ranks.
Fig. 4.
Scalability analysis for up to 20 servers. (a) The result of the dsMTL_L21 method. (b) The result of the dsLasso method. Both panels show the communication cost (e.g. the number of network accesses) with an increasing number of servers, and for different subject/feature number ratios.
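
Illustrative sketch (not part of the published figures): the caption of Fig. 2 reports prediction accuracy as the mean squared error and a feature selection accuracy, each averaged across servers. The Python code below shows how such per-server metrics might be computed and averaged. The abstract does not define feature selection accuracy, so the definition used here (the fraction of features whose selected or not-selected status matches the simulated ground truth) is an assumption, and all function names and numbers are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted outcomes."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def feature_selection_accuracy(true_support, weights, tol=1e-8):
    """Assumed definition: share of features whose selected / not-selected
    status (non-zero weight or not) agrees with the ground-truth support."""
    selected = [abs(w) > tol for w in weights]
    return sum(s == t for s, t in zip(selected, true_support)) / len(weights)


def average_over_servers(per_server_values):
    """Plain mean of one metric value per server."""
    return sum(per_server_values) / len(per_server_values)


# Hypothetical per-server outcomes and predictions for three servers.
per_server_mse = [
    mean_squared_error([1, 2, 3], [1, 2, 4]),
    mean_squared_error([0, 0], [1, -1]),
    mean_squared_error([2, 2], [2, 2]),
]
avg_mse = average_over_servers(per_server_mse)

# Hypothetical ground-truth support and fitted weights for one server.
fs_acc = feature_selection_accuracy(
    [True, False, False, True], [0.5, 0.0, 0.2, 0.0]
)
```

Only the per-server metric values need to be combined for the average, so a summary of this kind does not require pooling the underlying individual-level data.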
