BMC Bioinformatics. 2020 Jan 9;21(1):9.
doi: 10.1186/s12859-019-3286-3.

Multiset sparse partial least squares path modeling for high dimensional omics data analysis

Attila Csala et al. BMC Bioinformatics.

Abstract

Background: Recent technological developments have enabled the measurement of a plethora of biomolecular data from various omics domains, and research is ongoing on statistical methods that leverage these omics data to better model and understand biological pathways and the genetic architectures of complex phenotypes. Current reviews report that the simultaneous analysis of multiple (i.e. three or more) high dimensional omics data sources remains challenging and that suitable statistical methods are unavailable. Frequently mentioned challenges are the failure to account for the hierarchical structure between omics domains and the difficulty of interpreting genome-wide results. This study aims to address these challenges. We propose multiset sparse Partial Least Squares path modeling (msPLS), a generalized penalized form of Partial Least Squares path modeling, for the simultaneous modeling of biological pathways across multiple omics domains. msPLS simultaneously models the effect of multiple molecular markers, from multiple omics domains, on the variation of multiple phenotypic variables, while accounting for the relationships between data sources, and provides sparse results. The sparsity in the model helps to provide interpretable results from analyses of hundreds of thousands of biomolecular variables.

Results: With simulation studies, we quantified the ability of msPLS to discover associated variables among high dimensional data sources. Furthermore, we analysed high dimensional omics datasets to explore biological pathways associated with Marfan syndrome and with Chronic Lymphocytic Leukaemia. Additionally, we compared the results of msPLS to the results of Multi-Omics Factor Analysis (MOFA), which is an alternative method to analyse this type of data.

Conclusions: msPLS is a multiset multivariate method for the integrative analysis of multiple high dimensional omics data sources. It accounts for the relationship between multiple high dimensional data sources while providing interpretable results through its sparse solutions. The biomarkers found by msPLS in the omics datasets can be interpreted in terms of biological pathways associated with the pathophysiology of Marfan syndrome and of Chronic Lymphocytic Leukaemia. Additionally, msPLS outperforms MOFA in terms of variation explained in the Chronic Lymphocytic Leukaemia dataset, while it identifies the two most important clinical markers for Chronic Lymphocytic Leukaemia.

Availability: http://uva.csala.me/mspls. https://github.com/acsala/2018_msPLS.

Keywords: High dimensional omics data; Multivariate analysis; Partial least squares.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
msPLS identified a combination of 40 epigenomic markers (denoted as ξ1) and 52 transcriptomic markers (denoted as ξ2) that explain the most variance in the proteome variables. The colour scale represents the strength of the regression weights w
Fig. 2
msPLS identified 40 methylation markers and 52 gene expression markers that optimised the sum of squared correlations of the explanatory LVs of the epigenome and transcriptome with the MVs from the proteome
Fig. 3
The resulting model from Section 3.2 extended to two LVs per dataset. The first set of LVs ξ1(1) and ξ2(1) partitions out a different portion of variance in the proteome MVs than the second set of LVs ξ1(2) and ξ2(2). The colour scale represents the strength of the regression weights w
Fig. 4
The co-expression pattern of the resulting Marfan genes, queried for their biological-process-based functions. The figure was produced with GeneMania (available at https://genemania.org)
Fig. 5
The samples of the CLL data cluster by their IGHV and trisomy 12 status, as extracted by the first and second LVs of the msPLS model. The figure was produced by the MOFA R package [11]
Fig. 6
The proposed relationship between three data sources. X1 and X2 have a symmetric relation (i.e. they are responses for each other) and X3 has an asymmetric relation with both X1 and X2 (i.e. X3 is a response for both X1 and X2)
Fig. 7
The null distributions of the optimisation criteria (with respect to X3) for the simulated data with different sample sizes (n = 50, 100, 250), obtained after 1000 permutations. The red bars indicate the optimisation criteria obtained by applying msPLS to the original data with the optimal λ1 parameters for UST. The red dots are the bootstrapped values, and the dashed red bars are the 95% confidence intervals
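
The captions for Fig. 2 and Fig. 7 mention two ideas that a small example may help make concrete: a criterion based on the sum of squared correlations between a latent variable (LV) and the response variables, and a null distribution of that criterion obtained by permutation. The sketch below is a minimal illustration under stated assumptions and is not the authors' msPLS implementation. The function names (`sum_sq_corr`, `sparse_weights`, `criterion`, `permutation_null`), the simulated data, and the stand-in fitting step (the first singular vector of the cross-product matrix, hard-thresholded to a fixed number of non-zero weights) are all hypothetical choices made for illustration.

```python
import numpy as np


def sum_sq_corr(lv, Y):
    """Sum of squared Pearson correlations between one LV and each column of Y."""
    lv_c = lv - lv.mean()
    Y_c = Y - Y.mean(axis=0)
    r = (lv_c @ Y_c) / (np.linalg.norm(lv_c) * np.linalg.norm(Y_c, axis=0))
    return float(np.sum(r ** 2))


def sparse_weights(X, Y, n_keep):
    """Stand-in for a sparse fit (an assumption, not msPLS itself):
    take the first left singular vector of X^T Y and keep only the
    n_keep largest weights in absolute value."""
    X_c = X - X.mean(axis=0)
    Y_c = Y - Y.mean(axis=0)
    u, _, _ = np.linalg.svd(X_c.T @ Y_c, full_matrices=False)
    w = u[:, 0].copy()
    drop = np.argsort(np.abs(w))[:-n_keep]  # indices of all but the top n_keep
    w[drop] = 0.0
    return w


def criterion(X, Y, n_keep):
    """Refit the sparse weights, then score the resulting LV against Y."""
    w = sparse_weights(X, Y, n_keep)
    return sum_sq_corr((X - X.mean(axis=0)) @ w, Y)


def permutation_null(X, Y, n_keep, n_perm, rng):
    """Null distribution: shuffle the rows of Y to break the X-Y link, then refit."""
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = criterion(X, Y[rng.permutation(len(Y))], n_keep)
    return null


# Simulated data (hypothetical): only the first 3 of 30 X variables drive Y.
rng = np.random.default_rng(0)
n, p, q = 100, 30, 5
X = rng.normal(size=(n, p))
Y = X[:, :3] @ np.ones((3, q)) + rng.normal(size=(n, q))

observed = criterion(X, Y, n_keep=3)
null = permutation_null(X, Y, n_keep=3, n_perm=200, rng=rng)
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
print(f"observed criterion = {observed:.2f}, permutation p-value = {p_value:.3f}")
```

The model is refitted on every permuted copy of the response block, so the null distribution reflects the selection step as well as the scoring step. With real omics data the number of retained variables would itself be tuned, which this sketch does not attempt.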

References

    1. Timpson NJ, Greenwood CMT, Soranzo N, Lawson DJ, Richards JB. Genetic architecture: the shape of the genetic contribution to human traits and disease. Nat Rev Genet. 2017;19(2):110–24. doi: 10.1038/nrg.2017.101.
    2. Karczewski KJ, Snyder MP. Integrative omics for health and disease. Nat Rev Genet. 2018;19(5):299–310. doi: 10.1038/nrg.2018.4.
    3. Huang S, Chaudhary K, Garmire LX. More Is Better: Recent Progress in Multi-Omics Data Integration Methods. Front Genet. 2017;8(JUN):1–12.
    4. Tenenhaus A, Tenenhaus M. Regularized generalized canonical correlation analysis for multiblock or multigroup data analysis. Eur J Oper Res. 2014;238(2):391–403. doi: 10.1016/j.ejor.2014.01.008.
    5. Tenenhaus A, Philippe C, Guillemot V, Le Cao K-A, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics. 2014;15(3):569–83. doi: 10.1093/biostatistics/kxu001.
