BMC Bioinformatics. 2016 Jun 6;17 Suppl 5(Suppl 5):180.
doi: 10.1186/s12859-016-1043-4.

Integration of multi-omics data for prediction of phenotypic traits using random forest

Animesh Acharjee et al. BMC Bioinformatics. 2016.

Abstract

Background: To find genetic and metabolic pathways related to phenotypic traits of interest, we analyzed gene expression data, metabolite data obtained with GC-MS and LC-MS, proteomics data and a selected set of tuber quality phenotypic data from a diploid segregating mapping population of potato. In this study we present an approach to integrate these ~omics data sets for the purpose of predicting phenotypic traits. The approach yields networks of relatively small sets of interrelated ~omics variables that can predict a quality trait of interest with higher accuracy.

Results: We used Random Forest regression to integrate multiple ~omics data sets for prediction of four quality traits of potato: tuber flesh colour, DSC onset, tuber shape and enzymatic discoloration. For tuber flesh colour, beta-carotene hydroxylase and zeaxanthin epoxidase were ranked first and forty-fourth, respectively; both have previously been associated with flesh colour in potato tubers. Combining all the significant genes, LC-peaks, GC-peaks and proteins, the variation explained was 75 %, only slightly more than what gene expression or LC-MS data explain by themselves, which indicates that there are correlations among the variables across data sets. When tuber shape was regressed on the gene expression, LC-MS, GC-MS and proteomics data sets separately, only the gene expression data were found to explain significant variation. For DSC onset, we found 12 gene expression features, 5 metabolite levels (GC-MS) and 2 proteins that are significantly associated with the trait. Using those 19 significant variables, the variation explained was 45 %. Expression QTL (eQTL) analyses showed many associations with genomic regions on chromosome 2, which also had the highest explained variation compared to other chromosomes. Transcriptomics and metabolomics analysis of enzymatic discoloration after 5 min yielded 420 significant genes and 8 significant LC metabolites, two of which were putatively identified as caffeoylquinic acid methyl ester and tyrosine.
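
The integration step described in the Results can be illustrated with a minimal sketch: concatenate the feature blocks, fit one random forest regression on the combined matrix, and read off an estimate of the variation explained plus a variable ranking. Everything in the example is an assumption made for illustration, not the authors' pipeline: the data are synthetic, the block sizes and the names Gene_j, LC_j, GC_j and Pro_j are hypothetical, scikit-learn's RandomForestRegressor stands in for whatever implementation was used, and the out-of-bag R^2 is only one possible measure of "variation explained". The abstract does not state how significance of variables was assessed, so the sketch ranks by impurity-based importance only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples = 150  # hypothetical number of genotypes

# Hypothetical omics blocks (rows = genotypes, columns = variables).
blocks = {
    "Gene": rng.normal(size=(n_samples, 30)),
    "LC": rng.normal(size=(n_samples, 20)),
    "GC": rng.normal(size=(n_samples, 15)),
    "Pro": rng.normal(size=(n_samples, 10)),
}

# Synthetic trait driven by one transcript and one LC-MS peak, plus noise.
trait = (
    2.0 * blocks["Gene"][:, 0]
    + 1.5 * blocks["LC"][:, 3]
    + rng.normal(scale=0.5, size=n_samples)
)

# Integrate the blocks by column-wise concatenation, keeping block-aware names.
X = np.hstack(list(blocks.values()))
names = [f"{block}_{j}" for block, m in blocks.items() for j in range(m.shape[1])]

# Fit one forest on the combined matrix; oob_score=True gives an
# out-of-bag R^2 without a separate test set.
forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, trait)

explained = forest.oob_score_  # out-of-bag R^2
ranking = sorted(
    zip(names, forest.feature_importances_), key=lambda pair: pair[1], reverse=True
)
top_features = [name for name, _ in ranking[:5]]

print(f"Out-of-bag R^2: {explained:.2f}")
print("Top-ranked variables:", top_features)
```

Fitting the same model on each block separately and comparing the out-of-bag R^2 values would mirror the per-data-set comparison reported above; a small gain from combining blocks is what one would expect when variables are correlated across data sets.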

Conclusions: In this study, we developed a strategy for selecting and integrating multiple ~omics data sets using the random forest method, and selected representative individual peaks for networks based on eQTL, mQTL or pQTL information. Network analysis was then used to interpret how a particular trait is associated with gene expression, metabolite and protein data.

Keywords: Data integration; Genetical genomics; Networks; Random forest.

Figures

Fig. 1
(a) A partial correlation network of the phenotypic trait tuber flesh colour (yellow) with gene expression features (red), metabolites from LC-MS (black), metabolites from GC-MS (purple) and proteins (green). Dotted lines represent negative partial correlation coefficients; solid lines represent positive partial correlation coefficients. Bch = beta-carotene hydroxylase; LC_X = metabolite derived from LC-MS with centrotype number X; GC_X = metabolite derived from GC-MS with centrotype number X; Gene_X = gene with gene ID X; Pro_X = protein with protein ID X. (b) The published part of the carotenoid pathway [37]; two of its genes, Bch and Zep, are identified by our data.
Fig. 2
A partial correlation network of tuber shape (yellow) with gene expression features (red). Dotted lines represent negative partial correlation coefficients; solid lines represent positive partial correlation coefficients. Gene_X = gene with gene ID X.
Fig. 3
A partial correlation network of DSC onset (yellow) with gene expression features (red), metabolites from GC-MS (purple) and proteins (green). Dotted lines represent negative partial correlation coefficients; solid lines represent positive partial correlation coefficients. GC_X = metabolite derived from GC-MS with centrotype number X; Gene_X = gene with gene ID X; Pro_X = protein with protein ID X.
Fig. 4
A partial correlation network of enzymatic discoloration (yellow) with gene expression features (red), metabolites from LC-MS (black) and proteins (green). Dotted lines represent negative partial correlation coefficients; solid lines represent positive partial correlation coefficients. LC_X = metabolite derived from LC-MS with centrotype number X; Gene_X = gene with gene ID X; Pro_X = protein with protein ID X.
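
The captions above describe partial correlation networks with solid (positive) and dotted (negative) edges. As a rough illustration of what such an edge means, the sketch below computes partial correlations from the inverse covariance (precision) matrix and turns them into signed edges. It is a sketch under stated assumptions, not the authors' procedure: the data are synthetic, the labels Gene_1 and LC_1 are hypothetical, the edge threshold of 0.2 is an arbitrary choice, and the captions do not say how the networks were estimated or filtered.

```python
import numpy as np


def partial_correlations(data):
    """Partial correlation of each pair of columns given all other columns.

    Uses the identity rho_ij = -P_ij / sqrt(P_ii * P_jj), where P is the
    inverse of the sample covariance matrix. Requires more rows than columns.
    """
    precision = np.linalg.inv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Synthetic chain: transcript -> metabolite -> trait (all values hypothetical).
rng = np.random.default_rng(1)
n = 2000
gene = rng.normal(size=n)
metabolite = -0.8 * gene + rng.normal(scale=0.6, size=n)
trait = 0.9 * metabolite + rng.normal(scale=0.5, size=n)

labels = ["Gene_1", "LC_1", "trait"]
data = np.column_stack([gene, metabolite, trait])
pcor = partial_correlations(data)

# Keep an edge when the absolute partial correlation exceeds a threshold
# (0.2 is an assumed cut-off); the sign decides the line style.
threshold = 0.2
edges = [
    (labels[i], labels[j], "solid" if pcor[i, j] > 0 else "dotted")
    for i in range(len(labels))
    for j in range(i + 1, len(labels))
    if abs(pcor[i, j]) > threshold
]

print("Marginal correlation Gene_1 vs trait:", round(np.corrcoef(gene, trait)[0, 1], 2))
print("Partial correlation Gene_1 vs trait:", round(pcor[0, 2], 2))
print(edges)
```

In this toy chain the transcript and the trait are strongly correlated marginally, but their partial correlation is close to zero once the metabolite is accounted for, so no direct edge is drawn between them. That is the reason a partial correlation network tends to be sparser and easier to interpret than a plain correlation network.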

References

    1. Fukushima A, Kusano M, Redestig H, Arita M, Saito K. Integrated omics approaches in plant systems biology. Curr Opin Chem Biol. 2009;13(5–6):532–538. doi: 10.1016/j.cbpa.2009.09.022.
    2. Kim TY, Kim HU, Lee SY. Data integration and analysis of biological networks. Curr Opin Biotech. 2010;21(1):78–84. doi: 10.1016/j.copbio.2010.01.003.
    3. Fukushima A, Kanaya S, Nishida K. Integrated network analysis and effective tools in plant systems biology. Front Plant Sci. 2014;5:598. doi: 10.3389/fpls.2014.00598.
    4. Brazma A, Vilo J. Gene expression data analysis. FEBS Lett. 2000;480(1):17–24. doi: 10.1016/S0014-5793(00)01772-5.
    5. Gaasterland T, Bekiranov S. Making the most of microarray data. Nat Genet. 2000;24(3):204–206. doi: 10.1038/73392.
