Nat Commun. 2016 Apr 13;7:11305.
doi: 10.1038/ncomms11305.

High-dimensional genomic data bias correction and data integration using MANCIE

Chongzhi Zang et al. Nat Commun.

Abstract

High-dimensional genomic data analysis is challenging because of noise and bias in high-throughput experiments. We present a computational method, matrix analysis and normalization by concordant information enhancement (MANCIE), for bias correction and data integration of distinct genomic profiles on the same samples. MANCIE uses a Bayesian-supported, principal component analysis-based approach to adjust the data so as to achieve better consistency between sample-wise distances in the different profiles. MANCIE can improve tissue-specific clustering in ENCODE data, prognostic prediction in Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) and The Cancer Genome Atlas (TCGA) data, and copy number and expression agreement in Cancer Cell Line Encyclopedia (CCLE) data, and it has broad applications in cross-platform, high-dimensional data integration.

Figures

Figure 1. Overview of MANCIE.
Each row vector in the adjusted matrix is generated from the corresponding row vectors in the main matrix and the associated matrix. On the basis of the correlation between the main row vector m_i and the associated row vector c_i, one of three scenarios is chosen. See the online methods for more details.
Figure 2. Case study on ENCODE data.
(a,b) Multi-dimensional scaling maps representing genomic data from 61 cell lines. Each data point represents a cell line, with its tissue type labelled in the same colour as in the legend. (a) Top, raw DHS data; bottom, MANCIE-adjusted DHS data. (b) Top, raw expression data; bottom, MANCIE-adjusted expression data. (c,d) Adjusted Rand index comparing K-means clustering on the data with actual tissue-type clustering. K-means clustering was performed 1,000 times with random seeds. The three boxes represent original data (blue), data MANCIE-adjusted with random data matrices (cyan) and data MANCIE-adjusted with the other data type (red). (c) DHS data; (d) gene-expression data. P values were calculated using the Wilcoxon rank sum test. (e) Relationship between the magnitude of MANCIE adjustment and the deviation of the GC-content distribution of DNase-seq reads. The magnitude of MANCIE adjustment was calculated as the Euclidean distance between the sample data vectors before and after MANCIE adjustment. The deviation refers to the distance from each sample's data point to the centre of mass in the mean versus coefficient-of-variation map of the GC-content distribution in Supplementary Fig. 2c. Labels in parentheses are the top sequence motifs enriched in the most increased DHS in the corresponding cell line after MANCIE adjustment.
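The two quantities named in this caption, the adjusted Rand index and the Euclidean magnitude of adjustment, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `adjusted_rand_index` and `adjustment_magnitude` and the toy labels and vectors are hypothetical, and only the standard library is assumed.

```python
import math
from collections import Counter


def adjustment_magnitude(before, after):
    """Euclidean distance between one sample's data vector before and after adjustment."""
    if len(before) != len(after):
        raise ValueError("vectors must have the same length")
    return math.dist(before, after)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two samples")

    def pairs(k):
        # Number of unordered pairs that can be drawn from k items.
        return k * (k - 1) // 2

    # Pairs placed together in both partitions (contingency-table cells).
    index = sum(pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    sum_a = sum(pairs(c) for c in Counter(labels_a).values())
    sum_b = sum(pairs(c) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / pairs(n)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2        # agreement for identical partitions
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every sample in one cluster in both
    return (index - expected) / (maximum - expected)


# Toy data (hypothetical): tissue labels versus cluster assignments.
tissue = ["blood", "blood", "liver", "liver"]
clusters = [0, 0, 1, 2]
ari = adjusted_rand_index(tissue, clusters)

# Toy data (hypothetical): one sample's vector before and after adjustment.
magnitude = adjustment_magnitude([0.0, 0.0], [3.0, 4.0])
```

The index is 1 when the clustering reproduces the tissue labels exactly (up to relabelling) and close to 0 when agreement is no better than chance, which is why it suits a comparison across 1,000 randomly seeded K-means runs.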
Figure 3. Case studies on METABRIC and TCGA data.
(a) Kaplan–Meier plots for an example showing the dichotomized risk scores from the original matrices (left) and the adjusted matrices (right) under a correlation threshold of 0.93 using the METABRIC data. Patient samples were separated into two groups according to the risk scores predicted from the selected genes. The high-risk group is labelled in red and the low-risk group in blue. The high-risk group is better separated from the low-risk group with the MANCIE-adjusted expression data (right) than with the original data (left). (b) P value scores (−log10 P value) in survival prediction using METABRIC gene-expression data, compared before and after MANCIE adjustment with CNV data. The gene selection thresholds are 0.7, 0.75, 0.8, 0.85, 0.9 and 0.93, from left to right and top to bottom. (c) Difference in P value scores (−log10 P value) in survival prediction with each gene signature using TCGA gene-expression data before and after adjustment by MANCIE or SVA. Gene signatures are labelled with the first author's name from the publication. Error bars represent the s.d. of the results from 1,000 random samples.
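As a rough illustration of the quantities in this caption, the following Python sketch splits samples by risk score, computes a Kaplan–Meier survival curve for a group, and converts a P value to the −log10 score. It is a sketch under stated assumptions, not the authors' pipeline: the caption does not say how the risk scores were dichotomized, so the `cutoff` argument is an assumption, and all function names and toy values are hypothetical.

```python
import math


def dichotomize(risk_scores, cutoff):
    """Split sample indices into (high, low) risk groups.

    The cutoff is an assumption of this sketch; the source does not state
    how the two groups were defined.
    """
    high = [i for i, s in enumerate(risk_scores) if s > cutoff]
    low = [i for i, s in enumerate(risk_scores) if s <= cutoff]
    return high, low


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at each event time.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the fraction of at-risk samples surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored samples both leave the risk set
    return curve


def p_value_score(p_value):
    """The -log10(P value) score used to compare survival predictions."""
    return -math.log10(p_value)


# Toy data (hypothetical): four samples, the second one censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
high, low = dichotomize([0.9, 0.2, 0.7, 0.1], cutoff=0.5)
score = p_value_score(0.001)
```

A larger −log10 score corresponds to a smaller P value, so an increase after adjustment indicates better separation of the two risk groups.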
Figure 4. Case study on CCLE/GDSC data.
(a,b) Correlation between CNV and RNA expression for the gene NDUFC2. The expression data were either raw CCLE data (a) or CCLE data MANCIE-adjusted with GDSC data (b). ρ refers to the Spearman correlation coefficient. (c) Distribution of the correlation difference before and after MANCIE adjustment, for all genes. The Spearman correlation coefficient between CNV and RNA expression was calculated for each gene, and the correlation difference is the difference between the coefficient with MANCIE adjustment and the coefficient without it. P value was calculated using the one-tailed paired t-test. (d) Distribution of the correlation difference comparing the raw data with SVA-adjusted expression data. P value was calculated using the one-tailed paired t-test.
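The per-gene comparison described here can be sketched in plain Python: compute the Spearman correlation between CNV and expression for each gene, once with raw and once with adjusted expression, and take the difference. This is an illustrative sketch only. The sign convention (adjusted minus raw) is an assumption, and the gene name `GENE_A`, the function names and the toy values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


def correlation_differences(cnv, raw_expr, adjusted_expr):
    """Per-gene difference in Spearman rho, adjusted minus raw (assumed sign)."""
    return {
        gene: spearman(cnv[gene], adjusted_expr[gene]) - spearman(cnv[gene], raw_expr[gene])
        for gene in cnv
    }


# Toy data (hypothetical): one gene measured across four cell lines.
cnv = {"GENE_A": [1, 2, 3, 4]}
raw_expr = {"GENE_A": [2, 1, 4, 3]}
adjusted_expr = {"GENE_A": [1, 2, 3, 5]}
diffs = correlation_differences(cnv, raw_expr, adjusted_expr)
```

Under this convention a positive difference means the adjusted expression agrees better with copy number than the raw expression does; the distribution of such differences over all genes is what panels (c) and (d) summarize.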

