J Multivar Anal. 2014 Jan 1;123:270-280.
doi: 10.1016/j.jmva.2013.09.011.

Integrative correlation: Properties and relation to canonical correlations

Leslie Cope et al. J Multivar Anal. 2014.

Abstract

The integrative correlation coefficient was developed to facilitate the validation of expression microarray results in public datasets by identifying genes that are reproducibly measured across studies, and even across microarray platforms. In the current study, we develop a number of mathematical and statistical properties of the integrative correlation coefficient, including a permutation-based null distribution with the unusual property that its variance does not shrink as the sample size increases. We discuss how these findings affect the use and interpretation of the coefficient, and what they imply for any method of identifying reproducible genes in a meta-analysis.
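
The abstract does not spell out how the coefficient is computed. As a rough illustration only, the sketch below assumes the commonly cited definition: for each gene, the Pearson correlation between its gene-gene correlation profile in one study and its profile in the other. The function names, the gene-label permutation scheme, and the `n_perm` and `seed` parameters are hypothetical choices made for this example, not the authors' implementation.

```python
import numpy as np

def integrative_correlation(xa, xb):
    """Per-gene integrative correlation (assumed definition).

    xa: array of shape (genes, n_a); xb: array of shape (genes, n_b).
    Rows must refer to the same genes in the same order.
    """
    ra = np.corrcoef(xa)  # gene-gene correlations in study A
    rb = np.corrcoef(xb)  # gene-gene correlations in study B
    g = ra.shape[0]
    off = ~np.eye(g, dtype=bool)  # mask that drops each gene's self-correlation
    a = ra[off].reshape(g, g - 1)
    b = rb[off].reshape(g, g - 1)
    # Row-wise Pearson correlation between the two correlation profiles
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    return (a * b).sum(axis=1) / np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))

def permutation_null(xa, xb, n_perm=20, seed=0):
    """Null values obtained by shuffling gene labels in study B (an assumption)."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_perm):
        shuffled = xb[rng.permutation(xb.shape[0])]
        draws.append(integrative_correlation(xa, shuffled))
    return np.concatenate(draws)
```

Under this definition, comparing a study with itself yields a coefficient of 1 for every gene, which is a convenient sanity check.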

Keywords: Bioinformatics; Correlation; Cross-study validation; Gene expression; Reproducibility; Statistics.


Figures

Fig. 1
The distribution of null integrative correlations is plotted in blue, with the 99th percentile marked as a vertical line. The observed integrative correlations are plotted in red by simulation group, and show the expected decline as the probes used in each study become more independent.
Fig. 2
T-statistics comparing expression samples for ER− breast tumors and ER+ breast tumors were calculated for each probe in each study. Probes are grouped according to ICC. Good probes have an ICC greater than the 99th percentile of the null distribution, while bad probes have ICC values below that threshold.
Fig. 3
In each of these plots, n_a × n_b points are laid out in a grid, according to the expression values of the gene in each study. The plotting points vary in size, in proportion to the square of the covariance of each pair of samples, but the same distribution of sizes is seen in each figure. We chose to use the square because it makes the points with the lowest covariances disappear completely, showing the patterns of co-expression for the remaining genes to better effect. The gene shown on the left has an integrative correlation near 0.5, and the largest points tend to fall along the main diagonal. The gene plotted on the right has a correlation near zero. The amount of white space around the margins is the most prominent feature of this plot; this is due both to a lack of points in those regions, and because the samples having the most extreme expression values for this gene in each study are very poorly correlated, so the few points that do fall around the margins are extremely small.
Fig. 4
Quantile–quantile plots compare the observed null distribution, on the x-axis, to the asymptotic Gaussian null, on the y-axis of each plot in the figure. The number of genes common to the two studies increases by row, from 300 to 3000, while the number of samples in each study increases by column, from 5 to 100.
Fig. 5
Each heat map shows TCGA lung cancer methylation data. Squamous cell carcinomas are on the left, adenocarcinomas on the right. In each, genes are represented as rows while samples are represented as columns. Methylation level is represented by color intensity on a yellow–blue scale, where bright yellow spots have very low methylation levels (≤20%), while bright blue spots have very high levels of methylation (≥70%). The top row of figures illustrates the difficulty that can arise when standard integrative correlations are used with these data. The genes with the highest integrative correlations, shown here, include many that are completely unmethylated in all samples, in which batch effects too small to be seen on this 4-color scale dominate the correlation structure. The bottom row includes a similar number of genes, again those having the highest integrative correlations, where this time the sample similarity matrix is calculated without scaling genes, so that gene contributions are proportional to their variance. The scatterplots in the rightmost column show the relationship between the s.d. of each probe, shown on the horizontal axis, and the integrative correlations on the vertical axis. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 6
The difference between filtered ICC and ordinary ICC for each gene is plotted on the vertical axis, the standard deviation on the horizontal axis.
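
The Fig. 2 caption describes a simple rule: probes whose ICC exceeds the 99th percentile of the null distribution are called good, and the rest bad. A minimal sketch of that rule follows; the function name `split_probes` and its arguments are hypothetical, and the null values are assumed to come from some permutation procedure supplied by the caller.

```python
import numpy as np

def split_probes(observed_icc, null_icc, pct=99.0):
    """Flag probes whose observed ICC exceeds a percentile of the null values.

    Returns the threshold and a boolean array (True = "good" probe).
    """
    observed_icc = np.asarray(observed_icc, dtype=float)
    threshold = np.percentile(null_icc, pct)  # 99th percentile by default
    return threshold, observed_icc > threshold
```

A probe sitting exactly on the threshold is treated as bad here, since the caption requires an ICC greater than the 99th percentile.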

