Integrative correlation: Properties and relation to canonical correlations

Leslie Cope¹, Daniel Q Naiman², Giovanni Parmigiani³

Affiliations

¹ The Sidney Kimmel Comprehensive Cancer Center, The Johns Hopkins University School of Medicine, United States.
² Department of Applied Math and Statistics, Whiting School of Engineering, The Johns Hopkins University, United States.
³ Department of Biostatistics and Computational Biology, Dana Farber Cancer Institute, United States ; Department of Biostatistics, Harvard School of Public Health, United States.

PMID: 26028790
PMCID: PMC4447241
DOI: 10.1016/j.jmva.2013.09.011

Leslie Cope et al. J Multivar Anal. 2014 Jan 1;123:270-280. doi: 10.1016/j.jmva.2013.09.011.

Abstract

The integrative correlation coefficient was developed to facilitate the validation of expression microarray results in public datasets by identifying genes that are reproducibly measured across studies, and even across microarray platforms. In the current study, we develop a number of interesting and important mathematical and statistical properties of the integrative correlation coefficient, including a unique permutation-based null distribution with the unusual property that its variance does not shrink as the sample size increases. We discuss how these findings affect its use and interpretation, and what they imply for any method for identifying reproducible genes in a meta-analysis.
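
The abstract does not restate how the coefficient is computed. As a rough sketch, assuming the definition commonly given for this method elsewhere (for each gene, correlate it with every other gene within each study, then correlate those two profiles across studies), it could look like the following. The function names, the list-of-rows data layout, and the use of Pearson correlation throughout are illustrative assumptions, not taken from this record.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def gene_profile(study, g):
    """Correlations of gene g with every other gene within one study.

    `study` is a list of rows, one row of expression values per gene.
    """
    return [pearson(study[g], study[h]) for h in range(len(study)) if h != g]

def integrative_correlations(study_a, study_b):
    """Assumed definition: one coefficient per gene, obtained by correlating
    the gene's within-study profile in study A with its profile in study B.
    Row g must refer to the same gene in both studies; the samples (columns)
    may differ in number and identity."""
    return [
        pearson(gene_profile(study_a, g), gene_profile(study_b, g))
        for g in range(len(study_a))
    ]

# Toy data: 4 genes measured on 5 samples in each study.
study_a = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 6],
    [5, 3, 4, 1, 2],
    [1, 3, 2, 5, 4],
]
# Study B holds the same samples in reverse order, so every gene-gene
# correlation is preserved and each gene should score close to 1.
study_b = [row[::-1] for row in study_a]
icc = integrative_correlations(study_a, study_b)
```

Because the coefficient compares gene-gene correlation profiles rather than raw expression values, the two studies need matching genes but not matching samples, which is what makes it usable across datasets and platforms.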

Keywords: Bioinformatics; Correlation; Cross-study validation; Gene expression; Reproducibility; Statistics.

Figures

**Fig. 1**
The distribution of null integrative correlations is plotted in blue, with the 99th percentile marked as a vertical line. The observed integrative correlations are plotted in red by simulation group, and show the expected decline as the probes used in each study become more independent.

**Fig. 2**
T-statistics comparing expression samples for ER− breast tumors and ER+ breast tumors were calculated for each probe in each study. Probes are grouped according to their integrative correlation coefficient (ICC). Good probes have an ICC greater than the 99th percentile of the null distribution, while bad probes have ICC values below that threshold.
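
The good/bad split described in this legend can be sketched as below, assuming the observed ICC values and a set of null ICC values are already available as plain lists. The names `percentile` and `split_probes`, the linear-interpolation percentile rule, and the choice to count ties as bad are illustrative assumptions; the legend only specifies the 99th-percentile cutoff.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def split_probes(icc, null_icc, q=99.0):
    """Return (cutoff, good, bad), where good and bad are probe indices.

    A probe is 'good' when its ICC exceeds the q-th percentile of the null
    values; ties are counted as 'bad' (an assumption)."""
    cutoff = percentile(null_icc, q)
    good = [i for i, v in enumerate(icc) if v > cutoff]
    bad = [i for i, v in enumerate(icc) if v <= cutoff]
    return cutoff, good, bad

# Hypothetical inputs, for illustration only.
null_icc = [i / 100 for i in range(101)]   # stand-in null values 0.00 .. 1.00
observed = [0.995, 0.5, 0.99]
cutoff, good, bad = split_probes(observed, null_icc)
```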

**Fig. 3**
In each of these plots, *n_a* × *n_b* points are laid out in a grid, according to the expression values of the gene in each study. The plotting points vary in size, in proportion to the square of the covariance of each pair of samples, but the same distribution of sizes is seen in each figure. We chose to use the square because it makes the points with the lowest covariances disappear completely, showing the patterns of co-expression for the remaining genes to better effect. The gene shown on the left has an integrative correlation near 0.5, and the largest points tend to fall along the main diagonal. The gene plotted on the right has a correlation near zero. The amount of white space around the margins is the most prominent feature of this plot; this is due both to a lack of points in those regions, and because the samples having the most extreme expression values for this gene in each study are very poorly correlated, so the few points that do fall around the margins are extremely small.

**Fig. 4**
Quantile–quantile plots compare the observed null distribution, on the x-axis, to the asymptotic Gaussian null, on the y-axis of each plot in the figure. The number of genes common to the two studies increases by row, from 300 to 3000, while the number of samples in each study increases by column, from 5 to 100.

**Fig. 5**
Each heat map shows TCGA lung cancer methylation data. Squamous cell carcinomas are on the left, adenocarcinomas on the right. In each, genes are represented as rows while samples are represented as columns. Methylation level is represented by color intensity on a yellow–blue scale, where bright yellow spots have very low methylation levels (≤20%), while bright blue spots have very high levels of methylation (≥70%). The top row of figures illustrates the difficulty that can arise when standard integrative correlations are used on these data. The genes with the highest integrative correlations, shown here, include many that are completely unmethylated in all samples, in which batch effects too small to be seen on this four-color scale dominate the correlation structure. The bottom row includes a similar number of genes, again those having the highest integrative correlations, but this time the sample similarity matrix is calculated without scaling genes, so that gene contributions are proportional to their variance. The scatterplots in the rightmost column show the relationship between the standard deviation of each probe, on the horizontal axis, and the integrative correlations, on the vertical axis. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

**Fig. 6**
The difference between filtered ICC and ordinary ICC for each gene is plotted on the vertical axis, the standard deviation on the horizontal axis.

Grants and funding

P30 CA006973/CA/NCI NIH HHS/United States
