Meta-Analysis
Clin Cancer Res. 2012 Nov 15;18(22):6136-46.
doi: 10.1158/1078-0432.CCR-12-1915. Epub 2012 Nov 7.

Expression profiling of archival tumors for long-term health studies

Levi Waldron et al. Clin Cancer Res.

Abstract

Purpose: More than 20 million archival tissue samples are stored annually in the United States as formalin-fixed, paraffin-embedded (FFPE) blocks, but RNA degradation during fixation and storage has prevented their use for transcriptional profiling. New and highly sensitive assays for whole-transcriptome microarray analysis of FFPE tissues are now available, but resulting data include noise and variability for which previous expression array methods are inadequate.

Experimental design: We present the two largest whole-genome expression studies from FFPE tissues to date, comprising 1,003 colorectal cancer (CRC) and 168 breast cancer samples, combined with a meta-analysis of 14 new and published FFPE microarray datasets. We develop and validate quality control (QC) methods through technical replication, independent samples, comparison to results from fresh-frozen tissue, and recovery of expected associations between gene expression and protein abundance.

Results: Archival tissues from large, multicenter studies showed a much wider range of transcriptional data quality relative to smaller or frozen tissue studies and required stringent QC for subsequent analysis. We developed novel methods for such QC of archival tissue expression profiles based on sample dynamic range and per-study median profile. This enabled validated identification of gene signatures of microsatellite instability and additional features of CRC, and improved recovery of associations between gene expression and protein abundance of MLH1, FASN, CDX2, MGMT, and SIRT1 in CRC tumors.

Conclusions: These methods for large-scale QC of FFPE expression profiles enable study of the cancer transcriptome in relation to extensive clinicopathological information, tumor molecular biomarkers, and long-term lifestyle and outcome data.

Conflict of interest statement

Conflicts of Interest: The authors declare no conflicts of interest.

Figures

Figure 1. Novel quality control methodology for gene expression data from archival tissues, validated using a technical study of primary and metastatic breast tumors and of autopsy tissue samples
(A) Raw gene expression intensities from studies employing FFPE tissues often possess highly variable dynamic ranges, which can serve as a reliable quality control measure. Lines indicate the 25th to 75th percentile of log-scaled transcript levels (interquartile range, IQR) for 117 samples passing preliminary quality control (10% of features present at p<0.01), sorted from lowest to highest IQR. (B) In order to detect poor-quality samples, each sample IQR is compared with its Spearman correlation to the study median. A combination of low resulting correlation with low IQR identifies poor quality data, which we specifically threshold at the point of maximum downward inflection of a Loess smoothing line. Autopsy samples (labelled “x”) are disproportionately of low quality. (C) The BC/A study included 44 technical replicates; we applied this IQR criterion to the samples, here showing the minimum IQR of each replicate pair compared with Spearman correlation of the replicates. Low reproducibility is seen between replicates falling below our IQR-based quality control threshold.
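The two per-sample quantities plotted in panel B can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function names (`qc_metrics`, `flag_low_quality`) and the `iqr_cutoff` parameter are hypothetical, the input is assumed to be log-scaled intensities in a common probe order, and the Loess-based choice of threshold described in the caption is not reproduced here (the cutoff is simply passed in).

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range (75th minus 25th percentile) of one sample."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))


def qc_metrics(samples):
    """Map each sample name to (IQR, Spearman correlation to study median).

    `samples` maps sample name -> list of log intensities, one per probe,
    with all samples sharing the same probe order (an assumption).
    """
    names = list(samples)
    n_probes = len(samples[names[0]])
    study_median = [median(samples[s][p] for s in names) for p in range(n_probes)]
    return {s: (iqr(samples[s]), spearman(samples[s], study_median)) for s in names}


def flag_low_quality(metrics, iqr_cutoff):
    """Names of samples whose IQR falls below a caller-supplied cutoff.

    The cutoff is hypothetical here; the caption derives it from a Loess fit.
    """
    return [s for s, (sample_iqr, _) in metrics.items() if sample_iqr < iqr_cutoff]
```

Calling `qc_metrics` on a dictionary of samples returns one `(IQR, correlation)` pair per sample, which corresponds to one point in the scatter described for panel B.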
Figure 2. Population-scale gene expression studies of archival tissue samples possess wide dynamic ranges that must be quality controlled
Dynamic ranges of expression intensities from 20 datasets in 14 independent studies are shown; these studies are summarized in Table 1. Two y-axes are used, for the Illumina and Affymetrix platforms, as these technologies provide intensities on different scales that should not be directly compared and do not indicate a between-platform quality difference. Box width is proportional to the square root of study sample size; box limits indicate per-study interquartile range of the individual, per-sample interquartile ranges of intensities for samples in that study. Thus short boxes indicate relatively uniform sample quality and dynamic range, while taller boxes indicate studies containing a wide range of data quality. Smaller, focused studies are often of atypically high quality and do not capture the range of sample qualities observed in population-scale studies of diverse clinical samples, as seen in these two large studies of breast cancer and autopsy samples (BC/A) and colorectal tumors from the NHS/HPFS prospective cohorts (CRC).
Figure 3. Concordance At the Top boxplot demonstrating reproducibility of identification of genes differentially expressed between colon and rectum, with varying quality control and data pre-processing approaches
Even under ideal conditions, the identification of any “top-n” gene list differentially expressed with respect to a phenotype is affected by random sampling variation. This plot quantifies how much our quality control measures improve such lists over randomly selected gene lists as a baseline. We identified 330 genes associated with CRC by inclusion in at least two CRC gene signatures from the geneSigDB database (28). We ranked these genes by fold-change between colon and rectum tumor location in two equal, independent subsets of the CRC cohort. The fractional overlap (concordance) of the top n genes in each list was calculated as a function of n, and the process was repeated for 100 random splits of the samples to obtain the distribution of concordances shown in the boxplot. The diagonal dashed line indicates the baseline concordance of random gene lists. The distance above this line thus indicates the reproducibility of differentially expressed gene identification for poor samples only (rejected by our QC), permissive QC (samples with IQR>0), strict QC (threshold automatically determined as in Figure 1), and probe normalization (Illumina VST+RSN (29) and imputation (30)) schemes. IQR-based quality control removes samples that independently show no reproducibility and improves reproducibility from the remaining samples, whereas alternative probe normalization provided modest benefit beyond quantile normalization. We observed similar results for other CRC tumor phenotypes (Supplemental Figure S5), emphasizing the importance of strict QC for reproducible biological inference.
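The concordance calculation described above can be illustrated with a short sketch. The function names are hypothetical and the two inputs are assumed to be rankings of the same gene set, most differentially expressed first. The random baseline follows from the fact that two independent random top-n selections from N genes share n²/N genes on average, so their expected fractional overlap is n/N, which is the diagonal line in the plot.

```python
def concordance_at_top(ranked_a, ranked_b, n):
    """Fraction of the top-n genes shared by two ranked gene lists."""
    return len(set(ranked_a[:n]) & set(ranked_b[:n])) / n


def cat_curve(ranked_a, ranked_b):
    """Concordance for every list length n = 1 .. len(ranked_a)."""
    return [
        concordance_at_top(ranked_a, ranked_b, n)
        for n in range(1, len(ranked_a) + 1)
    ]


def random_baseline(n, total_genes):
    """Expected concordance of two independent random rankings."""
    return n / total_genes
```

Repeating `cat_curve` over many random splits of the samples, with the genes re-ranked within each half, would give the distribution of concordances at each n that a boxplot summarizes.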
Figure 4. Reproducibility of individual probe intensity measurements can be assessed by dynamic range across samples
Nine pairs of replicate samples passing quality control in the BC/A study were quantile normalized independently, and one set of replicates was used to bin probes by standard deviation across samples. Correlation of these values was then assessed in the second set of replicates. Probes with higher standard deviation of expression values showed correspondingly higher Spearman correlation between technical replicates, indicating that removal of low variance probes prior to biological analyses can improve measurement reproducibility.
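Removing low-variance probes before analysis, as suggested above, might look like the following sketch. It is an illustration only: the function names and the `keep_fraction` parameter are hypothetical, and the caption does not state which cutoff the authors used. The input is assumed to map each probe to its normalized log intensities across samples.

```python
from statistics import stdev


def probe_sd(expr):
    """Standard deviation of each probe across samples."""
    return {probe: stdev(values) for probe, values in expr.items()}


def keep_high_variance(expr, keep_fraction=0.5):
    """Keep the most variable probes (hypothetical fraction-based cutoff)."""
    sds = probe_sd(expr)
    ranked = sorted(sds, key=sds.get, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = set(ranked[:n_keep])
    return {probe: values for probe, values in expr.items() if probe in kept}
```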
Figure 5. Sample- and probe-level quality control methods improve the accuracy and reproducibility of genes differentially expressed with respect to the microsatellite instability phenotype
(A) Identification of previously established microsatellite instability (MSI) associated genes in whole-genome differential expression analysis improves with strict IQR-based QC. Top differentially expressed genes in the NHS/HPFS samples are well-identified independently of QC, but the detection of more moderately differentially expressed genes is improved by strict QC. (B) Identification (by t-statistic) of 43 published MSI-High and MSI-Low associated genes (32) improves as probe standard deviation is used for quality control in NHS/HPFS samples. High-variance probes and higher dynamic range samples thus not only show better technical reproducibility, but are also more likely to provide differential expression concordant with independent, fresh-frozen tissues.

References

    1. Williams PM, Li R, Johnson NA, Wright G, Heath J-D, Gascoyne RD. A Novel Method of Amplification of FFPET-Derived RNA Enables Accurate Disease Classification with Microarrays. The Journal of Molecular Diagnostics. 2010;12:680–686.
    2. Ogino S, Chan A, Fuchs C, Giovannucci E. Molecular pathological epidemiology of colorectal neoplasia: an emerging transdisciplinary and interdisciplinary field. Gut. 2011;60:397–411.
    3. Ogino S, Galon J, Fuchs C, Dranoff G. Cancer immunology--analysis of host and tumor factors for personalized medicine. Nature Reviews Clinical Oncology. 2011;8:711–720.
    4. Lewis F, Maughan NJ, Smith V, Hillan K, Quirke P. Unlocking the archive – gene expression in paraffin-embedded tissue. The Journal of Pathology. 2001;195:66–71.
    5. Reinholz MM, Eckel-Passow JE, Anderson SK, Asmann YW, Zschunke MA, Oberg AL, et al. Expression profiling of formalin-fixed paraffin-embedded primary breast tumors using cancer-specific and whole genome gene panels on the DASL platform. BMC Medical Genomics. 2010;3:60.
