Genome Biol. 2014 Dec 3;15(12):523.
doi: 10.1186/s13059-014-0523-y.

An investigation of biomarkers derived from legacy microarray data for their utility in the RNA-seq era

Zhenqiang Su et al. Genome Biol.

Abstract

Background: Gene expression microarrays have been the primary biomarker platform in biomedical research and are ubiquitously applied, resulting in an enormous accumulation of data, predictive models, and biomarkers. Recently, RNA-seq has looked likely to replace microarrays, but there will be a period where both technologies co-exist. This raises two important questions: Can microarray-based models and biomarkers be directly applied to RNA-seq data? Can future RNA-seq-based predictive models and biomarkers be applied to microarray data to leverage past investment?

Results: We systematically evaluated the transferability of predictive models and signature genes between microarray and RNA-seq using two large clinical data sets. The complexity of cross-platform sequence correspondence was considered in the analysis and examined using three human and two rat data sets, and three levels of mapping complexity were revealed. Three algorithms representing different modeling complexity were applied to the three levels of mappings for each of the eight binary endpoints and Cox regression was used to model survival times with expression data. In total, 240,096 predictive models were examined.

Conclusions: Signature genes of predictive models are reciprocally transferable between microarray and RNA-seq data for model development, and microarray-based models can accurately predict RNA-seq-profiled samples; while RNA-seq-based models are less accurate in predicting microarray-profiled samples and are affected both by the choice of modeling algorithm and the gene mapping complexity. The results suggest continued usefulness of legacy microarray data and established microarray biomarkers and predictive models in the forthcoming RNA-seq era.

Figures

Figure 1
Flowcharts for evaluating the cross-platform transferability of signature genes and predictive models. Two analysis procedures were applied to evaluate the transferability of signature genes (a) and predictive models (b). In (a), microarray training data are used to develop 500 trained models through (c) to predict the microarray validation samples. The signature genes of each model are then used with the RNA-Seq training data to build an untrained RNA-Seq model through (d) to predict the RNA-Seq validation samples. The performance of the microarray models is finally compared to that of the RNA-Seq models. The transferability of signature genes from RNA-Seq back to microarray data is evaluated conversely. In (b), both microarray and RNA-Seq data are z-scored prior to model development. Microarray training data are then used to develop 500 trained models to predict both microarray and RNA-Seq validation samples. The performance of the models in predicting microarray data is compared to that in predicting RNA-Seq data. The direction from RNA-Seq back to microarray is examined conversely. A trained model is developed through (c). Briefly, training samples are randomly split in a 70/30 ratio. For each split, a series of models is developed using 70% of the training samples to predict the remaining 30%. The models are developed as follows: (1) all genes are first filtered with a t-test (P < 0.05) and then ranked by fold change (FC); (2) sequential forward feature selection in steps of two, combined with a parameter selection strategy, is then used to build a number of models to predict the remaining samples. Finally, the signature genes and parameters of the best model are used with all training samples to build a trained model. An untrained model (d) is built using all training samples from one platform but with the signature genes and parameters of a model trained on the other platform.
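The model development loop in the Figure 1 caption (70/30 split, fold-change ranking, forward selection in steps of two) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names are hypothetical, the t-test filter and the parameter selection step are omitted to keep the sketch within the standard library, and a nearest-centroid classifier stands in for the algorithms used in the study.

```python
import random
from statistics import mean

def split_70_30(n_samples, rng):
    """Randomly split sample indices into 70% (fit) and 30% (held out)."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    cut = int(round(0.7 * n_samples))
    return idx[:cut], idx[cut:]

def rank_by_fold_change(X, y, rows):
    """Rank genes by absolute difference of class means (log2 fold change).

    X is a list of samples, each a list of log2 expression values;
    y holds the binary labels (0 or 1). The t-test filter is omitted here.
    """
    def abs_log_fc(g):
        pos = [X[i][g] for i in rows if y[i] == 1]
        neg = [X[i][g] for i in rows if y[i] == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(range(len(X[0])), key=abs_log_fc, reverse=True)

def centroid_predict(X, y, train, genes, sample):
    """Nearest-centroid stand-in classifier (an assumption, not the paper's).

    Both classes must be present in `train`.
    """
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroid = [mean(X[i][g] for i in members) for g in genes]
        dist = sum((X[sample][g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def forward_select(X, y, train, test, max_genes=10, step=2):
    """Grow the signature two genes at a time; keep the best-scoring size."""
    ranked = rank_by_fold_change(X, y, train)
    best_genes, best_acc = [], -1.0
    for size in range(step, max_genes + 1, step):
        genes = ranked[:size]
        hits = sum(centroid_predict(X, y, train, genes, s) == y[s] for s in test)
        acc = hits / len(test)
        if acc > best_acc:  # ties keep the smaller signature
            best_genes, best_acc = genes, acc
    return best_genes, best_acc
```

In the procedure described in the caption, the signature returned for the best model would then be refit on all training samples to give the trained model, and the same genes could be reused on the other platform's training data to give the untrained model.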
Figure 2
Summary of the transferability of signature genes and predictive models between microarray and RNA-Seq data. Panel (a) shows, separately for the three gene mappings A, B, and C, whether the parameters and signature genes of a model developed on one platform (microarray or RNA-Seq) can be used to build a model from data generated on the other platform (RNA-Seq or microarray). Panels (b) and (c) summarize, for gene mappings A and B, whether a predictive model developed on one platform can be directly used to accurately predict samples profiled on the other platform, with and without per-sample z-scoring of the data, respectively. Green and red arrows indicate good and poor transferability from one platform to the other, respectively.
Figure 3
The strategy for cross-platform gene mapping and the consistency of cross-platform gene expression measurements. The microarray probes/probe sets are mapped to RNA-Seq genes in one of two ways: public gene ID mapping or genome location mapping (a). Using the gene ID mapping approach requires that one of the following public gene IDs be available: gene symbol, RefSeq transcript ID, Ensembl gene ID, or Entrez gene ID. Using the genome location mapping requires an RNA-Seq gene annotation file in either the Gene Transfer Format (GTF) or the General Feature Format (GFF). The process produces separate mapping lists for microarrays and RNA-Seq. Each of them consists of A, B, C, and D groups. Group A for microarrays corresponds to the group A in RNA-Seq. The microarray group B is a subset of RNA-Seq group C, and vice versa. The D group for microarrays and for RNA-Seq contain genes and probes/probe sets that cannot be mapped between the two platforms. The intensities of Affymetrix microarray probe sets in mapping groups A, B, and C are separately compared to those of RNA-Seq gene counts in panels (b), (c), and (d) for one of the eight RNA samples in the NCTR toxicogenomics data set. The microarray data are from Rat_230_2 arrays normalized with the MAS5 algorithm, and the RNA-Seq reads are from the Illumina GA II platform with the single-end 36 base pairs RNA-Seq protocol and gene counts from the P2 pipeline (Novoalign with RefSeq rat gene models). The mappings from microarray probe sets to RNA-Seq genes are based on the genome location mapping approach.
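The relationship between the mapping groups can be illustrated with a small sketch. The caption does not define the groups, so the definitions below are an assumption chosen to be consistent with it: A is a one-to-one match, B is an item with a single partner that is itself matched to several items, C is an item matched to several partners, and D is unmapped. The function name and identifiers are hypothetical.

```python
from collections import defaultdict

def classify_mappings(pairs, probes, genes):
    """Assign each probe set and each gene to group A, B, C, or D.

    `pairs` is an iterable of (probe_set, gene) matches. The group
    definitions are assumed for illustration, not taken from the paper.
    """
    probe_to_genes = defaultdict(set)
    gene_to_probes = defaultdict(set)
    for probe, gene in pairs:
        probe_to_genes[probe].add(gene)
        gene_to_probes[gene].add(probe)

    def group(item, forward, reverse):
        partners = forward.get(item, set())
        if not partners:
            return "D"  # no match on the other platform
        if len(partners) > 1:
            return "C"  # matched to several partners
        (partner,) = partners
        # single partner: A if the match is mutual and unique, otherwise B
        return "A" if len(reverse[partner]) == 1 else "B"

    probe_groups = {p: group(p, probe_to_genes, gene_to_probes) for p in probes}
    gene_groups = {g: group(g, gene_to_probes, probe_to_genes) for g in genes}
    return probe_groups, gene_groups
```

Under these assumed definitions, a probe set in microarray group B always points at a gene in RNA-Seq group C, and a gene in RNA-Seq group B always points at a probe set in microarray group C, which matches the subset relation stated in the caption.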
Figure 4
The percentages of probe sets in mapping groups A, B, C, and D. The percentages of Affymetrix probe sets in four mapping groups A, B, C, and D for the six RNA-Seq gene sets are shown in stacked bar charts. The data set comprises 62 Affymetrix Rat_230_2 arrays and 62 RNA-Seq assays from the same set of 62 rat liver RNA samples. The microarray data were normalized with MAS5, and the same RNA-Seq raw data were analyzed by six independent data analysis teams with a variety of analysis pipelines, that is, P1 (NCBI Magic), P2 (Novoalign with RefSeq gene models), P3 (Bwa + RefSeq RNAs), P4 (Tophat + HTSeq with RefSeq gene models), P5 (Bowtie + RSEM with Ensembl gene models), and P6 (Tophat + cufflinks de novo assembly). The Affymetrix probe sets (31,099 in total) were separately mapped to the six RNA-Seq gene sets. The mappings to P1, P2, P3, and P4 gene sets are based on the gene ID mapping approach, while mappings to P5 and P6 gene sets are based on the genome location mapping.
Figure 5
A performance comparison of k-nearest neighbors (k-NN) models and their corresponding transferred models. The comparison is based on the SEQC NB data set. For each of the six binary clinical endpoints and each of the three mapping groups A, B, and C, a set of 500 k-NN models was developed from microarray training data and used to predict microarray validation samples. The k parameter and signature genes of each of the 500 microarray models were then used with all RNA-Seq training data for those genes to build an untrained RNA-Seq model to predict RNA-Seq validation samples. Finally, the average prediction accuracies of the 500 microarray models are plotted against those of the 500 corresponding RNA-Seq models (a), with the per sample agreement better than chance given by the Kappa statistic as shown in (b). The transferability of the signature genes from RNA-Seq back to microarray data was conversely calculated. The 500 k-NN models trained from RNA-Seq data were used to predict RNA-Seq validation samples. Then the k parameter and signature genes of each RNA-Seq model were used with all microarray training data for those genes to build a microarray model to predict microarray validation samples. The average accuracies of the 500 RNA-Seq models are compared to those of the 500 corresponding microarray models (c), with the per sample agreement better than chance given by the Kappa statistic as shown in (d). The six symbols in each panel represent the six binary clinical endpoints with green, blue, and orange colors denoting mapping groups A, B, and C, respectively. In panels (b) and (d), each symbol denotes the average Kappa statistic for the 500 pairs of k-NN models; and each error bar shows the 95% confidence interval (CI) for the mean Kappa statistic. Each CI was calculated with the bootstrap estimation.
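The agreement measure and its confidence interval can be sketched as below. The caption names only "the Kappa statistic" and "bootstrap estimation"; this sketch assumes Cohen's kappa on paired per-sample predictions and a percentile bootstrap over the per-pair kappa values. Function names and default parameters are illustrative.

```python
import random
from statistics import mean

def cohen_kappa(pred_a, pred_b):
    """Cohen's kappa: agreement between two label lists beyond chance."""
    n = len(pred_a)
    observed = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    labels = set(pred_a) | set(pred_b)
    # chance agreement from the marginal label frequencies of each list
    expected = sum((pred_a.count(l) / n) * (pred_b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both lists contain one and the same label
    return (observed - expected) / (1.0 - expected)

def bootstrap_ci_mean(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

For one endpoint and mapping group, `cohen_kappa` would be applied to each of the 500 model pairs and `bootstrap_ci_mean` to the resulting list of kappa values.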
Figure 6
A performance comparison of k-nearest neighbors (k-NN) in predicting microarray and RNA-Seq validation samples. The comparison is based on the SEQC NB data set. In the comparison, both microarray log2 intensity data and RNA-Seq log2 counts were per sample z-scored. For each of the six binary clinical endpoints and each of the two mapping groups A and B, a set of 500 k-NN models were developed from microarray and RNA-Seq training data independently. Each set of k-NN models were then used to predict both microarray and RNA-Seq validation samples. The average prediction accuracies of the 500 microarray k-NN models in predicting microarray data are plotted against those in predicting RNA-Seq data (a), with the per sample agreement better than chance evaluated with the Kappa statistic as shown in (b); while the average accuracies of the 500 RNA-Seq k-NN models in predicting RNA-Seq data are compared to those in predicting microarray data (c), with the per sample agreement better than chance assessed with the Kappa statistic as shown in (d). The six symbols in each panel represent the six binary clinical endpoints with green and blue colors denoting mapping groups A and B, respectively. In panels (b) and (d), each symbol denotes the average Kappa statistic of the 500 pairs of prediction results; and each error bar shows the 95% confidence interval (CI) for the mean Kappa statistic. Each CI was calculated with the bootstrap estimation.
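Per-sample z-scoring, as mentioned in the caption, can be sketched as follows. The idea is that each sample's log2 values are centered and scaled using that sample's own mean and standard deviation, which removes platform-specific offset and scale. Whether the study used the sample or the population standard deviation is not stated; the sample standard deviation is assumed here, and the function name is hypothetical.

```python
from statistics import mean, stdev

def zscore_per_sample(log2_values):
    """Standardize one sample's log2 expression values across its genes."""
    mu = mean(log2_values)
    sd = stdev(log2_values)  # sample standard deviation (assumed choice)
    if sd == 0:
        raise ValueError("sample has zero variance across genes")
    return [(v - mu) / sd for v in log2_values]

# Two hypothetical profiles of the same sample that differ only by an
# affine change of scale give the same z-scores.
microarray = [2.0, 4.0, 6.0, 8.0]
rnaseq = [3.0 * v + 1.0 for v in microarray]
z_array = zscore_per_sample(microarray)
z_seq = zscore_per_sample(rnaseq)
```

Real microarray and RNA-Seq measurements are not related by an exact affine map, so z-scoring only brings the two platforms onto a comparable scale rather than making them identical.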
Figure 7
A performance comparison of k-nearest neighbors (k-NN) models and their corresponding transferred models based on the TCGA AML data. For each of the two binary clinical endpoints and each of the three mapping groups A, B, and C, a set of 500 k-NN models were developed from microarray training data and used to predict microarray validation samples. The signature genes of each of the 500 microarray models were then used with all RNA-Seq training data for those genes to build an untrained RNA-Seq model to predict RNA-Seq validation samples. Finally, the average prediction accuracies of the 500 microarray models are plotted against those of the 500 corresponding RNA-Seq models (a), with the per sample agreement better than chance evaluated with the Kappa statistic as shown in (b). The transferability of the signature genes from RNA-Seq back to microarray data was conversely calculated. The 500 k-NN models trained from RNA-Seq data were used to predict RNA-Seq validation samples. Then the signature genes of each RNA-Seq model were used with all microarray training data for those genes to build an untrained k-NN model to predict microarray validation samples. The average accuracies of the 500 RNA-Seq models were then compared to those of the 500 corresponding microarray models (c), with the per sample agreement better than chance assessed with the Kappa statistic as shown in (d). The two symbols in each panel represent the two binary clinical endpoints with green, blue, and orange colors denoting mapping groups A, B, and C, respectively. In panels (b) and (d), each symbol denotes the average Kappa statistic of the 500 pairs of model predictions; and each error bar shows the 95% confidence interval (CI) for the mean Kappa statistic. Each CI was calculated with the bootstrap estimation. 
No significant difference is observed between the trained microarray models and the transferred RNA-Seq models (paired t-test P = 0.366) or between the trained RNA-Seq models and the transferred microarray models (paired t-test P = 0.269).
Figure 8
A performance comparison of k-nearest neighbors (k-NN) models in predicting microarray and RNA-Seq validation data based on the TCGA AML data. In the comparison, both microarray log2 intensity and RNA-Seq log2 count were per sample z-scored. For each of the two binary clinical endpoints and each of the two mapping groups A and B, a set of 500 k-NN models was developed from microarray and RNA-Seq training data independently. Each set of k-NN models was then used to predict both microarray and RNA-Seq validation samples. The average prediction accuracies of the 500 microarray-based models in predicting microarray data were plotted against those in predicting RNA-Seq data (a), with per sample agreement better than chance assessed with the Kappa statistic as shown in (b); while the average accuracies of the 500 RNA-Seq-based models in predicting RNA-Seq data were compared to those in predicting microarray data (c), with per sample agreement better than chance evaluated with the Kappa statistic as shown in (d). The two symbols in each panel represent the two binary clinical endpoints with green and blue colors denoting mapping groups A and B, respectively. In panels (b) and (d), each symbol denotes the average Kappa statistic of 500 pairs of prediction results; and each error bar shows the 95% confidence interval (CI) for the mean Kappa statistic. Each CI was calculated with the bootstrap estimation.
