BMC Bioinformatics. 2009 Dec 30;10:453.

doi: 10.1186/1471-2105-10-453.

Classification across gene expression microarray studies

Andreas Buness¹, Markus Ruschhaupt, Ruprecht Kuner, Achim Tresch

PMID: 20042109
PMCID: PMC2811711
DOI: 10.1186/1471-2105-10-453

Andreas Buness et al. BMC Bioinformatics. 2009.

Affiliation

¹ German Cancer Research Center (DKFZ), Department of Molecular Genome Analysis, 69120 Heidelberg, Germany. a.buness@gmx.de

Abstract

Background: The increasing number of gene expression microarray studies represents an important resource in biomedical research. As a result, gene expression-based diagnosis has entered clinical practice for patient stratification in breast cancer. However, the integration and combined analysis of microarray studies remains a challenge. We assessed the potential benefit of data integration for classification accuracy and systematically evaluated the generalization performance of selected methods on four breast cancer studies comprising almost 1000 independent samples. To this end, we introduced an evaluation framework that aims to establish good statistical practice and provides a graphical way to monitor differences. The classification goal was to correctly predict the estrogen receptor status (negative/positive) and histological grade (low/high) of each tumor sample in an independent study that was not used for training. For the classification we chose support vector machines (SVM), predictive analysis of microarrays (PAM), random forest (RF) and k-top scoring pairs (kTSP). Guided by considerations relevant for classification across studies, we developed a generalization of kTSP, which we evaluated in addition. Our derived version (DV) aims to improve the robustness of the intrinsic invariance of kTSP with respect to technologies and preprocessing.
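The invariance of kTSP mentioned above comes from the fact that the classifier only compares the relative ordering of two genes within a sample. The following is a minimal sketch of a plain top-scoring-pairs classifier, not the authors' derived version (DV); the function names `fit_ktsp` and `predict_ktsp`, the fixed `k`, and the omission of tie-breaking are assumptions made for illustration.

```python
import numpy as np

def fit_ktsp(X, y, k=3):
    """Pick k disjoint gene pairs whose ordering differs most between classes.

    X: (samples, genes) expression matrix; y: binary labels (0/1).
    Illustrative only: builds a (samples, genes, genes) array, so it is
    meant for small toy inputs, and ties between scores are not broken.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # less[c][i, j] = fraction of class-c samples with gene i < gene j
    less = [(X[y == c][:, :, None] < X[y == c][:, None, :]).mean(axis=0)
            for c in (0, 1)]
    delta = less[1] - less[0]  # > 0: "i < j" is more frequent in class 1
    iu, ju = np.triu_indices(X.shape[1], k=1)
    order = np.argsort(-np.abs(delta[iu, ju]), kind="stable")
    pairs, used = [], set()
    for idx in order:
        i, j = int(iu[idx]), int(ju[idx])
        if i in used or j in used:
            continue  # keep the selected pairs disjoint
        # orient the pair so that "first < second" votes for class 1
        pairs.append((i, j) if delta[i, j] > 0 else (j, i))
        used.update((i, j))
        if len(pairs) == k:
            break
    return pairs

def predict_ktsp(pairs, X):
    """Majority vote over the pairwise comparisons (use an odd k)."""
    X = np.asarray(X, dtype=float)
    votes = np.array([X[:, i] < X[:, j] for i, j in pairs]).sum(axis=0)
    return (2 * votes > len(pairs)).astype(int)

# Toy data: gene 0 < gene 1 in class 1, gene 0 > gene 1 in class 0.
X_toy = np.array([[1, 2, 5], [2, 3, 1], [0, 4, 2],
                  [3, 1, 4], [5, 2, 0], [4, 0, 3]])
y_toy = np.array([1, 1, 1, 0, 0, 0])
toy_pairs = fit_ktsp(X_toy, y_toy, k=1)
```

Because prediction depends only on within-sample comparisons, applying any strictly increasing transformation to the data (for example `np.exp(X_toy)`) leaves the predictions unchanged, which is the property that makes this family of classifiers attractive across technologies and preprocessing schemes.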

Results: For each individual study, the generalization error was benchmarked via complete cross-validation and was found to be similar for all classification methods. The misclassification rates were substantially higher in classification across studies, in which each single study was used as an independent test set while all remaining studies were combined to train the classifier. However, as the number of independent microarray studies used in training increased, the overall classification performance improved. DV performed better than the average and showed slightly less variance. In particular, the better predictive results of DV in cross-platform classification indicate higher robustness of the classifier when trained on single-channel data and applied to gene expression ratios.

Conclusions: We present a systematic evaluation of strategies for integrating independent microarray studies in a classification task. Our findings on classification across studies may guide further research aimed at constructing more robust and reliable methods for stratification and diagnosis in clinical practice.

Figures

**Figure 1**
**Schematic overview of the systematic approach to assess the classification performance across independent data sets**. The role of each data set is exhaustively alternated between training and testing. N equals 4 in case of the estrogen receptor status and equals 3 in case of the histological grade, see Table 1.
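One way to read the exhaustive alternation described in this caption is sketched below: each study is held out once as the test set, and every non-empty combination of the remaining studies serves as a training set. This is an assumption about the scheme based on the captions alone; the helper name `cross_study_splits` is hypothetical.

```python
from itertools import combinations

def cross_study_splits(studies):
    """Yield (training_studies, test_study) for every held-out study and
    every non-empty combination of the remaining studies."""
    for test in studies:
        rest = [s for s in studies if s != test]
        for size in range(1, len(rest) + 1):
            for train in combinations(rest, size):
                yield train, test

# N = 4 studies (estrogen receptor status) and N = 3 (histological grade)
er_splits = list(cross_study_splits([1, 2, 3, 4]))
grade_splits = list(cross_study_splits([1, 2, 3]))
```

Under this reading, grouping the splits by `len(train)` gives the subgroups by number of training studies; the largest group size (3 for N = 4, 2 for N = 3) corresponds to training on all studies except the one held out.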

**Figure 2**
**The misclassification error for each of the five classification methods and each of the studies (N = 4) is shown for the estrogen receptor status**. The plotted numbers in distinct colors indicate the study as listed in Table 1 while pointing to the corresponding misclassification rate. A: The misclassification rate was estimated with complete cross-validation in each study separately. B: The misclassification rate is shown for each training set combination subgrouped by the number of studies used in the training (see Figure 1). Dotted lines indicate averages across classification methods.

**Figure 3**
**The misclassification error for each of the five classification methods and each of the studies (N = 3) is shown for histological grade**. The plotted numbers in distinct colors indicate the study as listed in Table 1 while pointing to the corresponding misclassification rate. A: The misclassification rate was estimated with complete cross-validation in each study separately. B: The misclassification rate is shown for each training set combination subgrouped by the number of studies used in the training (see Figure 1). Dotted lines indicate averages across classification methods.

**Figure 4**
The figure summarizes the main classification results, detailing class-, data set- and sample-specific prediction performance (A: estrogen receptor status in four studies; B: histological grade in three studies). Samples correspond to rows and methods to columns. The estimates of the cross-validation approach are shown on the left, separated by a vertical line from the results of the classification across studies on the right, where the number of training sets was maximal (A: 3, B: 2). The graphical representation is similar to a heatmap. A misclassified sample is shown in red and a correctly classified sample in light yellow. The error estimates of the repeated cross-validation have been mapped to the range from red to light yellow for each individual sample. The cross-validation approach was run separately for each study. For the classification across studies, the results shown are those in which all studies except the one used for assessment formed the training set (see Additional Files 3 and 4 for the results of all training set combinations). Samples are ordered by study, by class, and by their average misclassification rate in the cross-validation and the classification across studies. The color code at the left indicates the study (green = 1, blue = 2, red = 3, orange = 4), at the right the class (A: green = ER-, orange = ER+; B: green = G1, orange = G3).

Cited by

Effect of size and heterogeneity of samples on biomarker discovery: synthetic and real data assessment.
Di Camillo B, Sanavia T, Martini M, Jurman G, Sambo F, Barla A, Squillario M, Furlanello C, Toffolo G, Cobelli C. PLoS One. 2012;7(3):e32200. doi: 10.1371/journal.pone.0032200. Epub 2012 Mar 5. PMID: 22403633

Bayesian multi-source regression and monocyte-associated gene expression predict BCL-2 inhibitor resistance in acute myeloid leukemia.
White BS, Khan SA, Mason MJ, Ammad-Ud-Din M, Potdar S, Malani D, Kuusanmäki H, Druker BJ, Heckman C, Kallioniemi O, Kurtz SE, Porkka K, Tognon CE, Tyner JW, Aittokallio T, Wennerberg K, Guinney J. NPJ Precis Oncol. 2021 Jul 23;5(1):71. doi: 10.1038/s41698-021-00209-9. PMID: 34302041

Multiple-platform data integration method with application to combined analysis of microarray and proteomic data.
Wu S, Xu Y, Feng Z, Yang X, Wang X, Gao X. BMC Bioinformatics. 2012 Dec 2;13:320. doi: 10.1186/1471-2105-13-320. PMID: 23198695

Improving biomarker list stability by integration of biological knowledge in the learning process.
Sanavia T, Aiolli F, Da San Martino G, Bisognin A, Di Camillo B. BMC Bioinformatics. 2012 Mar 28;13 Suppl 4(Suppl 4):S22. doi: 10.1186/1471-2105-13-S4-S22. PMID: 22536969

A Comparison of Logistic Regression, Logic Regression, Classification Tree, and Random Forests to Identify Effective Gene-Gene and Gene-Environmental Interactions.
Yoo W, Ference BA, Cote ML, Schwartz A. Int J Appl Sci Technol. 2012 Aug;2(7):268. PMID: 23795347
