BMC Bioinformatics. 2007 Oct 26;8:415.

Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets

Michael Gormley¹, William Dampier, Adam Ertel, Bilge Karacali, Aydin Tozeren

Affiliation

¹ School of Biomedical Engineering, Drexel University, Philadelphia, PA, USA. mpg33@drexel.edu

PMID: 17963508
PMCID: PMC2211325
DOI: 10.1186/1471-2105-8-415

Abstract

Background: Independently derived expression profiles of the same biological condition often have few genes in common. In this study, we used an iterative machine learning algorithm to create populations of expression profiles from publicly available microarray datasets of cancer (breast, lymphoma and renal) samples linked to clinical information. ROC curves were used to assess the prediction error of each profile for classification. We compared the prediction error of profiles correlated with molecular phenotype against that of profiles correlated with relapse-free status. The prediction error of profiles identified with supervised univariate feature selection algorithms was compared to that of profiles selected randomly from a) all genes on the microarray platform and b) a list of known disease-related genes (a priori selection). We also determined the relevance of expression profiles on test arrays from independent datasets, measured on either the same or different microarray platforms.

Results: Highly discriminative expression profiles were produced on both simulated gene expression data and expression data from breast cancer and lymphoma datasets on the basis of ER and BCL-6 expression, respectively. Use of relapse-free status to identify profiles for prognosis prediction resulted in poorly discriminative decision rules. Supervised feature selection resulted in more accurate classifications than random or a priori selection; however, the difference in prediction error decreased as the number of features increased. These results held when decision rules were applied across datasets to samples profiled on the same microarray platform.

Conclusion: Our results show that many gene sets predict molecular phenotypes accurately. Given this, expression profiles identified using different training datasets should be expected to show little agreement. In addition, we demonstrate the difficulty in predicting relapse directly from microarray data using supervised machine learning approaches. These findings are relevant to the use of molecular profiling for the identification of candidate biomarker panels.
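The comparison between supervised and random feature selection can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a signal-to-noise score for the univariate ranking, a textbook diagonal linear discriminant analysis (DLDA) rule with pooled per-gene variances, and a stratified 50/50 train/test split. All function names and parameter values (`simulate`, `prediction_error`, the gene counts, the mean shift) are hypothetical choices made for illustration.

```python
import random
from statistics import mean, pstdev


def auc(scores, labels):
    """AUC via the rank-sum identity: P(positive score > negative score)."""
    pos = [s for s, c in zip(scores, labels) if c == 1]
    neg = [s for s, c in zip(scores, labels) if c == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def signal_to_noise(values, labels):
    """Assumed univariate score: |mean1 - mean0| / (sd1 + sd0)."""
    a = [v for v, c in zip(values, labels) if c == 1]
    b = [v for v, c in zip(values, labels) if c == 0]
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b))


def dlda_fit(X, y, genes):
    """Per-gene class means and pooled within-class variance."""
    model = []
    for j in genes:
        a = [row[j] for row, c in zip(X, y) if c == 1]
        b = [row[j] for row, c in zip(X, y) if c == 0]
        ma, mb = mean(a), mean(b)
        ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
        model.append((j, ma, mb, ss / (len(a) + len(b) - 2)))
    return model


def dlda_score(model, row):
    """Distance to class 0 minus distance to class 1; larger favors class 1."""
    return sum(((row[j] - mb) ** 2 - (row[j] - ma) ** 2) / var
               for j, ma, mb, var in model)


def simulate(n_samples, n_genes, n_informative, shift, rng):
    """Gaussian noise; the first n_informative genes are shifted in class 1."""
    y = [i % 2 for i in range(n_samples)]
    X = [[rng.gauss(shift if c == 1 and j < n_informative else 0.0, 1.0)
          for j in range(n_genes)] for c in y]
    return X, y


def prediction_error(X, y, n_features, supervised, rng):
    """One train/test iteration; returns E = 1 - AUC on the held-out half."""
    pos = [i for i, c in enumerate(y) if c == 1]
    neg = [i for i, c in enumerate(y) if c == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = pos[:len(pos) // 2] + neg[:len(neg) // 2]
    test = pos[len(pos) // 2:] + neg[len(neg) // 2:]
    X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
    n_genes = len(X[0])
    if supervised:
        # Rank genes on the training half only, to avoid selection bias.
        ranked = sorted(range(n_genes), reverse=True,
                        key=lambda j: signal_to_noise([r[j] for r in X_tr], y_tr))
        genes = ranked[:n_features]
    else:
        genes = rng.sample(range(n_genes), n_features)
    model = dlda_fit(X_tr, y_tr, genes)
    scores = [dlda_score(model, X[i]) for i in test]
    return 1.0 - auc(scores, [y[i] for i in test])
```

Calling `prediction_error` repeatedly with `supervised=True` and `supervised=False` on the same simulated data, for increasing `n_features`, gives two distributions of 1-AUC whose gap can be compared as the feature count grows.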

Figures

**Figure 1**
**Classification of Simulated Gene Expression Data**. Receiver operating characteristic (ROC) curves showing classification performance of DLDA classifiers on simulated gene expression data. The symbols α and β are 1-specificity and sensitivity as described in the Methods section. Solid lines are average ROC curves over 100 iterations of training and test set selection. Dashed lines are empirical 95% confidence intervals. Bar plots give the mean 1-AUC (E) with error bars showing empirical 95% CIs.
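As a rough illustration of the quantities named in this caption, the following sketch builds ROC points (α, β) from classifier scores, computes E = 1-AUC with the trapezoidal rule, and takes a percentile interval over repeated iterations. The exact procedure is given in the paper's Methods section, which is not reproduced here, so the nearest-rank percentile interval and the function names are assumptions.

```python
def roc_points(scores, labels):
    """Return (alpha, beta) pairs: alpha = 1 - specificity, beta = sensitivity.

    labels are 0/1 integers; a sample is called positive when its score
    is at or above the current threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        called = [c for s, c in zip(scores, labels) if s >= threshold]
        tp = sum(called)
        fp = len(called) - tp
        points.append((fp / n_neg, tp / n_pos))
    return points


def one_minus_auc(points):
    """Prediction error E = 1 - AUC, with AUC from the trapezoidal rule."""
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return 1.0 - area


def empirical_ci(values, level=0.95):
    """Nearest-rank percentile interval over repeated train/test iterations."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lower = ordered[round((1 - level) / 2 * last)]
    upper = ordered[round((1 + level) / 2 * last)]
    return lower, upper
```

With 100 iterations, `empirical_ci` applied to the list of per-iteration E values gives the kind of interval drawn as error bars, and averaging the per-iteration ROC points at fixed α gives a mean curve.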

**Figure 2**
**Prediction error of DLDA classifiers trained and validated on breast cancer datasets**. Column 1: Classifiers trained on ER-status. Column 2: Classifiers trained on relapse-free status. E is the mean 1-AUC of the corresponding set of ROC curves, calculated as described in the Methods section. Error bars show empirical 95% CIs.

**Figure 3**
**Prediction error of DLDA classifiers trained and validated on a diffuse large B-cell lymphoma dataset**. Column 1: Classifiers trained on BCL-6 status. Column 2: Classifiers trained on relapse-free status. E is the mean 1-AUC of the corresponding set of ROC curves, calculated as described in the Methods section. Error bars show empirical 95% CIs.

**Figure 4**
**Sensitivity of classifiers to normalization and machine-learning parameters**. Decision rules trained and validated on breast cancer dataset GSE3494 using supervised feature selection. Row 1: Expression values obtained using different pre-processing algorithms. Row 2: Different univariate feature selection methods. Row 3: Different classification schemes. Row 4: Different modes of partitioning into training and test data. E is the mean 1-AUC for the corresponding set of ROC curves, calculated as described in the Methods section. Error bars are empirical 95% CIs.

**Figure 5**
**Kaplan-Meier plots of survival rates for predicted tumor classes with different feature selection/cross-validation methods**. Classifiers trained on the basis of relapse-free status on diffuse large B-cell lymphoma dataset GSE4475. Column 1: Signal-to-noise ratio. Column 2: Ratio of between-class to within-class sum of squares. Row 1: Leave-one-out cross-validation. All data used for training and testing. Row 2: Training and test sets selected randomly from the dataset. Training based on leave-one-out cross-validation.
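For readers unfamiliar with the plots described in this caption, a minimal sketch of the standard Kaplan-Meier product-limit estimator is shown below. It is a generic textbook version, not the authors' implementation. The argument names `times` and `events` are hypothetical, and the follow-up values in the usage lines are made up for illustration.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times[i]  : follow-up time of sample i
    events[i] : 1 if the event (for example relapse) was observed at times[i],
                0 if the sample was censored at that time
    Returns a list of (time, survival probability) at each observed event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        censored = sum(1 for ti, e in zip(times, events) if ti == t and e == 0)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Samples with an event or censoring at t leave the risk set.
        at_risk -= deaths + censored
    return curve


# One curve per predicted tumor class (illustrative values only).
predicted_good = kaplan_meier([5, 8, 12, 20], [0, 1, 0, 0])
predicted_poor = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Plotting one such step curve per predicted class shows whether the classifier separates samples into groups with visibly different survival.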

**Figure 6**
**Prediction error of DLDA classifiers on breast cancer datasets by within-dataset and across-dataset cross-validation**. Decision rules trained on ER-status. Ellipses are centered on the mean 1-AUC of the associated ROC curves. The major axis points in the direction of maximum variance. Lengths of the major and minor axes are proportional to the standard deviation of the data in each direction. Column 1: Prediction error of decision rules based on univariate ranking. Column 2: Prediction error of decision rules based on random selection of features from a subset with a priori disease relevance.
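The ellipse construction described in this caption can be sketched from the sample covariance of paired error values. The code below is an illustration under assumptions: it treats each point as a (within-dataset, across-dataset) pair of 1-AUC values, uses the closed-form eigen-decomposition of a 2x2 covariance matrix, and introduces a hypothetical `scale` factor for the proportionality constant, which the caption does not specify.

```python
import math
from statistics import mean


def error_ellipse(xs, ys, scale=1.0):
    """Ellipse centered on the mean, with the major axis along maximum variance.

    Axis lengths are scale * standard deviation along each principal direction.
    """
    cx, cy = mean(xs), mean(ys)
    n = len(xs)
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum((x - cx) ** 2 for x in xs) / (n - 1)
    c = sum((y - cy) ** 2 for y in ys) / (n - 1)
    b = sum((x - cx) * (y - cy) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + c) / 2
    root = math.hypot((a - c) / 2, b)
    var_major = half_trace + root
    var_minor = max(half_trace - root, 0.0)  # guard against rounding below zero
    # Angle of the major axis relative to the x axis, in radians.
    angle = 0.5 * math.atan2(2 * b, a - c)
    return {
        "center": (cx, cy),
        "angle": angle,
        "major": scale * math.sqrt(var_major),
        "minor": scale * math.sqrt(var_minor),
    }
```

An ellipse elongated along the diagonal indicates that within-dataset and across-dataset errors vary together, while a small, round ellipse indicates low variance in both.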

Cited by

Novianti PW, Jong VL, Roes KC, Eijkemans MJ. Meta-analysis approach as a gene selection method in class prediction: does it improve model performance? A case study in acute myeloid leukemia. BMC Bioinformatics. 2017 Apr 11;18(1):210. doi: 10.1186/s12859-017-1619-7. PMID: 28399794.
Hewitt SC, Winuthayanon W, Pockette B, Kerns RT, Foley JF, Flagler N, Ney E, Suksamrarn A, Piyachaturawat P, Bushel PR, Korach KS. Development of phenotypic and transcriptional biomarkers to evaluate relative activity of potentially estrogenic chemicals in ovariectomized mice. Environ Health Perspect. 2015 Apr;123(4):344-52. doi: 10.1289/ehp.1307935. Epub 2015 Jan 9. PMID: 25575267.
Sivaraksa M, Lowe D. Predictive gene lists for breast cancer prognosis: a topographic visualisation study. BMC Med Genomics. 2008 Apr 17;1:8. doi: 10.1186/1755-8794-1-8. PMID: 18419801.
Liu Y, Tozeren A. Modular composition predicts kinase/substrate interactions. BMC Bioinformatics. 2010 Jun 25;11:349. doi: 10.1186/1471-2105-11-349. PMID: 20579376.
Rezaei Tavirani M, Zamanian Azodi M, Rostami-Nejad M, Morravej H, Razzaghi Z, Okhovatian F, Rezaei-Tavirani M. Introducing Serine as Cardiovascular Disease Biomarker Candidate via Pathway Analysis. Galen Med J. 2020 Feb 10;9:e1696. doi: 10.31661/gmj.v9i0.1696. eCollection 2020. PMID: 34466570.

Grants and funding

232240/PHS HHS/United States