BMC Bioinformatics. 2007 Oct 26:8:415.
doi: 10.1186/1471-2105-8-415.

Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets

Michael Gormley et al. BMC Bioinformatics.

Abstract

Background: Independently derived expression profiles of the same biological condition often have few genes in common. In this study, we created populations of expression profiles from publicly available microarray datasets of cancer (breast, lymphoma and renal) samples linked to clinical information with an iterative machine learning algorithm. ROC curves were used to assess the prediction error of each profile for classification. We compared the prediction error of profiles correlated with molecular phenotype against profiles correlated with relapse-free status. Prediction error of profiles identified with supervised univariate feature selection algorithms was compared to that of profiles selected randomly from a) all genes on the microarray platform and b) a list of known disease-related genes (a priori selection). We also determined the relevance of expression profiles on test arrays from independent datasets, measured on either the same or different microarray platforms.

Results: Highly discriminative expression profiles were produced on both simulated gene expression data and expression data from breast cancer and lymphoma datasets on the basis of ER and BCL-6 expression, respectively. Use of relapse-free status to identify profiles for prognosis prediction resulted in poorly discriminative decision rules. Supervised feature selection resulted in more accurate classifications than random or a priori selection; however, the difference in prediction error decreased as the number of features increased. These results held when decision rules were applied across datasets to samples profiled on the same microarray platform.

Conclusion: Our results show that many gene sets predict molecular phenotypes accurately. Given this, expression profiles identified using different training datasets should be expected to show little agreement. In addition, we demonstrate the difficulty in predicting relapse directly from microarray data using supervised machine learning approaches. These findings are relevant to the use of molecular profiling for the identification of candidate biomarker panels.

Figures

Figure 1
Classification of Simulated Gene Expression Data. Receiver operating characteristic (ROC) curves showing classification performance of DLDA classifiers on simulated gene expression data. The symbols α and β are 1-specificity and sensitivity as described in the Methods section. Solid lines are average ROC curves over 100 iterations of training and test set selection. Dashed lines are empirical 95% confidence intervals. Bar plots give the mean 1-AUC (E) with error bars showing empirical 95% CIs.
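The error measure E used in these captions (mean 1-AUC with an empirical 95% interval over repeated training/test splits) can be illustrated with a minimal sketch. The function names `auc`, `prediction_error` and `empirical_ci` are hypothetical, and the percentile rule below is an assumption; the paper's Methods section, which is not part of this record, defines the exact procedure.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels use 1 for the positive class and 0 for the negative class.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def prediction_error(scores, labels):
    """Prediction error for one train/test iteration: 1 - AUC."""
    return 1.0 - auc(scores, labels)


def empirical_ci(values, level=0.95):
    """Empirical interval: trim an equal number of values from each tail.

    Assumption: a simple order-statistic rule, not necessarily the
    authors' exact percentile definition.
    """
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * (1.0 - level) / 2.0)  # values dropped from each tail
    return ordered[k], ordered[n - 1 - k]
```

With 100 iterations, `empirical_ci` returns the 3rd smallest and 3rd largest error, and the mean of the 100 errors gives E.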
Figure 2
Prediction error of DLDA classifiers trained and validated on breast cancer datasets. Column 1: Classifiers trained on ER-status. Column 2: Classifiers trained on relapse-free status. E is the mean 1-AUC of the corresponding set of ROC curves, calculated as described in the Methods section. Error bars show empirical 95% CIs.
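For readers unfamiliar with DLDA (diagonal linear discriminant analysis), the following is a minimal sketch of the standard formulation: class means, a pooled per-feature variance, and assignment to the class with the smallest variance-scaled distance. It assumes equal class priors, and the names `fit_dlda` and `predict_dlda` are hypothetical; it is not taken from the authors' implementation.

```python
def fit_dlda(X, y):
    """Fit DLDA. X is a list of samples (lists of feature values), y the labels."""
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    means = {}
    pooled = [0.0] * p  # within-class sums of squares, per feature
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        for r in rows:
            for j in range(p):
                pooled[j] += (r[j] - means[c][j]) ** 2
    # Pooled within-class variance per feature (requires n > number of classes);
    # the small floor avoids division by zero for constant features.
    variances = [max(s / (n - len(classes)), 1e-12) for s in pooled]
    return means, variances


def predict_dlda(model, x):
    """Assign x to the class whose mean is nearest in variance-scaled distance."""
    means, variances = model

    def distance(c):
        return sum((xj - mj) ** 2 / vj
                   for xj, mj, vj in zip(x, means[c], variances))

    return min(means, key=distance)
```

Because the covariance matrix is restricted to its diagonal, the rule stays usable when genes far outnumber samples, which is the usual situation for microarray data.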
Figure 3
Prediction error of DLDA classifiers trained and validated on a diffuse large B-cell lymphoma dataset. Column 1: Classifiers trained on BCL-6 status. Column 2: Classifiers trained on relapse-free status. E is the mean 1-AUC of the corresponding set of ROC curves, calculated as described in the Methods section. Error bars show empirical 95% CIs.
Figure 4
Sensitivity of classifiers to normalization and machine-learning parameters. Decision rules trained and validated on breast cancer dataset GSE3494 using supervised feature selection. Row 1: Expression values obtained using different pre-processing algorithms. Row 2: Different univariate feature selection methods. Row 3: Different classification schemes. Row 4: Different modes of partitioning into training and test data. E is the mean 1-AUC for the corresponding set of ROC curves, calculated as described in the Methods section. Error bars are empirical 95% CIs.
Figure 5
Kaplan-Meier plots of survival rates for predicted tumor classes with different feature selection/cross-validation methods. Classifiers trained on the basis of relapse-free status on diffuse large B-cell lymphoma dataset GSE4475. Column 1: Signal-to-noise ratio. Column 2: Ratio of between-class to within-class sum of squares. Row 1: Leave-one-out cross-validation. All data used for training and testing. Row 2: Training and test sets selected randomly from the dataset. Training based on leave-one-out cross-validation.
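The two univariate feature scores named in this caption can be sketched as follows, using their commonly cited definitions: the signal-to-noise ratio as the difference in class means divided by the sum of class standard deviations, and BSS/WSS as the between-class over within-class sum of squares. The function names and the `top_features` helper are hypothetical, and the authors' exact variants may differ.

```python
from statistics import mean, stdev


def signal_to_noise(values, labels):
    """(mean of class 1 - mean of class 0) / (sd of class 1 + sd of class 0)."""
    a = [v for v, y in zip(values, labels) if y == 1]
    b = [v for v, y in zip(values, labels) if y == 0]
    # Raises ZeroDivisionError if both classes are constant for this gene.
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


def bss_wss(values, labels):
    """Between-class sum of squares divided by within-class sum of squares."""
    overall = mean(values)
    bss = 0.0
    wss = 0.0
    for c in set(labels):
        group = [v for v, y in zip(values, labels) if y == c]
        m = mean(group)
        bss += len(group) * (m - overall) ** 2
        wss += sum((v - m) ** 2 for v in group)
    return bss / wss


def top_features(X, labels, score, k):
    """Rank the columns of X by absolute score and return the top k indices."""
    p = len(X[0])
    scored = [(abs(score([row[j] for row in X], labels)), j) for j in range(p)]
    return [j for _, j in sorted(scored, reverse=True)[:k]]
```

To keep the test error estimate honest, the ranking would be computed on training samples only within each cross-validation split.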
Figure 6
Prediction error of DLDA classifiers on breast cancer datasets by within-dataset and across-dataset cross-validation. Decision rules trained on ER-status. Ellipses are centered on the mean 1-AUC of the associated ROC curves. The major axis points in the direction of maximum variance. Lengths of the major and minor axes are proportional to the standard deviation of the data in each direction. Column 1: Prediction error of decision rules based on univariate ranking. Column 2: Prediction error of decision rules based on random selection of features from a subset with a priori disease relevance.
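The ellipse construction described in this caption corresponds to an eigen-decomposition of a 2x2 covariance matrix. The sketch below assumes each plotted point pairs two error values (for example, within-dataset and across-dataset 1-AUC for the same decision rule); that pairing and the name `covariance_ellipse` are assumptions, not details confirmed by the caption.

```python
import math


def covariance_ellipse(xs, ys):
    """Return (center, major_sd, minor_sd, angle) for paired observations.

    major_sd and minor_sd are the standard deviations along the principal
    axes; angle is the direction of maximum variance, in radians.
    """
    n = len(xs)
    cx, cy = sum(xs) / n, sum(ys) / n
    sxx = sum((x - cx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - cy) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - cx) * (y - cy) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]]
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 0.0)  # guard against rounding below zero
    # Orientation of the leading eigenvector (the major axis)
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), math.sqrt(lam_major), math.sqrt(lam_minor), angle
```

Axis lengths in a plot would then be some constant multiple of `major_sd` and `minor_sd`, consistent with the caption's statement that they are proportional to the standard deviation in each direction.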
