Why have so few proteomic biomarkers "survived" validation? (Sample size and independent validation considerations)

Belinda Hernández¹, Andrew Parnell, Stephen R Pennington

Affiliation

¹ Complex and Adaptive Systems Laboratory, School of Mathematical Sciences (Statistics), University College Dublin, Dublin, Ireland; School of Medicine and Medical Science, UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland.

PMID: 24737731
DOI: 10.1002/pmic.201300377

Proteomics. 2014 Jul;14(13-14):1587-92. Epub 2014 May 16.

Abstract

Proteomic biomarker discovery has led to the identification of numerous potential candidates for disease diagnosis, prognosis, and prediction of response to therapy. However, very few of these candidate biomarkers reach clinical validation and go on to be routinely used in clinical practice. One particular issue with biomarker discovery is the identification of significantly changing proteins in the initial discovery experiment that do not validate when subsequently tested on separate patient sample cohorts. Here, we seek to highlight some of the statistical challenges surrounding the analysis of LC-MS proteomic data for biomarker candidate discovery. We show that common statistical algorithms run on data with low sample sizes can overfit and yield misleading misclassification rates and AUC values. A common solution to this problem is to prefilter variables (e.g., via ANOVA and/or correction methods such as Bonferroni or false discovery rate) to give a smaller dataset and reduce the size of the apparent statistical challenge. However, we show that this exacerbates the problem, yielding even higher apparent performance metrics while reducing the predictive accuracy of the biomarker panel. To illustrate some of these limitations, we have run simulation analyses with known biomarkers. For our chosen algorithm (random forests), we show that the above problems are substantially reduced if a sufficient number of samples are analyzed and the data are not prefiltered. Our view is that LC-MS proteomic biomarker discovery data should be analyzed without prefiltering and that increasing the sample size in biomarker discovery experiments should be a very high priority.

Keywords: Bioinformatics; Biomarker panels; Cross-validation; Proteomic discovery; Random forest; Sample size.
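
One way prefiltering can inflate performance metrics at low sample sizes is sketched below. This is not the authors' simulation: it assumes pure-noise data (no true biomarkers), 30 samples, 500 features, a panel of 10 features, and it swaps in a simple nearest-centroid classifier with leave-one-out cross-validation in place of random forests so that it runs with the Python standard library alone. All function names and parameter values are hypothetical choices made for illustration.

```python
import random

def make_noise_data(n_samples, n_features, rng):
    """Pure-noise data (assumption): no feature is related to the class label."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [0] * (n_samples // 2) + [1] * (n_samples - n_samples // 2)
    return X, y

def mean(values):
    return sum(values) / len(values)

def top_features(X, y, rows, k):
    """Rank features by a t-like score computed on the given rows only."""
    scores = []
    for j in range(len(X[0])):
        a = [X[i][j] for i in rows if y[i] == 0]
        b = [X[i][j] for i in rows if y[i] == 1]
        ma, mb = mean(a), mean(b)
        va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
        vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
        scores.append((abs(ma - mb) / (va / len(a) + vb / len(b)) ** 0.5, j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def predict_nearest_centroid(X, y, train_rows, test_row, features):
    """Assign the test sample to the class whose centroid is closest."""
    dists = []
    for label in (0, 1):
        members = [i for i in train_rows if y[i] == label]
        d = 0.0
        for j in features:
            centroid = mean([X[i][j] for i in members])
            d += (X[test_row][j] - centroid) ** 2
        dists.append(d)
    return 0 if dists[0] <= dists[1] else 1

def loo_accuracy(X, y, k, prefilter):
    """Leave-one-out accuracy.

    prefilter=True  : features are chosen once, using ALL samples (including
                      the one later held out), then cross-validated.
    prefilter=False : features are re-chosen inside each fold, using only
                      the training samples of that fold.
    """
    all_rows = list(range(len(X)))
    global_features = top_features(X, y, all_rows, k) if prefilter else None
    correct = 0
    for i in all_rows:
        train = [r for r in all_rows if r != i]
        feats = global_features if prefilter else top_features(X, y, train, k)
        correct += predict_nearest_centroid(X, y, train, i, feats) == y[i]
    return correct / len(X)

rng = random.Random(0)  # fixed seed so the run is repeatable
X, y = make_noise_data(n_samples=30, n_features=500, rng=rng)
biased = loo_accuracy(X, y, k=10, prefilter=True)
honest = loo_accuracy(X, y, k=10, prefilter=False)
print(f"prefiltered on all samples: {biased:.2f}")
print(f"selection inside each fold: {honest:.2f}")
```

Because the data contain no signal, any accuracy well above chance in the prefiltered run reflects information leaking from the held-out samples into the feature selection step, not a real biomarker panel. Increasing `n_samples` in this sketch should shrink the gap between the two estimates.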
