Review
Adv Exp Med Biol. 2016;919:463-492.
doi: 10.1007/978-3-319-41448-5_22.

Statistical Approaches to Candidate Biomarker Panel Selection


Heidi M Spratt et al. Adv Exp Med Biol. 2016.

Abstract

The statistical analysis of robust biomarker candidates is a complex process that involves several key steps in the overall biomarker development pipeline (see Fig. 22.1, Chap. 19). Initially, data visualization (Sect. 22.1, below) is important for detecting outliers and for getting a feel for the nature of the data, including whether there appear to be any differences among the groups being examined. From there, the data must be pre-processed (Sect. 22.2) so that outliers are handled, missing values are dealt with, and normality is assessed. Once the data have been cleaned and are ready for downstream analysis, hypothesis tests (Sect. 22.3) are performed to identify proteins that are differentially expressed. Because the number of differentially expressed proteins is usually larger than warrants further investigation (50+ proteins, versus just the handful that will be considered for a biomarker panel), some form of feature reduction (Sect. 22.4) should be performed to narrow the list of candidate biomarkers to a more manageable number. Once the list has been reduced to the proteins likely to be most useful for downstream classification, unsupervised or supervised learning is performed (Sects. 22.5 and 22.6, respectively).
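The pipeline described above (pre-processing, per-protein hypothesis tests with multiple-testing control, feature reduction, supervised learning) can be sketched end to end on synthetic data. The code below is an illustrative outline only, not the chapter's exact workflow: the group sizes, the log transform, Welch's t-test, the Benjamini-Hochberg q = 0.05 cutoff, and the random-forest importance ranking are all assumptions chosen for this sketch.

```python
# Illustrative sketch of a candidate-biomarker pipeline on synthetic data.
# All choices below (transform, test, q-value, classifier) are assumptions
# made for the example, not the chapter's prescribed methods.
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_a, n_b, n_prot = 30, 22, 100          # e.g. 30 DF vs 22 DHF subjects
X_a = rng.lognormal(mean=0.0, sigma=1.0, size=(n_a, n_prot))
X_b = rng.lognormal(mean=0.0, sigma=1.0, size=(n_b, n_prot))
X_b[:, :10] *= 4.0                      # first 10 "proteins" truly differ

# Sect. 22.2 pre-processing: log-transform skewed abundances toward normality
X = np.log(np.vstack([X_a, X_b]))
y = np.array([0] * n_a + [1] * n_b)

# Sect. 22.3 hypothesis tests: Welch's t-test per protein, then
# Benjamini-Hochberg FDR control at q = 0.05
pvals = np.array([
    stats.ttest_ind(X[y == 0, j], X[y == 1, j], equal_var=False).pvalue
    for j in range(n_prot)
])
order = np.argsort(pvals)
m, q = n_prot, 0.05
passes = pvals[order] <= (np.arange(1, m + 1) / m) * q
k = np.max(np.nonzero(passes)[0]) + 1 if passes.any() else 0
selected = order[:k]                    # differentially expressed candidates

# Sects. 22.4/22.6 feature reduction + supervised learning:
# rank the surviving candidates by random-forest variable importance
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X[:, selected], y)
ranked = selected[np.argsort(rf.feature_importances_)[::-1]]
print("candidates after FDR control:", sorted(selected.tolist()))
print("top-ranked by importance:", ranked[:5].tolist())
```

In this synthetic setting the FDR step should recover mostly the ten truly shifted proteins, and the importance ranking then plays the role of the feature-reduction step that trims a long differential list down to a panel-sized handful.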

Keywords: Candidate biomarker selection; Data clustering; Data consistency; Data inspection; Data normalization; Data transformations; Machine learning; Outlier detection.

Figures

Fig. 22.1
Histograms for IP-10 cytokine data. Dengue Fever is on the top; Dengue Hemorrhagic Fever is on the bottom
Fig. 22.2
Boxplots for IP-10 cytokine data. Dengue Fever is on the left; Dengue Hemorrhagic Fever is on the right
Fig. 22.3
(a) Q-Q plot of IP-10 cytokine data for Dengue Hemorrhagic Fever, (b) Q-Q plot of IP-10 cytokine data for Dengue Fever
Fig. 22.4
SAM result for Aspergillosis dataset
Fig. 22.5
Hierarchical clustering of Dengue Fever study. Subjects labeled 1–30 are subjects with Dengue Fever; subjects labeled 31–52 are subjects with Dengue Hemorrhagic Fever
Fig. 22.6
(a) CART tree for DF vs DHF comparison, (b) Variable importance for the CART model
Fig. 22.7
ROC curves for both the training and testing datasets. The blue curve represents the training data, and the red curve represents the testing data. The AUC for the training data is 0.90 and the AUC for the testing data is 0.47
Fig. 22.8
Random forests variable importance for the top 20 most important spots
Fig. 22.9
ROC curve for the data. The AUC for the ROC is 0.77
Fig. 22.10
MARS variable importance
Fig. 22.11
ROC curves for the training and testing data. The blue curve represents the training data and the red curve represents the testing data. The AUC for the training data is 1.0 and the AUC for the testing data is 0.63
Fig. 22.12
GPS variable importance for the top 20 most important spots
Fig. 22.13
ROC curve for the data. The blue curve represents the training data; the red curve represents the testing data. The AUC for the training ROC is 1.0; the AUC for the testing data is 0.92
Fig. 22.14
Partial residual plot
