Review
Adv Exp Med Biol. 2016;919:463-492.
doi: 10.1007/978-3-319-41448-5_22.

Statistical Approaches to Candidate Biomarker Panel Selection


Heidi M Spratt et al. Adv Exp Med Biol. 2016.

Abstract

The statistical analysis of robust biomarker candidates is a complex process that involves several key steps in the overall biomarker development pipeline (see Fig. 22.1, Chap. 19). Initially, data visualization (Sect. 22.1, below) is important for detecting outliers and for getting a feel for the nature of the data, including whether there appear to be any differences among the groups being examined. From there, the data must be pre-processed (Sect. 22.2) so that outliers are handled, missing values are dealt with, and normality is assessed. Once the data have been cleaned and are ready for downstream analysis, hypothesis tests (Sect. 22.3) are performed to identify proteins that are differentially expressed. Because the number of differentially expressed proteins is usually larger than warrants further investigation (50+ proteins, versus just the handful that will be considered for a biomarker panel), some form of feature reduction (Sect. 22.4) should be performed to narrow the list of candidate biomarkers to a more manageable number. Once the list has been reduced to the proteins likely to be most useful for downstream classification, unsupervised or supervised learning is performed (Sects. 22.5 and 22.6, respectively).
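The pipeline described above (pre-processing, per-protein hypothesis tests with multiple-testing control, feature reduction, supervised learning) can be sketched end to end on synthetic data. The code below is an illustrative outline only, not the chapter's exact workflow: the group sizes, the log transform, Welch's t-test, the Benjamini-Hochberg q = 0.05 cutoff, and the random-forest importance ranking are all assumptions chosen for this sketch.

```python
# Illustrative sketch of a candidate-biomarker pipeline on synthetic data.
# All choices below (transform, test, q-value, classifier) are assumptions
# made for the example, not the chapter's prescribed methods.
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_a, n_b, n_prot = 30, 22, 100          # e.g. 30 DF vs 22 DHF subjects
X_a = rng.lognormal(mean=0.0, sigma=1.0, size=(n_a, n_prot))
X_b = rng.lognormal(mean=0.0, sigma=1.0, size=(n_b, n_prot))
X_b[:, :10] *= 4.0                      # first 10 "proteins" truly differ

# Sect. 22.2 pre-processing: log-transform skewed abundances toward normality
X = np.log(np.vstack([X_a, X_b]))
y = np.array([0] * n_a + [1] * n_b)

# Sect. 22.3 hypothesis tests: Welch's t-test per protein, then
# Benjamini-Hochberg FDR control at q = 0.05
pvals = np.array([
    stats.ttest_ind(X[y == 0, j], X[y == 1, j], equal_var=False).pvalue
    for j in range(n_prot)
])
order = np.argsort(pvals)
m, q = n_prot, 0.05
passes = pvals[order] <= (np.arange(1, m + 1) / m) * q
k = np.max(np.nonzero(passes)[0]) + 1 if passes.any() else 0
selected = order[:k]                    # differentially expressed candidates

# Sects. 22.4/22.6 feature reduction + supervised learning:
# rank the surviving candidates by random-forest variable importance
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X[:, selected], y)
ranked = selected[np.argsort(rf.feature_importances_)[::-1]]
print("candidates after FDR control:", sorted(selected.tolist()))
print("top-ranked by importance:", ranked[:5].tolist())
```

In this synthetic setting the FDR step should recover mostly the ten truly shifted proteins, and the importance ranking then plays the role of the feature-reduction step that trims a long differential list down to a panel-sized handful.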

Keywords: Candidate biomarker selection; Data clustering; Data consistency; Data inspection; Data normalization; Data transformations; Machine learning; Outlier detection.

Figures

Fig. 22.1
Histograms for IP-10 cytokine data. Dengue Fever is on the top; Dengue Hemorrhagic Fever is on the bottom
Fig. 22.2
Boxplots for IP-10 cytokine data. Dengue Fever is on the left; Dengue Hemorrhagic Fever is on the right
Fig. 22.3
(a) Q-Q plot of IP-10 cytokine data for Dengue Hemorrhagic Fever, (b) Q-Q plot of IP-10 cytokine data for Dengue Fever
Fig. 22.4
SAM result for Aspergillosis dataset
Fig. 22.5
Hierarchical clustering of Dengue Fever study. Subjects labeled 1–30 are subjects with Dengue Fever; subjects labeled 31–52 are subjects with Dengue Hemorrhagic Fever
Fig. 22.6
(a) CART tree for DF vs DHF comparison, (b) Variable importance for the CART model
Fig. 22.7
ROC curves for both the training and testing datasets. The blue curve represents the training data, and the red curve represents the testing data. The AUC for the training data is 0.90 and the AUC for the testing data is 0.47
Fig. 22.8
Random forests variable importance for the top 20 most important spots
Fig. 22.9
ROC curve for the data. The AUC for the ROC is 0.77
Fig. 22.10
MARS variable importance
Fig. 22.11
ROC curves for the training and testing data. The blue curve represents the training data and the red curve represents the testing data. The AUC for the training data is 1.0 and the AUC for the testing data is 0.63
Fig. 22.12
GPS variable importance for the top 20 most important spots
Fig. 22.13
ROC curve for the data. The blue curve represents the training data; the red curve represents the testing data. The AUC for the training ROC is 1.0; the AUC for the testing data is 0.92
Fig. 22.14
Partial residual plot
