Variable selection in omics data: A practical evaluation of small sample sizes

Alexander Kirpich¹ ², Elizabeth A Ainsworth³ ⁴, Jessica M Wedow³, Jeremy R B Newman¹, George Michailidis² ⁵, Lauren M McIntyre¹ ² ⁶

Affiliations

¹ Department of Biology, University of Florida, Gainesville, FL, United States of America.
² Informatics Institute, University of Florida, Gainesville, FL, United States of America.
³ Department of Plant Biology, University of Illinois at Urbana-Champaign, Urbana, IL, United States of America.
⁴ USDA ARS Global Change and Photosynthesis Research Unit, Urbana, IL, United States of America.
⁵ Department of Statistics, University of Florida, Gainesville, FL, United States of America.
⁶ Genetics Institute, University of Florida, Gainesville, FL, United States of America.

PMID: 29927942
PMCID: PMC6013185
DOI: 10.1371/journal.pone.0197910

Alexander Kirpich et al. PLoS One. 2018 Jun 21;13(6):e0197910. doi: 10.1371/journal.pone.0197910. eCollection 2018.

Abstract

In omics experiments, variable selection involves a large number of metabolites/genes and a small number of samples (the n < p problem). The ultimate goal is often the identification of one or a few features that differ among conditions: a biomarker. Complicating biomarker identification, the p variables often contain a correlation structure due to the biology of the experiment, making it difficult to distinguish causal compounds from correlated compounds. Additionally, there may be elements in the experimental design (blocks, batches) that introduce structure in the data. While this problem has been discussed in the literature and various strategies proposed, the overfitting problems concomitant with such approaches are rarely acknowledged. Instead of viewing a single omics experiment as a definitive test for a biomarker, an unrealistic analytical goal, we propose to view such studies as screening studies where the goal is to reduce the number of features present in the second round of testing and to limit the Type II error. Using this perspective, the performance of LASSO, ridge regression and Elastic Net was compared with the performance of an ANOVA via a simulation study and two real data comparisons. Interestingly, a dramatic increase in the number of features had no effect on Type I error for the ANOVA approach. ANOVA, even without multiple test correction, has a low false positive rate in the scenarios tested. The Elastic Net has an inflated Type I error (from 10 to 50%) for small numbers of features, which increases with sample size. The Type II error rate for the ANOVA is comparable to or lower than that for the Elastic Net, leading us to conclude that an ANOVA is an effective analytical tool for the initial screening of features in omics experiments.
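The screening idea in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation design: it assumes two conditions, independent standard-normal features (no block correlation), five truly different features shifted by Δ = 0.8, and 200 noise features. The function names (`simulate`, `anova_pvalues`, `bh_reject`, `screen`) and all default values are hypothetical choices made for illustration. It runs a per-feature one-way ANOVA with SciPy's `f_oneway`, estimates Type I error and power at an unadjusted 0.05 threshold, and also counts features retained by a hand-written Benjamini-Hochberg step at an FDR cutoff of 0.2.

```python
import numpy as np
from scipy.stats import f_oneway


def simulate(n_per_group, n_signal, n_noise, delta, rng):
    """Two-condition data; only the first n_signal columns shift by delta (assumed design)."""
    p = n_signal + n_noise
    control = rng.standard_normal((n_per_group, p))
    treated = rng.standard_normal((n_per_group, p))
    treated[:, :n_signal] += delta
    return control, treated


def anova_pvalues(control, treated):
    """One-way ANOVA p-value for each feature (column), tested independently."""
    return np.array([
        f_oneway(control[:, j], treated[:, j]).pvalue
        for j in range(control.shape[1])
    ])


def bh_reject(pvals, q):
    """Benjamini-Hochberg step-up: boolean mask of features retained at FDR level q."""
    m = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= q * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank that meets its bound
        keep[order[:k + 1]] = True
    return keep


def screen(n_per_group=10, n_signal=5, n_noise=200, delta=0.8, reps=20, seed=0):
    """Average Type I error, power (unadjusted 0.05), and BH-retained count over reps."""
    rng = np.random.default_rng(seed)
    type1, power, kept = [], [], []
    for _ in range(reps):
        control, treated = simulate(n_per_group, n_signal, n_noise, delta, rng)
        pvals = anova_pvalues(control, treated)
        raw = pvals < 0.05
        power.append(raw[:n_signal].mean())   # truly different features detected
        type1.append(raw[n_signal:].mean())   # noise features falsely detected
        kept.append(bh_reject(pvals, q=0.2).sum())
    return float(np.mean(type1)), float(np.mean(power)), float(np.mean(kept))


if __name__ == "__main__":
    t1, pw, n_kept = screen()
    print(f"Type I error: {t1:.3f}, power: {pw:.3f}, mean features kept at FDR 0.2: {n_kept:.1f}")
```

Because each feature is tested on its own, the per-feature false positive rate in this sketch does not depend on how many features are screened, which is in line with the abstract's observation about ANOVA. Adding correlated blocks or a penalized regression comparison would require extending the simulated design.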

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1**
Visualization of power (left column) and Type I error (right column) estimates, comparing p = 205 (solid lines) and p = 2050 (dashed lines) features for ρ = 0.4 and sample sizes n = 10 (top row), n = 50 (middle row), and n = 100 (bottom row). The value of the penalty split parameter α is plotted on the x-axis. Type I error and power estimates are plotted on the y-axis for values of α in the range [0, 1] in 0.1 increments. In the left column, power estimates are provided based on the four different features for each of the effect sizes (Δ₁ = 0.2 is the red line, Δ₂ = 0.5 is the blue line, and Δ₃ = 0.8 is the green line). In the right column, Type I error estimates (beige lines) are based on the random noise features and are shown together with a 0.05 threshold plotted as a purple dashed line. The vertical dashed line in the right column plots corresponds to the penalty split value α = 0.5. The value α = 0 corresponds to ridge regression and α = 1 corresponds to LASSO.

**Fig 2**
Visualization of power (left column) and Type I error (right column) estimates, comparing p = 205 (solid lines) and p = 2050 (dashed lines) features for ρ = 0.8 and sample sizes n = 10 (top row), n = 50 (middle row), and n = 100 (bottom row). The value of the penalty split parameter α is plotted on the x-axis. Type I error and power estimates are plotted on the y-axis for values of α in the range [0, 1] in 0.1 increments. In the left column, power estimates are provided based on the four different features for each of the effect sizes (Δ₁ = 0.2 is the red line, Δ₂ = 0.5 is the blue line, and Δ₃ = 0.8 is the green line). In the right column, Type I error estimates (beige lines) are based on the random noise features and are shown together with a 0.05 threshold plotted as a purple dashed line. The vertical dashed line in the right column plots corresponds to the penalty split value α = 0.5. The value α = 0 corresponds to ridge regression and α = 1 corresponds to LASSO.
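For orientation, the role of the penalty split parameter α in Figs 1 and 2 can be written out explicitly. The form below is the parameterization used by the glmnet software for a Gaussian response; the abstract and captions do not state which convention the authors used, so this is given as an assumption. Here λ ≥ 0 is the overall penalty strength, X is the n × p feature matrix, and y is the response.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda \left[ \alpha\,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right],
  \qquad 0 \le \alpha \le 1 .
```

Setting α = 0 leaves only the squared ℓ₂ term (ridge regression), and α = 1 leaves only the ℓ₁ term (LASSO), matching the endpoints described in the captions; intermediate values give the Elastic Net.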

**Fig 3. Visualization of power and Type I error estimates comparison for p = 205 features, correlation ρ = 0.4, and all sample sizes.**
Each row of plots corresponds to a feature selection method. The ANOVA FDR adjustment cutoff is 0.2. The sample size (n) is displayed on the x-axis in all plots. The estimates of power and Type I error are provided on the y-axis. In the left column, power estimates are provided based on the four different features for each of the effect sizes (Δ₁ = 0.2 is the red line, Δ₂ = 0.5 is the blue line, and Δ₃ = 0.8 is the green line). The middle column displays, for each block and its corresponding effect size (Δ₁ = 0.2 is the red line, Δ₂ = 0.5 is the blue line, and Δ₃ = 0.8 is the green line), the proportion of detected features that are not themselves different but are correlated with the different features in the same block. In the right column, Type I error estimates (beige lines) are based on the random noise features and are shown together with a 0.05 threshold plotted as a purple dashed line.

**Fig 4. Visualization of power and Type I error estimates comparison for p = 2050 features, correlation ρ = 0.4, and all sample sizes.**
Each row of plots corresponds to a feature selection method. The ANOVA FDR adjustment cutoff is 0.2. The sample size (n) is displayed on the x-axis in all plots. The estimates of power and Type I error are provided on the y-axis. In the left column, power estimates are provided based on the four different features for each of the effect sizes (Δ₁ = 0.2 is the red line, Δ₂ = 0.5 is the blue line, and Δ₃ = 0.8 is the green line). The middle column displays, for each block and its corresponding effect size (Δ₁ = 0.2 is the red line, Δ₂ = 0.5 is the blue line, and Δ₃ = 0.8 is the green line), the proportion of detected features that are not themselves different but are correlated with the different features in the same block. In the right column, Type I error estimates (beige lines) are based on the random noise features and are shown together with a 0.05 threshold plotted as a purple dashed line.

**Fig 5. Venn diagrams depicting the results for the maize data.**
ANOVA (green), Elastic Net (blue) and LASSO (brown) are compared. Panel A shows the positive ion mode with an FDR for the ANOVA of 0.05, and Panel B shows the negative ion mode. Panel C shows the positive ion mode with an FDR of 0.2, and Panel D shows the negative ion mode.
