Genet Epidemiol. 2010 Dec;34(8):879-91.

doi: 10.1002/gepi.20543.

SNP selection in genome-wide and candidate gene studies via penalized logistic regression

Kristin L Ayers¹, Heather J Cordell

PMID: 21104890
PMCID: PMC3410531
DOI: 10.1002/gepi.20543

Free PMC article

Kristin L Ayers et al. Genet Epidemiol. 2010 Dec.

Affiliation

¹ Institute of Human Genetics, Central Parkway, Newcastle upon Tyne, United Kingdom. kayers@ucla.edu

Abstract

Penalized regression methods offer an attractive alternative to single marker testing in genetic association analysis. They shrink toward zero the coefficients of markers that have little apparent effect on the trait of interest, resulting in a parsimonious subset of what we hope are true pertinent predictors. Here we explore the performance of penalization in selecting SNPs as predictors in genetic association studies. The strength of the penalty can be chosen to select a good predictive model (via methods such as cross validation, which is computationally expensive), to optimize a maximum likelihood-based model selection criterion (such as the BIC), or to select a model that controls the type I error, as done here. We investigated the performance of several penalized logistic regression approaches, simulating data under a variety of disease locus effect sizes and linkage disequilibrium patterns. We compared several penalties, including the elastic net, ridge, Lasso, MCP and the normal-exponential-γ shrinkage prior implemented in the hyperlasso software, to standard single locus analysis and simple forward stepwise regression. We examined how markers enter the model as penalties and P-value thresholds are varied, and report the sensitivity and specificity of each method. Results show that penalized methods outperform single marker analysis. The main difference is that penalized methods allow the simultaneous inclusion of a number of markers and generally do not allow correlated variables to enter the model, producing a sparse model in which most of the identified explanatory markers are accounted for.
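
To illustrate how a penalty shrinks marker coefficients to exactly zero, here is a minimal sketch of Lasso-penalized logistic regression fitted by proximal gradient descent. It is not the software or the fitting algorithm used in the study. The function names, the 1/n scaling of the log-likelihood, the fixed iteration count and the toy genotypes are all assumptions made for illustration.

```python
import math


def soft_threshold(z, t):
    """Proximal operator of t*|.|: shrink z toward zero, returning exactly 0 inside [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_logistic(X, y, lam, n_iter=500):
    """Minimize (1/n) * negative log-likelihood + lam * sum_j |beta_j|.

    X: list of rows of genotype codes (0, 1, 2); y: list of 0/1 case labels
    containing both cases and controls. The intercept is not penalized.
    Returns (intercept, beta).
    """
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    b0 = math.log(ybar / (1.0 - ybar))  # start at the intercept-only (null) fit
    beta = [0.0] * p
    # Step size 1/L, where L is bounded by (1/4) * the largest squared row norm
    # (the leading 1.0 accounts for the intercept column).
    step = 4.0 / max(1.0 + sum(v * v for v in row) for row in X)
    for _ in range(n_iter):
        # Residuals p_i - y_i under the current model.
        resid = []
        for row, yi in zip(X, y):
            eta = b0 + sum(b * v for b, v in zip(beta, row))
            resid.append(1.0 / (1.0 + math.exp(-eta)) - yi)
        grad0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * grad0  # plain gradient step for the unpenalized intercept
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return b0, beta


# Toy data (made up): SNP 0 tracks case status, SNP 1 does not.
X = [[0, 0], [0, 1], [0, 2], [1, 1], [1, 1], [2, 2], [2, 1], [2, 0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
_, beta_strong = lasso_logistic(X, y, lam=0.5)   # heavy penalty: nothing enters
_, beta_weak = lasso_logistic(X, y, lam=0.05)    # relaxed penalty: SNP 0 enters
```

With the heavy penalty every coefficient stays at zero, and relaxing λ lets the informative marker enter the model. Choosing λ to control the type I error (for example by permutation, as mentioned in the figure captions) is not shown here.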

Figures

**Fig. 1**
Analysis of simulated data from the *CYP2D6* gene region assuming five causal loci with MAFs <10%. The first five plots show the absolute values of the regression coefficients from the program *hyperlasso* [Hoggart et al., 2008] as the penalty parameter λ is relaxed. The final plot shows the −log P-values from the Armitage Trend test. Each causal locus is marked by a vertical line.
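
For readers unfamiliar with the single-marker comparison in this caption, the following is a minimal sketch of the Cochran-Armitage trend test for one SNP. It assumes additive weights (0, 1, 2) and a chi-square reference distribution with 1 degree of freedom; the function name and input layout are illustrative and not taken from the paper.

```python
import math


def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases, controls: counts of individuals carrying 0, 1 and 2 copies of an allele.
    Returns (chi-square statistic on 1 df, P-value).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases = sum(cases)
    sw_cases = sum(w * r for w, r in zip(weights, cases))
    sw_all = sum(w * t for w, t in zip(weights, totals))
    sw2_all = sum(w * w * t for w, t in zip(weights, totals))
    # Equivalent to n times the squared correlation between case status and genotype score.
    numerator = n * (n * sw_cases - n_cases * sw_all) ** 2
    denominator = n_cases * (n - n_cases) * (n * sw2_all - sw_all ** 2)
    chi2 = numerator / denominator
    # Upper tail of a chi-square with 1 df: P(Z^2 > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up counts: identical genotype distributions in cases and controls give no trend.
chi2_null, p_null = armitage_trend_test((10, 20, 10), (10, 20, 10))
```

The sketch assumes a polymorphic SNP and a sample containing both cases and controls; otherwise the denominator is zero.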

**Fig. 2**
Plots of the negative of the penalty functions −λf(β). The penalty (y-axis) is plotted against β (x-axis) for the Lasso, elastic net, ridge and MCP. The last plot is the NEG penalty f(β, λ), the log density of the NEG prior. The peaks of each function are at β = 0. In these plots, for each method, a λ value was selected to allow the penalty functions to be plotted on approximately the same scale. Other parameter values (such as the mixing parameter α in the elastic net) were set to the values used in the analysis.
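
The penalties named in this caption can be written as short functions of a single coefficient β. This is a sketch using common textbook parameterizations, which may differ from those in the paper by constant factors (some software, for instance, halves the ridge term in the elastic net). The parameter names `lam`, `alpha` and `gamma` are assumptions, and the NEG penalty is left out of this sketch.

```python
def lasso_penalty(beta, lam):
    """L1 penalty: lam * |beta|."""
    return lam * abs(beta)


def ridge_penalty(beta, lam):
    """L2 penalty: lam * beta^2."""
    return lam * beta * beta


def elastic_net_penalty(beta, lam, alpha):
    """Mixture of L1 and L2 penalties; alpha = 1 gives the Lasso, alpha = 0 gives ridge."""
    return lam * (alpha * abs(beta) + (1.0 - alpha) * beta * beta)


def mcp_penalty(beta, lam, gamma):
    """Minimax concave penalty: Lasso-like near zero, flat beyond |beta| = gamma * lam."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b * b / (2.0 * gamma)
    return 0.5 * gamma * lam * lam
```

All four functions equal zero at β = 0 and are non-negative elsewhere, so their negatives peak at β = 0 as the caption describes. The flat tail of the MCP is what lets large coefficients escape further shrinkage.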

**Fig. 3**
LD plots (pairwise r²) in three gene regions.

**Fig. 4**
Sensitivity (detection rates) versus 1-specificity (false-positive rates) as the penalty parameter λ is varied. Results are for seven different methods over 4,000 simulated SNPs with six causal loci. Note the difference in axis scales, as we are interested in low false-positive rates.
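
As a rough sketch of the two axes in this figure, the function below computes sensitivity and 1-specificity for one fitted model. It assumes a SNP counts as a detection only if it is itself a causal locus; the paper's exact scoring rule may differ. The function name and the example indices are hypothetical.

```python
def sensitivity_and_fpr(selected, causal, n_snps):
    """Return (sensitivity, 1 - specificity) for a set of selected SNP indices."""
    selected, causal = set(selected), set(causal)
    true_pos = len(selected & causal)
    false_pos = len(selected - causal)
    sensitivity = true_pos / len(causal)
    false_pos_rate = false_pos / (n_snps - len(causal))
    return sensitivity, false_pos_rate


# Made-up example: 4,000 SNPs, six causal loci, three of them detected plus two false positives.
sens, fpr = sensitivity_and_fpr(
    selected=[10, 20, 30, 777, 3999],
    causal=[10, 20, 30, 40, 50, 60],
    n_snps=4000,
)
```

Repeating this calculation over a grid of λ values traces out one curve per method.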

**Fig. 5**
Sensitivity versus 1-specificity as the penalty parameter λ is varied in gene regions. Results are shown for each gene region under scenario 6 (five causal loci). The top row shows results for rare causal alleles, while the bottom row shows results for common alleles.

**Fig. 6**
Maximum LD of missed causal loci with detected loci. Results are shown as histograms of maximum LD (r²) of a missed causal locus with markers in the model. Presented are the results for the *CYP2D6* gene region under scenario 6 (five common causal alleles) for the λ value chosen through permutation, where the y-axes are the counts over the 500 replicates.

**Fig. 7**
Maximum LD of false positives with causal loci. Results are shown as histograms of maximum LD (r²) that a false positive shares with any causal locus. Presented are the results for the *CYP2D6* gene region under scenario 6 (five common causal alleles) for the λ value chosen through permutation, where the y-axes are the counts over the 500 replicates.
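
The quantity summarized in Figs. 6 and 7, the maximum r² between one locus and a set of other loci, can be sketched as follows. This version approximates r² by the squared Pearson correlation of genotype dosages, which is an assumption; the paper may have used a haplotype-based estimate. Function names and the toy genotypes are illustrative only.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype vectors coded 0, 1, 2."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (v1 * v2)


def max_r2_with(genotypes, target, others):
    """Largest r^2 between SNP `target` and any SNP index in `others` (0.0 if empty)."""
    return max((genotype_r2(genotypes[target], genotypes[j]) for j in others), default=0.0)


# Toy genotypes (made up), one list per SNP across four individuals.
genotypes = [
    [0, 1, 2, 1],  # SNP 0
    [2, 1, 0, 1],  # SNP 1: perfectly anti-correlated with SNP 0
    [1, 0, 1, 2],  # SNP 2: uncorrelated with SNP 0
]
r2_missed = max_r2_with(genotypes, target=0, others=[1, 2])
```

For Fig. 6, `target` would be a missed causal locus and `others` the markers in the model. For Fig. 7, `target` would be a false positive and `others` the causal loci.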

Grants and funding

087436/Wellcome Trust/United Kingdom
