Genet Epidemiol. 2010 Dec;34(8):879-91.
doi: 10.1002/gepi.20543.

SNP selection in genome-wide and candidate gene studies via penalized logistic regression

Free PMC article

Kristin L Ayers et al. Genet Epidemiol. 2010 Dec.

Abstract

Penalized regression methods offer an attractive alternative to single marker testing in genetic association analysis. Penalized regression methods shrink toward zero the coefficients of markers that have little apparent effect on the trait of interest, resulting in a parsimonious subset of what we hope are true pertinent predictors. Here we explore the performance of penalization in selecting SNPs as predictors in genetic association studies. The strength of the penalty can be chosen to select a good predictive model (via methods such as cross-validation, which is computationally expensive), to optimize a likelihood-based model selection criterion (such as the BIC), or to select a model that controls for type I error, as done here. We have investigated the performance of several penalized logistic regression approaches, simulating data under a variety of disease locus effect sizes and linkage disequilibrium patterns. We compared several penalties, including the elastic net, ridge, Lasso, MCP and the normal-exponential-γ (NEG) shrinkage prior implemented in the hyperlasso software, to standard single locus analysis and simple forward stepwise regression. We examined how markers enter the model as penalties and P-value thresholds are varied, and report the sensitivity and specificity of each of the methods. Results show that penalized methods outperform single marker analysis, with the main difference being that penalized methods allow the simultaneous inclusion of a number of markers, and generally do not allow correlated variables to enter the model, producing a sparse model in which most of the identified explanatory markers are accounted for.


Figures

Fig. 1
Analysis of simulated data from the CYP2D6 gene region assuming five causal loci with MAFs <10%. The first five plots show the absolute values of the regression coefficients for the program hyperlasso [Hoggart et al., 2008] as the penalty parameter λ is relaxed. The final plot is the −log P-values for the Armitage Trend test. Each causal locus is marked by a vertical line.
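The single-marker baseline in the final panel can be illustrated with a minimal sketch of the Cochran-Armitage trend test. It assumes additive genotype scores 0/1/2 and a 0/1 phenotype, and uses the standard identity that the trend statistic equals N times the squared Pearson correlation between genotype and phenotype, referred to a chi-square distribution with 1 degree of freedom. The function names are hypothetical and the base-10 logarithm is an assumption, since the caption only says "−log".

```python
import math

def trend_test(genotypes, phenotypes):
    """Cochran-Armitage trend test with additive scores (0, 1, 2).

    Returns (chi2, p_value), where chi2 = N * r^2 has 1 degree of freedom.
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_yy = sum((y - mean_y) ** 2 for y in phenotypes)
    s_gy = sum((g - mean_g) * (y - mean_y)
               for g, y in zip(genotypes, phenotypes))
    if s_gg == 0 or s_yy == 0:
        # Monomorphic marker or constant phenotype: no evidence of trend.
        return 0.0, 1.0
    chi2 = n * s_gy ** 2 / (s_gg * s_yy)
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def neg_log10_p(p_value):
    """Transform a P-value for plotting (base 10 is an assumption here)."""
    return -math.log10(p_value)
```

Applying `trend_test` to each SNP in turn and plotting `neg_log10_p` of the result against marker position would give a panel of the same kind as the last plot in Fig. 1.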
Fig. 2
Plots of the negative of the penalty functions −λf(β). The penalty (y-axis) is plotted against β (x-axis) for the Lasso, elastic net, ridge and MCP. The last plot is the NEG penalty f(β, λ), the log density of the NEG prior. The peaks of each function are at β = 0. In these plots, for each method, a λ value was selected to allow the penalty functions to be plotted on approximately the same scale. Other parameter values (such as the mixing parameter α in the elastic net) were set to the values used in the analysis.
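As a rough illustration of the shapes compared in Fig. 2, the sketch below writes out four of the penalties for a single coefficient β. The parameterizations are common textbook forms and are assumptions, not necessarily the ones used in the paper: the ridge term carries a factor of 1/2, the elastic net mixes the Lasso and ridge terms with weight `alpha`, and MCP takes an extra concavity parameter `gamma` > 1. The NEG penalty is omitted because its log density is not given in the caption.

```python
def lasso_penalty(beta, lam):
    """L1 penalty: lam * |beta|."""
    return lam * abs(beta)

def ridge_penalty(beta, lam):
    """L2 penalty with the conventional 1/2 factor (an assumption here)."""
    return lam * beta ** 2 / 2

def elastic_net_penalty(beta, lam, alpha):
    """Mixture of Lasso (alpha = 1) and ridge (alpha = 0)."""
    return lam * (alpha * abs(beta) + (1 - alpha) * beta ** 2 / 2)

def mcp_penalty(beta, lam, gamma):
    """Minimax concave penalty: Lasso-like near zero, flat beyond gamma * lam."""
    b = abs(beta)
    if b <= gamma * lam:
        return lam * b - b ** 2 / (2 * gamma)
    return gamma * lam ** 2 / 2

# Negating any of these, as in the figure, gives a curve that peaks at beta = 0.
grid = [i / 10 for i in range(-30, 31)]
neg_mcp_curve = [-mcp_penalty(b, lam=1.0, gamma=3.0) for b in grid]
```

The flat tail of MCP is the reason it shrinks large coefficients less than the Lasso does, while both share the sharp corner at zero that produces sparse models.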
Fig. 3
LD plots (pairwise r²) in three gene regions.
Fig. 4
Sensitivity (detection rates) versus 1-specificity (false-positive rates) as the penalty parameter λ is varied. Results are for seven different methods over 4,000 simulated SNPs with six causal loci. Note the difference in axis scales, as we are interested in low false-positive rates.
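Each point on the curves in Fig. 4 pairs a detection rate with a false-positive rate for one value of λ. A minimal sketch of that calculation follows, assuming a strict definition in which only the causal SNPs themselves count as true positives; the paper's own definition may differ, for example by crediting markers in strong LD with a causal locus. The function name and the SNP indices in the usage lines are hypothetical.

```python
def sensitivity_and_fpr(selected, causal, n_snps):
    """Return (sensitivity, 1 - specificity) for one fitted model.

    selected: indices of SNPs with non-zero coefficients
    causal:   indices of the simulated causal SNPs
    n_snps:   total number of SNPs analysed
    """
    selected, causal = set(selected), set(causal)
    true_pos = len(selected & causal)
    false_pos = len(selected - causal)
    n_null = n_snps - len(causal)  # SNPs that are not causal
    return true_pos / len(causal), false_pos / n_null

# Illustrative values only: six causal loci among 4,000 SNPs.
sens, fpr = sensitivity_and_fpr(
    selected=[3, 10, 57],
    causal=[3, 10, 20, 30, 40, 50],
    n_snps=4000,
)
```

Repeating this over a grid of λ values, and averaging over simulation replicates, would trace out one curve per method.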
Fig. 5
Sensitivity versus 1-specificity as the penalty parameter λ is varied in gene regions. Results are shown for each gene region under scenario 6 (five causal loci). The top row shows results for rare causal alleles, while the bottom row shows results for common alleles.
Fig. 6
Maximum LD of missed causal loci with detected loci. Results are shown as histograms of maximum LD (r²) of a missed causal locus with markers in the model. Presented are the results for the CYP2D6 gene region under scenario 6 (five common causal alleles) for the λ value chosen through permutation, where the y-axes are the counts over the 500 replicates.
Fig. 7
Maximum LD of false positives with causal loci. Results are shown as histograms of maximum LD (r²) that a false positive shares with any causal locus. Presented are the results for the CYP2D6 gene region under scenario 6 (five common causal alleles) for the λ value chosen through permutation, where the y-axes are the counts over the 500 replicates.
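The quantity summarized in Figs. 6 and 7 is the largest pairwise r² between one SNP and a set of other SNPs. The sketch below approximates r² by the squared Pearson correlation of 0/1/2 genotype vectors, which is an assumption: LD is strictly defined on haplotypes, and the paper may have computed it that way. The function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if s_xx == 0 or s_yy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return s_xy ** 2 / (s_xx * s_yy)

def max_ld(target, others):
    """Largest r^2 between `target` and any genotype vector in `others`."""
    return max((r_squared(target, g) for g in others), default=0.0)
```

For Fig. 6, `target` would be a missed causal locus and `others` the markers in the fitted model; for Fig. 7, `target` would be a false positive and `others` the causal loci. Collecting `max_ld` over replicates gives the values that are binned into the histograms.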

References

    1. Barratt BJ, Payne F, Lowe CE, Hermann R, Healy BC, Harold D, Concannon P, Gharani N, McCarthy MI, Olavensen MG, McCormack R, Guja C, Ionescu-Tirgoviste C, Undlien DE, Ronningen KS, Gillespie KM, Tuomilehto-Wolf E, Tuomilehto J, Benett ST, Clayton DG. Remapping the insulin gene/IDDM2 locus in type 1 diabetes. Diabetes. 2004;53:1884–1889.
    2. Breheny P, Huang J. Penalized methods for bi-level variable selection. Technical Report 393, Department of Statistics and Actuarial Science, University of Iowa. 2008.
    3. Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37:373–384. doi: 10.2307/1269730.
    4. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–140.
    5. Chadeau-Hyam M, Hoggart CJ, O'Reilly PF, Whittaker JC, Iorio MD, Balding DJ. Fregene: simulation of realistic sequence-level data in populations and ascertained samples. BMC Bioinformatics. 2008;9:364.
