2020 Dec;82(5):1273-1300. doi: 10.1111/rssb.12388. Epub 2020 Jul 10.

A simple new approach to variable selection in regression, with application to genetic fine mapping

Gao Wang et al. J R Stat Soc Series B Stat Methodol. 2020 Dec.

Abstract

We introduce a simple new approach to variable selection in linear regression, with a particular focus on quantifying uncertainty in which variables should be selected. The approach is based on a new model - the "Sum of Single Effects" (SuSiE) model - which comes from writing the sparse vector of regression coefficients as a sum of "single-effect" vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure - Iterative Bayesian Stepwise Selection (IBSS) - which is a Bayesian analogue of stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We provide a formal justification of this intuitive algorithm by showing that it optimizes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally yields convenient novel summaries of uncertainty in variable selection, providing a Credible Set of variables for each selection. Our methods are particularly well-suited to settings where variables are highly correlated and detectable effects are sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate their application to fine-mapping genetic variants influencing alternative splicing in human cell-lines. We also discuss the potential and challenges for applying these methods to generic variable selection problems.
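As a rough illustration of the ideas in the abstract, the following is a minimal numerical sketch of a Bayesian single-effect regression step and the IBSS loop. It is not the authors' implementation (that is the susieR package); the function names, the fixed prior and residual variances, and the uniform prior over which variable carries the effect are simplifying assumptions made here.

```python
import numpy as np

def single_effect_regression(X, y, sigma2=1.0, sigma0_sq=1.0):
    """Bayesian single-effect regression: assume exactly one variable has a
    non-zero coefficient. Returns posterior inclusion probabilities (alpha)
    and the posterior mean coefficient vector, under a uniform prior on
    which variable is the effect variable."""
    xtx = np.sum(X**2, axis=0)          # x_j' x_j for each column j
    bhat = (X.T @ y) / xtx              # per-variable least-squares estimate
    s2 = sigma2 / xtx                   # sampling variance of each bhat_j
    # log Bayes factor comparing b_j ~ N(0, sigma0_sq) against b_j = 0
    log_bf = (0.5 * np.log(s2 / (s2 + sigma0_sq))
              + 0.5 * bhat**2 / s2 * sigma0_sq / (s2 + sigma0_sq))
    alpha = np.exp(log_bf - log_bf.max())
    alpha /= alpha.sum()                # posterior distribution over variables
    post_mean = sigma0_sq / (sigma0_sq + s2) * bhat  # shrunken estimate
    return alpha, alpha * post_mean

def ibss(X, y, L=5, n_iter=100):
    """Iterative Bayesian stepwise selection: fit L single effects by
    repeatedly regressing each effect on the residuals from the others."""
    n, p = X.shape
    B = np.zeros((L, p))                # posterior mean of each single effect
    alphas = np.zeros((L, p))
    for _ in range(n_iter):
        for l in range(L):
            r = y - X @ (B.sum(axis=0) - B[l])  # residualize other effects
            alphas[l], B[l] = single_effect_regression(X, r)
    return alphas, B
```

Each pass residualizes out the other effects, so each single-effect fit plays the role of one step of stepwise selection, but returns a distribution (alpha) over candidate variables rather than a single chosen variable.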

Keywords: genetic fine-mapping; linear regression; sparse; variable selection; variational inference.

Figures

FIGURE A1. Correlations among variables (SNPs) in an example data set used in the fine-mapping comparisons.
Left-hand panel shows correlations among variables shown at positions 100–200 in Fig. 1; right-hand panel shows correlations among variables shown at positions 350–450. For more details on this example data set, see Section 4.1 in the main text.
FIGURE A2. Splicing QTL enrichment analysis results.
Estimated odds ratios, with error bars of ± 2 standard errors, for each variant annotation, obtained by comparing the annotations of SNPs inside primary/secondary CSs against random “control” SNPs outside CSs. The p-values are from a two-sided Fisher’s exact test, without multiple-testing correction. The vertical line in each plot is drawn at an odds ratio of 1.
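For concreteness, the enrichment computation described in this caption reduces to a 2×2 table per annotation. Below is a minimal sketch with hypothetical counts; it uses the Wald standard error of the log odds ratio for the ± 2 standard error interval, whereas the caption's p-values come from Fisher's exact test (e.g. scipy.stats.fisher_exact).

```python
import math

def odds_ratio_interval(a, b, c, d):
    """2x2 table: a = annotated SNPs inside CSs, b = unannotated inside,
    c = annotated control SNPs outside CSs, d = unannotated controls.
    Returns (odds ratio, lower, upper) for an interval of +/- 2 Wald
    standard errors on the log-odds scale."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)   # Wald SE of log(OR)
    return (or_,
            math.exp(math.log(or_) - 2 * se),
            math.exp(math.log(or_) + 2 * se))

# hypothetical counts, for illustration only
or_, lo, hi = odds_ratio_interval(30, 70, 10, 90)
```

An interval whose lower endpoint exceeds 1 indicates enrichment of the annotation among SNPs inside CSs.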
FIGURE 1. Fine-mapping example illustrating that the IBSS algorithm can handle a challenging case.
Results are from a simulated data set with p = 1,000 variables (SNPs). Some of these variables are very strongly correlated (Figure A1). Two of the 1,000 variables are effect variables (red points, labeled “SNP 1” and “SNP 2” in the left-hand panel). We chose this example from our simulations because the strongest marginal association (SMA) is a non-effect variable (yellow point, labeled “SMA” in the left-hand panel). After one iteration (middle panel), IBSS incorrectly identifies a CS containing the SMA and no effect variable (orange points). However, after 10 iterations (and also at convergence) the IBSS algorithm has corrected itself (right-hand panel), finding two 95% CSs — dark blue and green open circles — each containing a true effect variable. Additionally, neither CS contains the SMA variable. One CS (blue) contains only 3 SNPs (purity of 0.85), whereas the other CS (green) contains 37 very highly correlated variables (purity of 0.97). In the latter CS, the individual PIPs are small, but the inclusion of the 37 variables in this CS indicates, correctly, high confidence in at least one effect variable among them.
FIGURE 2. Evaluation of posterior inclusion probabilities (PIPs).
Scatterplots in Panel A compare PIPs computed by SuSiE against PIPs computed using other methods (DAP-G, CAVIAR, FINEMAP). Each point depicts a single variable in one of the simulations: dark red points represent true effect variables, whereas light gray points represent variables with no effect. Panel B summarizes power versus FDR across the first set of simulations. These curves are obtained by independently varying the PIP threshold for each method. The open circles in the left-hand plot highlight power versus FDR at PIP thresholds of 0.9 and 0.95. These quantities are calculated as FDR = FP/(TP + FP) (also known as the “false discovery proportion”) and power = TP/(TP + FN), where FP, TP, FN and TN denote the number of False Positives, True Positives, False Negatives and True Negatives, respectively. (This plot is the same as a precision-recall curve after reversing the x-axis, because precision = TP/(TP + FP) = 1 − FDR, and recall = power.) Note that CAVIAR and FINEMAP were run only on data sets with 1–3 effect variables.
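The FDR and power definitions in the caption translate directly into code. Here is a minimal sketch (the function name is mine, not from the paper) of computing both at a given PIP threshold, as one would do to trace the curves in Panel B.

```python
import numpy as np

def power_fdr_at_threshold(pip, is_effect, threshold):
    """Select variables with PIP >= threshold, then compute
    FDR = FP / (TP + FP) and power = TP / (TP + FN) against the truth."""
    selected = np.asarray(pip) >= threshold
    truth = np.asarray(is_effect, dtype=bool)
    tp = np.sum(selected & truth)    # true effects that were selected
    fp = np.sum(selected & ~truth)   # non-effects that were selected
    fn = np.sum(~selected & truth)   # true effects that were missed
    fdr = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    power = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return fdr, power
```

Sweeping the threshold from 1 down to 0 traces one power-versus-FDR curve per method.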
FIGURE 3. Comparison of 95% credible sets (CSs) from SuSiE and DAP-G.
Panels show (A) coverage, (B) power, (C) median size and (D) average squared correlation of the variables in each CS. These statistics are averaged over all CSs computed in all data sets; error bars in Panel A show 2 × the standard error. Simulations with 1–5 effect variables are from the first simulation scenario, and simulations with 10 effect variables are from the second scenario.
FIGURE 4. Illustration of SuSiE applied to two change point problems.
The top panel shows a simulated example with seven change points (the vertical black lines). The blue horizontal lines show the mean function inferred by the segment method from the DNAcopy R package (version 1.56.0). The inference is reasonably accurate — all change points except the left-most one are nearly exactly recovered — but provides no indication of uncertainty in the locations of the change points. The red regions depict the 95% CSs for change point locations inferred by SuSiE; in this example, every CS contains a true change point. The bottom panel shows a simulated example with two change points in quick succession. This example is intended to illustrate convergence of the IBSS algorithm to a (poor) local optimum. The black line shows the fit from the IBSS algorithm when it is initialized to a null model in which there are no change points; this fit results in no change points being detected. The red line also shows the result of running IBSS, but this time the fitting algorithm is initialized to the true model with two change points. The latter accurately recovers both change points, and attains a higher value of the objective function (−148.2 versus −181.8).
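The change point application can be cast as variable selection by giving SuSiE one candidate step function per possible change location, so that selecting a variable corresponds to placing a change point there. A minimal sketch of that design matrix follows (the exact construction in the paper may differ):

```python
import numpy as np

def step_basis(T):
    """Design matrix for change point detection as variable selection:
    column j is a step function equal to 0 up to time j and 1 after,
    so a non-zero coefficient on column j means a change point at j."""
    # X[t, j] = 1 if t > j, else 0
    return np.tril(np.ones((T, T - 1)), k=-1)
```

With this X, a credible set of selected columns becomes a credible set of change point locations, which is exactly the uncertainty summary the red regions in the top panel display.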

References

    1. Arnold T and Tibshirani R (2016) Efficient implementations of the generalized lasso dual path algorithm. Journal of Computational and Graphical Statistics, 25, 1–27.
    2. Barber RF and Candès EJ (2015) Controlling the false discovery rate via knockoffs. Annals of Statistics, 43, 2055–2085.
    3. Benner C, Spencer CC, Havulinna AS, Salomaa V, Ripatti S and Pirinen M (2016) FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics, 32, 1493–1501.
    4. Bertsimas D, King A and Mazumder R (2016) Best subset selection via a modern optimization lens. Annals of Statistics, 44, 813–852.
    5. Blei DM, Kucukelbir A and McAuliffe JD (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association, 112, 859–877.