Comparative Study
BMC Bioinformatics. 2012 May 10:13:88.
doi: 10.1186/1471-2105-13-88.

SparSNP: fast and memory-efficient analysis of all SNPs for phenotype prediction

Gad Abraham et al. BMC Bioinformatics.

Abstract

Background: A central goal of genomics is to predict phenotypic variation from genetic variation. Fitting predictive models to genome-wide and whole genome single nucleotide polymorphism (SNP) profiles allows us to estimate the predictive power of the SNPs and potentially develop diagnostic models for disease. However, many current datasets cannot be analysed with standard tools due to their large size.

Results: We introduce SparSNP, a tool for fitting lasso linear models to massive SNP datasets quickly and with very low memory requirements. In an analysis of a large celiac disease case/control dataset, we show that SparSNP runs substantially faster than four other state-of-the-art tools for fitting large-scale penalised models. SparSNP was one of only two tools that could successfully fit models to the entire celiac disease dataset, and it did so with superior performance. In cross-validation, the models generated by SparSNP had predictive performance better than or equal to that of the other tools.
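
To make the phrase "lasso linear models" concrete, the following is a minimal sketch of lasso-penalised linear regression fitted by cyclic coordinate descent with soft thresholding. It is an illustration under stated assumptions, not SparSNP's implementation: the squared-error loss, the function names `soft_threshold` and `lasso_cd`, the in-memory list-of-rows layout, and the toy data are all hypothetical choices made for this example.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of n rows with p numeric features each (for SNPs, e.g. centred
    allele dosages). No intercept is fitted; centre y and the columns of X first.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant (all-zero after centring) column is skipped
            # Covariance of column j with the residual that excludes j's own effect
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                beta[j] = new_bj
    return beta


# Toy data (hypothetical): y depends only on the first of two orthogonal columns.
X_toy = [[-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]
y_toy = [-2.0, 2.0, -2.0, 2.0]
beta_sparse = lasso_cd(X_toy, y_toy, lam=0.5)
beta_empty = lasso_cd(X_toy, y_toy, lam=3.0)
```

On the toy data, a moderate penalty keeps only the informative column in the model, while a sufficiently large penalty shrinks every coefficient to zero. Lowering the penalty step by step is how a sparse model grows from no SNPs to many.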

Conclusions: Genomic datasets are rapidly increasing in size, rendering existing approaches to model fitting impractical due to their prohibitive time or memory requirements. This study shows that SparSNP is an essential addition to the genomic analysis toolkit. SparSNP is available at http://www.genomics.csse.unimelb.edu.au/SparSNP.

Figures

Figure 1
SparSNP analysis pipeline. An example pipeline for analysing a SNP discovery dataset with SparSNP and testing the model on a validation dataset. Most of the data preparation and processing can be done with PLINK.
Figure 2
Timing experiments. Time (in seconds) for model fitting over sub-samples of the celiac disease dataset, taken as the minimum time over 10 independent runs. The inset panel shows the results for 50,000 SNPs in more detail; note the different scales. For in-memory methods we included the time to read the data into memory. For SparSNP and glmnet we used a penalty grid of size 20 and a maximum model size of 2048 SNPs. LIBLINEAR (denoted “LL-L1”) and LIBLINEAR-CDBLOCK (denoted “LL-CD-L2”) induced one model with C = 1. LIBLINEAR-CDBLOCK used m = 50 blocks. For some datasets, glmnet and LIBLINEAR did not complete, and these running times are not shown. HyperLasso is not shown since it took much longer to complete than the other methods.
Figure 3
Prediction experiments. LOESS-smoothed AUC and explained phenotypic variance (denoted “VarExp”) for the Finnish celiac disease dataset, for increasing model sizes. AUC is estimated over 20×3-fold cross-validation, except for HyperLasso, for which we ran only 2×3-fold cross-validation due to the high computational cost. The explained phenotypic variance is estimated from the AUC using the method of [11], assuming a population prevalence of celiac disease K = 1%. Note that glmnet, HyperLasso, LIBLINEAR (denoted “LL-L1”), and SparSNP used an ℓ1-penalised model, whereas LIBLINEAR-CDBLOCK (denoted “LL-CD-L2”) used an ℓ2-penalised model (non-sparse), inducing a model using all 516,504 SNPs; it is therefore shown as a horizontal line across all model sizes. Note that tuning the ℓ2 penalty for LIBLINEAR-CDBLOCK resulted in very similar AUC.

References

    1. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM. Finding the missing heritability of complex diseases. Nature. 2009;461:747-753.
    2. Tibshirani R. Regression Shrinkage and Selection via the Lasso. J R Statist Soc B. 1996;58:267-288.
    3. Wu TT, Chen YF, Hastie T, Sobel E, Lange K. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics. 2009;25:714-721.
    4. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81:559-575.
    5. Dubois PCA, Trynka G, Franke L, Hunt KA, Romanos J, Curtotti A, Zhernakova A, Heap GAR, Ádány R, Aromaa A, Bardella MT, van den Berg LH, Bockett NA, de la Concha EG, Dema B, Fehrmann RSN, Fernández-Arquero M, Fiatal S, Grandone E, Green PM, Groen HJM, Gwilliam R, Houwen RHJ, Hunt SE, Kaukinen K, Kelleher D, Korponay-Szabo I, Kurppa K, Macmathuna P, Mäki M, Mazzilli MC, Mccann OT, Mearin ML, Mein CA, Mirza MM, Mistry V, Mora B, Morley KI, Mulder CJ, Murray JA, Núñez C, Oosterom E, Ophoff RA, Polanco I, Peltonen L, Platteel M, Rybak A, Salomaa V, Schweizer JJ, Sperandeo MP, Tack GJ, Turner G, Veldink JH, Verbeek WHM, Weersma RK, Wolters VM, Urcelay E, Cukrowska B, Greco L, Neuhausen SL, McManus R, Barisani D, Deloukas P, Barrett JC, Saavalainen P, Wijmenga C, van Heel DA. Multiple common variants for celiac disease influencing immune gene expression. Nat Genet. 2010;42:295-304.
