Nat Genet. 2015 Aug;47(8):955-61.
doi: 10.1038/ng.3331. Epub 2015 Jun 15.

A method to predict the impact of regulatory variants from DNA sequence

Dongwon Lee et al. Nat Genet. 2015 Aug.

Abstract

Most variants implicated in common human disease by genome-wide association studies (GWAS) lie in noncoding sequence intervals. Despite the suggestion that regulatory element disruption represents a common theme, identifying causal risk variants within implicated genomic regions remains a major challenge. Here we present a new sequence-based computational method to predict the effect of regulatory variation, using a classifier (gkm-SVM) that encodes cell type-specific regulatory sequence vocabularies. The induced change in the gkm-SVM score, deltaSVM, quantifies the effect of variants. We show that deltaSVM accurately predicts the impact of SNPs on DNase I sensitivity in their native genomic contexts and accurately predicts the results of dense mutagenesis of several enhancers in reporter assays. Previously validated GWAS SNPs yield large deltaSVM scores, and we predict new risk-conferring SNPs for several autoimmune diseases. Thus, deltaSVM provides a powerful computational approach to systematically identify functional regulatory variants.
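The score change described above can be illustrated with a minimal sketch. It assumes, following the Figure 1 caption, that a trained model is available as a table of per-k-mer weights, and that a variant's deltaSVM is the summed weight of the k-mers overlapping the alternate allele minus the summed weight of those overlapping the reference allele. The function names, the toy weight table, and the short `k=3` used in the usage example are hypothetical and chosen for illustration only; the sketch also ignores reverse-complement handling and is not the authors' software.

```python
from typing import Dict

K = 10  # k-mer length; the Figure 1 caption describes a vocabulary of 10-mers


def score_window(seq: str, weights: Dict[str, float], k: int = K) -> float:
    """Sum the weights of every k-mer in seq; k-mers missing from the table count as 0."""
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


def delta_svm(ref_seq: str, pos: int, alt: str,
              weights: Dict[str, float], k: int = K) -> float:
    """Score change when the base at 0-based position pos is replaced by alt.

    Only k-mers that overlap pos can change, so scoring is restricted to
    the window of up to 2k - 1 bases centred on the variant.
    """
    if alt == ref_seq[pos]:
        return 0.0
    start = max(0, pos - k + 1)
    end = min(len(ref_seq), pos + k)
    ref_window = ref_seq[start:end]
    offset = pos - start
    alt_window = ref_window[:offset] + alt + ref_window[offset + 1:]
    return score_window(alt_window, weights, k) - score_window(ref_window, weights, k)


# Toy usage with hypothetical 3-mer weights (real weights would come from a trained model).
toy_weights = {"ACG": 1.0, "CGT": 0.5, "ATG": -2.0}
effect = delta_svm("AACGTT", pos=2, alt="T", weights=toy_weights, k=3)
print(effect)
```

In this toy case the substitution removes two positively weighted 3-mers and creates one negatively weighted 3-mer, so the score change is negative, which under the assumptions above would be read as a predicted loss of regulatory activity.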

Conflict of interest statement

Financial Interests Statement

The authors declare no competing financial interests.

Figures

Figure 1. Overview of our deltaSVM method
[left] The first step in calculating deltaSVM is to train a gkm-SVM classifier using a positive training set of putative regulatory sequences (identified by DNase I hypersensitivity, for example) and a negative training set of matched control sequences. The gkm-SVM generates a regulatory sequence vocabulary – a weighted list of all possible 10-mers, where each 10-mer receives an SVM weight that quantifies its contribution to the prediction of regulatory function. [right] After training, this regulatory sequence vocabulary can be used to score the predicted impact of any sequence variant on regulatory activity, as shown here for a single nucleotide substitution in the Tyrp1 melanocyte enhancer.
Figure 2. deltaSVM can accurately predict SNPs associated with DNase I hypersensitivity
(a) An example of a deltaSVM calculation using a known dsQTL SNP (rs4953223). (b) 10-mer gkm-SVM scores across the dsQTL locus containing rs4953223. Only the functional SNP produces dramatic changes in gkm-SVM scores. (c) Effect sizes of dsQTL SNPs from Ref. are well correlated with their deltaSVM scores. (d–e) deltaSVM predicts dsQTLs with far greater accuracy than existing methods. Discriminative power is compared between methods using a 50x larger control SNP set. (d) ROC curve. (e) Precision-recall curve.
Figure 3. deltaSVM is strongly positively correlated with dsQTL effect size, and positively or negatively correlated with eQTL effect size depending on the sign of the correlation of dsQTL and eQTL
Degner et al. reported that 16% of the dsQTLs were also eQTLs, but that 30% of the eQTL dsQTLs were anti-correlated with the expression change. Our predictions are consistent with this observation: (a) deltaSVM is always positively correlated with dsQTL effect size (beta), (b) but because eQTL beta and dsQTL beta are anti-correlated 30% of the time, (c) deltaSVM and eQTL beta are correlated (positively or negatively) only if we treat the activating dsQTLs (red) and repressive dsQTLs (blue) separately. (d) Bases predicted to reduce the activity of functional regions are evolutionarily constrained.
Figure 4. deltaSVM accurately predicts change in luciferase expression in targeted mutagenesis of Tyr and Tyrp1 melanocyte enhancers
(a,b) Base-by-base evaluation of all possible substitutions as scored by deltaSVM. Black circles mark substitutions that were tested in luciferase assays. Orange bars show positions of the previously characterized binding sites. (c,d) Correlation of deltaSVM prediction and observed normalized luciferase expression. Blue circles indicate previously tested binding sites. Error bars show one standard deviation of the changes in luciferase expression (4 biological replicates per variant).
Figure 5. deltaSVM accurately predicts change of expression in massively parallel reporter assays
(a) Correlations of deltaSVM predictions and observed in vivo mutation effect size in the ALDOB enhancer in mice. (b) Correlation of deltaSVM and mutated enhancers in K562 cells. (c) Correlation of deltaSVM and mutated enhancers in HepG2 cells.
Figure 6. deltaSVM only identifies validated causal SNPs when trained on the appropriate cell type
(a) Three validated GWAS SNPs from Rfx6 (1st column), Bcl11a (2nd column), and Sort1 (3rd column) and flanking negative SNPs were each scored with deltaSVM trained on all three relevant cell types. The validated SNPs are properly identified from among flanking SNPs when deltaSVM is trained on the appropriate cell type (red) but not on other cell types (blue). (b,c) Scoring autoimmune GWAS loci with deltaSVM trained on Th1 yields the high-confidence causal SNPs listed in Table 2. The BACH2 locus is shown in (b) as an example. Error bar in (c) is one standard deviation of the expected binomial distribution.

