Nat Genet. 2018 Aug;50(8):1171-1179.
doi: 10.1038/s41588-018-0160-6. Epub 2018 Jul 16.

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk

Jian Zhou et al. Nat Genet. 2018 Aug.

Abstract

Key challenges for human genetics, precision medicine and evolutionary biology include deciphering the regulatory code of gene expression and understanding the transcriptional effects of genome variation. However, this is extremely difficult because of the enormous scale of the noncoding mutation space. We developed a deep learning-based framework, ExPecto, that can accurately predict, ab initio from a DNA sequence, the tissue-specific transcriptional effects of mutations, including those that are rare or that have not been observed. We prioritized causal variants within disease- or trait-associated loci from all publicly available genome-wide association studies and experimentally validated predictions for four immune-related diseases. By exploiting the scalability of ExPecto, we characterized the regulatory mutation space for human RNA polymerase II-transcribed genes by in silico saturation mutagenesis and profiled more than 140 million promoter-proximal mutations. This enables probing of evolutionary constraints on gene expression and ab initio prediction of mutation disease effects, making ExPecto an end-to-end computational framework for the in silico prediction of expression and disease risk.

Conflict of interest statement

Competing interests

The authors declare no competing interests.

Figures

Figure 1. Deep learning-based sequence model accurately predicts cell type-specific gene expression
a) Schematic overview of the ExPecto sequence-based gene expression prediction framework. The predictive model contains three components: a deep convolutional neural network trained on chromatin profiling data that converts sequence to regulatory features, a spatial feature transformation module, and a linear model that predicts gene expression from the transformed nonlinear regulatory representations. b) Sequence-based gene expression predictions on holdout genes are highly correlated with RNA-seq observations. Predicted log RPKMs for 990 genes from the holdout chromosome chr8 (x-axis) were compared with experimentally measured log RPKMs (y-axis) in each of the six example tissues. Spearman correlations between predicted and observed values are shown. c) Cell type-specific expression models capture the tissue specificity of transcription. The heatmap shows, on holdout genes, correlations between cell type-specific expression profiles, measured as log fold change over the cell-type average, and the sequence-based predicted log fold changes. d) Predicted mutation effects from in silico mutagenesis of promoter-proximal regions of 23,779 genes show substantial variation, as indicated by color. The predicted effects of different variants at the same position were averaged. Genes are sorted by gene-wise average predicted mutation effect. Only positions with an average absolute log fold change larger than 0.5 are shown. −1000 is upstream of the TSS and +1000 is downstream (in base pairs). The whole blood model predictions are shown.
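The in silico mutagenesis described in panel d can be sketched as follows. This is a minimal illustration, not the ExPecto implementation: `toy_score` is a hypothetical stand-in for a trained expression model (it simply returns the GC fraction), and the function names are assumptions made for this example. The sketch enumerates every single-base substitution and averages the three predicted effects at each position, as the caption describes.

```python
from statistics import mean

BASES = "ACGT"

def saturation_mutagenesis(seq):
    """Yield (position, ref, alt, mutant_sequence) for every single-base substitution."""
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield i, ref, alt, seq[:i] + alt + seq[i + 1:]

def toy_score(seq):
    """Hypothetical stand-in for an expression model: the GC fraction of the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def position_effects(seq, score=toy_score):
    """Average the predicted effects of the three substitutions at each position."""
    ref_score = score(seq)
    per_pos = {}
    for i, _ref, _alt, mutant in saturation_mutagenesis(seq):
        per_pos.setdefault(i, []).append(score(mutant) - ref_score)
    return [mean(per_pos[i]) for i in range(len(seq))]

# Toy input: a 4-base sequence has 3 * 4 = 12 possible substitutions.
effects = position_effects("ACGT")
```

A sequence of length L yields 3L mutants, which is why exhaustive profiling of promoter-proximal regions across many genes requires a model that is cheap to evaluate.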
Figure 2. Tissue-specific prediction of expression-altering variations
a) eQTL direction prediction accuracy increases with the predicted magnitude of the variant effect. Each line shows performance for one eQTL study. The x-axis represents the predicted effect magnitude cutoff, measured as absolute log fold change. The y-axis represents the accuracy of predicting the direction of expression change for the variants above the corresponding effect magnitude. b) GWAS loci with stronger predicted variant effects are more likely to be replicated by separate studies. The curve of replication probability fitted with a generalized additive model is shown with its 95% confidence interval. The x-axis shows the maximum predicted absolute log fold change in expression across all non-cancer tissues. A GWAS locus is considered replicated if it is within 10 kb of the reported SNP of a different study.
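The evaluation in panel a can be sketched as a thresholded sign-agreement calculation. This is an illustrative sketch only: the function name, the inclusive cutoff and the toy numbers are assumptions, not values or code from the study.

```python
def direction_accuracy(predicted, observed, cutoff):
    """Fraction of variants with |predicted effect| >= cutoff whose predicted
    sign matches the observed eQTL sign. Returns None if no variant passes."""
    kept = [(p, o) for p, o in zip(predicted, observed) if abs(p) >= cutoff]
    if not kept:
        return None
    hits = sum(1 for p, o in kept if (p > 0) == (o > 0))
    return hits / len(kept)

# Made-up predicted log fold changes and observed eQTL directions (+1 / -1).
predicted = [0.05, -0.02, 0.4, -0.6, 0.01, -0.3]
observed = [-1, 1, 1, -1, 1, 1]

curve = {c: direction_accuracy(predicted, observed, c) for c in (0.0, 0.1, 0.35)}
```

In this toy data the accuracy rises as the cutoff increases, because the small-magnitude predictions are the ones with the wrong sign; the figure reports the same kind of curve per eQTL study.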
Figure 3. Prioritize putative causal variants from GWAS loci with expression effect prediction
(a, c, e) ExPecto expression effect prediction prioritizes putative causal SNPs in inflammatory bowel disease (a), Behcet’s disease (c), and chronic hepatitis B infection (e) GWAS loci. Linkage disequilibrium r² scores between the reported variant and LD variants in the study population are shown in the top panel (variants are indicated by × symbols), and the predicted expression effects (maximum across tissues) are shown in the bottom panel (variants are indicated by dots). The upper panels (GWAS-associated variants) show the reported SNP(s) from the GWAS studies, indicated by the dashed lines, and all variants in LD with this variant (r² > 0.25). The lower panels (ExPecto predicted effect) show the predicted effects of all LD variants; the ExPecto-predicted causal variant is indicated by the dashed line. (b, d, f) Luciferase reporter assays verified the predicted differential transcriptional regulatory activities of sequence elements carrying the risk allele and the non-risk allele of the prioritized variants, while showing no difference for the GWAS lead variants. The three top prioritized variants near IRGM (b), CCR1 (d), and HLA-DOA (f) showed differential transcriptional regulatory activity in the predicted direction, whereas the reported GWAS SNPs showed either no transcriptional activation activity or no detectable alteration in activity. Luciferase activity is normalized to the empty vector, which is indicated by the dotted line. Statistical significance was based on a two-sided t-test. Each allele was tested with at least 11 total replicates from 3 independent experiments (n=11 for the rs7616215 non-risk allele, n=12 for all other alleles). The central values of the boxplots represent the median, boxes extend from the 25th to the 75th percentile, and whiskers extend to the maximum and minimum values.
Figure 4. Variation potential is predictive of gene regulatory specificity, activation status, and evolutionary constraints
a) Schematic overview of the association between variation potential, gene expression, and evolutionary constraints. b) Gene expression specificity and activation status can be predicted from the magnitude and directionality of gene variation potential (VP). The position of each gene set is computed as the cumulative mutation effects (directionality) and cumulative absolute mutation effects (magnitude) across all genes in the set. Each gene set is colored by the directionality of its variation potential. See Supplementary Fig. 11 for the relationship between VP and gene-wise expression properties. Whole blood model predictions are shown as examples here. c) Inference of genes with putative directional evolutionary constraints from variation potentials. Each dot represents a gene. The x- and y-axes show the cumulative predicted mutation effects (log fold change) of positive- and negative-impact mutations within 1 kb of the TSS, respectively. See Methods and Supplementary Fig. 13 for details on determining the threshold for calling putative constrained genes. This example shows predictions from the subcutaneous adipose tissue model. d) Evolutionary and population genetics signatures show differential selective pressure on mutations in putative positive and negative constraint genes across evolutionary time scales. Selection pressures across mutations with different predicted effects (x-axes) are estimated from the proportion of high-variance sites among primate species (phyloP < −2.3, which corresponds to p < 0.005 for acceleration; left panel y-axis; primates), divergent sites between human and the inferred human-chimpanzee common ancestor (middle panel y-axis; human-chimpanzee), and common variant sites (minor allele frequency > 0.001) in human populations (right panel y-axis; human population). The error bars show 90% confidence intervals.
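The cumulative quantities in panels b and c can be sketched as simple sums over predicted mutation effects. This is a minimal sketch under stated assumptions: the function name and the toy effect values are hypothetical, and the threshold for calling a gene constrained (see Methods) is not reproduced here.

```python
def variation_potential(effects):
    """Summarize predicted mutation effects (log fold changes) for one gene.

    Returns the cumulative positive effects, the cumulative negative effects,
    their sum (directionality) and the sum of absolute effects (magnitude).
    """
    positive = sum(e for e in effects if e > 0)
    negative = sum(e for e in effects if e < 0)
    directionality = positive + negative
    magnitude = positive - negative
    return positive, negative, directionality, magnitude

# Made-up effects for mutations within 1 kb of a hypothetical gene's TSS.
pos, neg, direction, size = variation_potential([0.5, -0.25, 0.25, -1.0])
```

Here the negative-impact mutations dominate, so the directionality is negative even though the magnitude is large; the pair (`pos`, `neg`) corresponds to the two axes of panel c.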
Figure 5. Ab initio prediction of allele-specific disease risk integrating predicted expression effects and inferred evolutionary constraints
a) HGMD regulatory disease mutations with strong predicted effects violate the putative evolutionary constraints. The y-axis shows the ExPecto-predicted effects of annotated deleterious mutations (maximum across tissues). The x-axis shows the inferred evolutionary constraints, measured by the variation potential directionality score (sum of gene-wise predicted mutation effects within 1 kb of the TSS) in the tissue with the maximum predicted effect. Negative-effect mutations whose nearest gene is putatively constrained to be highly expressed are shown in blue, and positive-effect mutations whose nearest gene is putatively constrained to be lowly expressed are shown in red. b) The constraint violation score of prioritized GWAS LD variants is predictive of whether the reference allele or the alternative allele is the risk allele. The y-axis and x-axis show the true positive rate and false positive rate of the receiver operating characteristic curve, which shows the performance of the constraint violation score in predicting the GWAS disease risk allele. The constraint violation score is the product of the predicted variant effect and the variation potential directionality score. The median constraint violation score across all non-cancer tissues or cell types was used for each variant.
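The constraint violation score defined in panel b can be sketched directly from its description: a product per tissue, then a median across tissues. The function names and the toy numbers are assumptions for illustration, and the sketch does not encode how the score is mapped to a risk-allele call, which the caption does not specify.

```python
from statistics import median

def constraint_violation_score(variant_effect, vp_directionality):
    """Product of the predicted variant effect and the gene's VP directionality score."""
    return variant_effect * vp_directionality

def median_violation_score(per_tissue):
    """Median score across tissues for one variant.

    per_tissue: iterable of (predicted variant effect, VP directionality) pairs,
    one pair per tissue or cell type model.
    """
    return median(constraint_violation_score(e, d) for e, d in per_tissue)

# Made-up values for one variant in three hypothetical tissue models.
score = median_violation_score([(0.5, -2.0), (0.25, -2.0), (-0.5, 4.0)])
```

The three per-tissue products in this toy example are -1.0, -0.5 and -2.0, so the median is -1.0.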
