Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Jan 3;73(1):183-194.e8.
doi: 10.1016/j.molcel.2018.10.037. Epub 2018 Nov 29.

A Multiplexed Assay for Exon Recognition Reveals that an Unappreciated Fraction of Rare Genetic Variants Cause Large-Effect Splicing Disruptions

Affiliations

A Multiplexed Assay for Exon Recognition Reveals that an Unappreciated Fraction of Rare Genetic Variants Cause Large-Effect Splicing Disruptions

Rockie Chong et al. Mol Cell. .

Abstract

Mutations that lead to splicing defects can have severe consequences on gene function and cause disease. Here, we explore how human genetic variation affects exon recognition by developing a multiplexed functional assay of splicing using Sort-seq (MFASS). We assayed 27,733 variants in the Exome Aggregation Consortium (ExAC) within or adjacent to 2,198 human exons in the MFASS minigene reporter and found that 3.8% (1,050) of variants, most of which are extremely rare, led to large-effect splice-disrupting variants (SDVs). Importantly, we find that 83% of SDVs are located outside of canonical splice sites, are distributed evenly across distinct exonic and intronic regions, and are difficult to predict a priori. Our results indicate extant, rare genetic variants can have large functional effects on splicing at appreciable rates, even outside the context of disease, and MFASS enables their empirical assessment at scale.

Keywords: exon recognition; massively parallel reporter assay; population variation; rare variation; splicing; variant classification.

PubMed Disclaimer

Conflict of interest statement

DECLARATION OF INTERESTS

The authors declare no competing interests.

Figures

Figure 1.
Figure 1.. Multiplexed Functional Assay of Splicing using Sort-Seq
(A) We cloned synthetic human exons (black) and surrounding intronic sequences (dark gray) into our reporter plasmid containing a split-GFP reporter with flanking constant intron backbones (light gray), followed by site-specific integration into HEK293T cells using Bxb1 integrase. Cells are sorted into bins based on fluorescence, followed by amplicon sequencing of DNA from cells in each sorted bin. We calculated exon inclusion index for each sequence using a weighted average of normalized read counts based on exon inclusion level from bins (STAR Methods). (B) We used FACS to sort the genomically integrated SRE library into three separate populations (left). After expansion, the sorted populations remained stable(right). GFP-int, GFP-intermediate. For this library (SMN1 intron backbone), we obtained ~4 million cells for GFP+ and GFPneg bins and 4.2 × 105 cells for GFPint bin. The percentage of cells sorted is as follows: GFP+ (33.3%); GFPneg (44.5%); and GFPint (5.6%). (C) The observed RNA splicing efficiencies of the sorted bins as measured by RT-PCR correspond almost directly with observed fluorescence of the bins. (D) We plotted the percentage of reads for each construct in the SRE library containing both natural and mutant exons (n = 10,477). We showed that most sequences fall predominantly into one bin, exhibiting either complete exon skipping or inclusion, allowing for facile classification of exon skipping variants of large effects (Δinclusion index %0.5). Corresponding exon inclusion indices for each bin are indicated at top panel. The data shown in (D) correspond to the SMN1 backbone. (E–G) SRE library splicing behavior replicates between individual biological replicates and across two constant intron backbones. Tetrachoric correlation indicates whether two distinct measurements are concordant in one of the four quadrants and is more suited to assess large-effect variants. (E) Exon inclusion indices show strong correlation between two independent biological replicates for C. griseus DHFR intron backbone (rt = 1.00, p < 1016, tetrachoric; r = 0.94, p < 1016, Pearson). (F) Exon inclusion indices show strong correlation between two independent biological replicates for human SMN1 intron backbone (rt = 0.97, p < 1016, tetrachoric; r = 0.89, p < 1016, Pearson). For (E) and (F), after calculation of correlation coefficients, sequences for which inclusion indices do not agree within 0.30 (outside the dashed lines) are excluded from subsequent analysis. (G) Results are robust across different intron backbones (rt = 0.96, p < 1016, tetrachoric; r = 0.85, p < 1016, Pearson). See also Figure S1.
Figure 2.
Figure 2.. Effects on Exon Recognition Are Not Easily Predicted across 6,713 Designed Mutations in Splicing Regulatory Elements
(A) We quantitatively measured exon inclusion for iteratively designed mutations (n = 6,713) across categories of splicing regulatory elements from 205 human exons (see Tables S1 and S2 for categorical explanations and definitions). We defined SDVs as variants that result in a Δinclusion index ≤ −0.5, relative to the wild-type sequence (STAR Methods). We only consider SNVs when the corresponding wild-type sequence is also detected, requiring that the wild-type exons demonstrate inclusion in our assay (inclusion index of ≥0.5) for variants to be considered an SDV. Here, we highlight the data for the SMN1 intron backbone and detected 21.3% (1,428/6,713) of variants as SDVs across all categories. See also Figure S2A for mutations to exons that are skipped in MFASS (inclusion index of <0.5) across designed categories. Splice acceptor, positions −20 to +3; splice donor, positions −3 to +6. (B) Mutating the splice acceptor and splice donor sites adversely affects exon inclusion based on MaxEnt prediction for included exons (inclusion index of ≥0.5; Yeo and Burge, 2004). (C) Decreasing overall exon hexamer score leads to more exon skipping. Hexamer scores are based on the HAL model (Rosenberg et al., 2015). An alternative score metric is evaluated in Figure S2B (Ke et al., 2011). ***p < 0.001. See also Figure S2 and Tables S1 and S2.
Figure 3.
Figure 3.. MFASS Enables Functional Characterization of Variant Effect on Splicing at Scale across Libraries of Human Exons and Variants
(A) The number of SNVs per exon sequence (top) and the Δinclusion index (bottom) of the 27,733 ExAC SNVs are plotted against the wild-type exon backgrounds (n = 2,198) and colored by the inclusion index of the corresponding wild-type (WT) sequence. Both the top and bottom panels are ordered in decreasing number of variants tested from 44 to 1 per human exon background, with an average of 12.6 human variants and 3.8 SDVs per assayed wild-type exon sequence background. We found 1,050 of 27,733 SNVs tested (3.8%) are SDVs (Δinclusion index ≤ −0.5) and are broadly spread across the 543 human exon backgrounds in 473 genes. Dashed line indicates the threshold (Δinclusion index = −0.5), below which we call SDVs. (B) The change in inclusion index as a function of relative position for our SNV library across 2,198 human exon sequences shows that the splice donor and acceptor sites are most sensitive to mutations. Intron-exon boundary on the left corresponds to the splice acceptor, and the intron-exon boundary on the right corresponds to the splice donor. The splice donor is more sensitive to mutation because its consensus site is longer and more conserved. The bottom panel displays the relative sensitivity of each position. Each bin corresponds to 1 or 2 nucleotides per position, and locations are relative as we test a range of exon lengths. (C) Three control sets for validating the SNV library (n = 1,072). Most control sequences that were designed to cause exon skipping led to almost complete loss of exon recognition. The three control sets were (1) scrambled sequences (n = 24), (2) a previously tested subset of exons that were skipped in the SRE library (n = 71), and (3) breakage of the splice sites (n = 977). The broken splice-signal control library mutates 5 splice sites (SD) at the downstream intron from GT to CC and 3 splice sites (SA) at the upstream intron from AG to TT. SA, splice acceptor; SD, splice donor. We included the distribution of wild-type sequences (i.e., natural exons; n = 2,339, of which 2,198 sequences have relevant SNV data; STAR Methods). These exons initially demonstrated exon inclusion in the SRE library (inclusion index ≥ 0.8), and we subsequently retested them and their associated SNVs in the SNV library. (D) We analyzed the effects of single-base deletions derived from synthetic errors on exon inclusion. We showed the effect of exon inclusion for synthetic deletions (n = 9,801) across replicates, with an SDV rate of 3.59%. We observed an enrichment of SDVs at or near the splice sites. (E) We validated large-effect rare variants detected by MFASS (n = 34) and their corresponding wild-type sequences. We measured exon inclusion in either the original sequence context examined in MFASS (n = 11) or as a more stringent test with an additional 130 bp of longer intronic contexts (n = 23) in HEK293T cells. For the longer set, we tested SDVs that represent variant classes in Figure 4B: missense variants (n = 3); synonymous variants (n = 3); intron variants (n = 4); splice donor (n = 4); splice acceptor (n = 5); and splice region variants (n = 4). The levels of exon inclusion were calculated for both the individual SDV and its respective wild-type sequence. All mutants were normalized to a no-insert control as a baseline of complete exon skipping for the assessment of change in exon inclusion. Dashed line indicates the threshold (D% inclusion = 50%), below which we call splicing-disrupting variants (SDVs). (F) To examine the cell-type specificity of SDVs, we further picked a subset of 15 SDVs from the long intronic context with the strongest change in inclusion levels for testing their effects across 4 cell types and validated reporter constructs for WT or the corresponding SDVs. n = WT, SDV: 14,15 (HEK293T), 14,15 (HeLa S3), 14,15 (HepG2), 14,15 (K562). We found that large-effect splicing disruptions are consistent across 4 cell types in all 15 of the splice-disrupting variants assayed (15 of 15; 100%). The generalizability of per variant exon inclusion measurements across cell types is included in Figure S3E. See also Figure S3.
Figure 4.
Figure 4.. Global Analysis of Splice-Disrupting Variants across 27,733 ExAC SNVs in or near 2,198 Human Exons
(A) We functionally classified our variants by variant class from the Ensembl variant effect predictor (STAR Methods). SDVs (n = 1,050) from natural genetic variation are split almost equally between exonic and intronic regions (blue and red, respectively). Dashed line separates the exonic regions (4%) and intronic regions (17%) of the splice region variants. Splice site variants are defined as those within 2 bp of intron adjacent to exon, whereas splice region variants are located 3 bp into the exon and 8 bp into the intron, excluding splice sites. (B) Splice site mutations are by far the most likely region to result in an SDV (left). However, because SNVs at splice sites are relatively rare, SDVs in regions other than the splice site constitute 83% of all SDVs (right). The distributions for non-SDVs across variant classes and the distribution of SDV effect sizes are shown in Figure S4A. (C) The percentage of SDVs as a function of position along the exon and surrounding intron sequence shows that splice donor regions are more sensitive Than splice acceptor regions (top panel).Plotted below is the average change in mammalian evolutionary conservation (phyloP score averages) and ExAC SNV density as a function of location. Each bin corresponds to 1 to 2 nucleotides per position, and locations are relative to account for variable exon length. See also Figure S4.
Figure 5.
Figure 5.. Population Genetics, Evolutionary, and Functional Analyses of SDVs across 27,733 ExAC SNVs
(A) The percentage of SDVs as a function of allele frequency shows significant reductions across allele frequencies from the Genome Aggregation Database (gnomAD) (chi-square test; p = 1.03 × 10−4). A vast majority (97.9%) of the ExAC variants assayed were rare (gnomAD global minor allele frequencies [MAF] ≤ 0.5%). Allele frequencies are not available for 2,460 variants because of insufficient coverage in gnomAD. (B) We analyzed the proportion of SDVs and PTVs in genes predicted to be intolerant to loss-of-function alleles (pLI ≥ 0.9) and tolerant genes. We observe both significantly fewer SDVs (two-tailed Fisher’s exact test; p = 3.03 × 10−11) and significant fewer PTVs (two-tailed Fisher’s exact test; p < 10−16) for exons within intolerant genes. Dashed lines mark the overall percentage of SDVs (3.8%) and PTVs (1.2%) in our dataset without considering the pLI metric. (C) SDVs are under stronger evolutionary conservation as evidenced by higher overall phyloP scores (Mann-Whitney U test; p < 10−16). (D) Within introns, we found that positions that are evolutionarily conserved (deleterious; phyloP > 2.0; purple) have a higher SDV rate than those under neutral (−1.2 ≤ phyloP ≤ 1.2; blue) or accelerating selection (phyloP < −2.0; green; two-tailed Fisher’s exact test; p < 10−16). (E) There are more SNVs outside of regions of high intron conservation, which leads to many SDVs located within nucleotides that display neutral selection. (F) We observed a significantly higher negative maximum change in predicted exonic hexamer scores within exonic SDVs than non-SDVs (Student’s t test; p < 10−16). ***p < 0.001. See also Figure S5.
Figure 6.
Figure 6.. Evaluation of Genomic and DeepLearning Predictors for Rare Variation on Splicing
(A) Functional prediction from SIFT and PolyPhen for missense SDVs (n = 250) show few are predicted to be loss-of-function variants. The distributions for missense non-SDVs for SIFT and PolyPhen are shown in Figure S6A. (B) Precision-recall curves for algorithms that can predict splicing or non-coding genetic variants. Dashed line represents the overall percentage of SDVs (3.8%) from MFASS. Corresponding receiver operating characteristic (ROC) curves are shown in Figure S6B. (C) Precision-recall curves for algorithms that can predict splicing or non-coding genetic variants, focusing on either intronic or exonic variants only. See also Figure S6.

References

    1. Adamson SI, Zhan L, and Graveley BR (2018). Vex-seq: high-throughput identification of the impact of genetic variation on pre-mRNA splicing efficiency. Genome Biol. 19, 71. - PMC - PubMed
    1. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, and Sunyaev SR (2010). A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249. - PMC - PubMed
    1. Altshuler D, Daly MJ, and Lander ES (2008). Genetic mapping in human disease. Science 322, 881–888. - PMC - PubMed
    1. Arias MA, Lubkin A, and Chasin LA (2015). Splicing of designer exons informs a biophysical model for exon definition. RNA 21, 213–229. - PMC - PubMed
    1. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, and Abecasis GR; 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature 526, 68–74. - PMC - PubMed

Publication types