Genome Res. 2021 Aug;31(8):1462-1473.
doi: 10.1101/gr.274696.120. Epub 2021 Jun 15.

Assessing conservation of alternative splicing with evolutionary splicing graphs


Diego Javier Zea et al. Genome Res. 2021 Aug.

Abstract

Understanding how protein function has evolved and diversified is of great importance for human genetics and medicine. Here, we tackle the problem of describing the whole transcript variability observed in several species by generalizing the definition of splicing graph. We provide a practical solution to construct parsimonious evolutionary splicing graphs where each node is a minimal transcript building block defined across species. We show a clear link between the functional relevance, tissue regulation, and conservation of alternative transcripts on a set of 50 genes. By scaling up to the whole human protein-coding genome, we identify a few thousand genes where alternative splicing modulates the number and composition of pseudorepeats. We have implemented our approach in ThorAxe, an efficient, versatile, robust, and freely available computational tool.


Figures

Figure 1.
Principle of the method. (A) Two transcripts are depicted, in which each gray box represents a genomic interval and contains the corresponding protein sequence. (Below) The minimal SG is shown, with the nodes (n1, n2, n3, n4) corresponding to subexons. The start and end nodes are added for convenience. Each structural edge in red corresponds to some intron, and each induced edge in green corresponds to a junction located inside the initial genomic interval (such as the donor site of exon AEIGV). (B) Close-up view of three SGs corresponding to three orthologous genes coming from human, gorilla, and cow, along with three examples of ESGs summarizing the same information. The nodes in the ESGs represent s-exons, or multiple sequence alignments (MSAs) of exonic regions. The details of the ESG scores, computed from Equation 1, are given in the inset table, with σmatch = 1, σmismatch = −0.5, and σgap = 0 for the MSA scores, and edge penalties σS = 0.5 and σI = 2. The best-scored ESG combines compactness (parsimony) with good-quality alignments. (C) Main steps of the ESG construction in ThorAxe. The input genes and transcripts are depicted on top, with exons displayed as boxes. The first step of ThorAxe consists of grouping similar exons together. Here, three clusters are identified, colored in red (1), cyan (2), and blue (3); note that cluster 2 groups multiple exons in human and cow. Then, subexons are defined based on intra-species transcript variability. For instance, the first exon from gorilla is split into two subexons. The subexons would be the nodes in the species-specific minimal SGs, although the latter are not explicitly computed by ThorAxe. The next step consists of aligning the sequences belonging to each cluster (with some padding “X” between mutually exclusive subexons) and identifying the spliced exons (s-exons) as blocks in the alignment. We keep track of the cluster IDs in the s-exon IDs, to ease interpretability. Finally, ThorAxe builds an ESG in which the nodes are the s-exons. For the sake of clarity, multiedges are visualized as single edges.
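The notion of subexon in panel A can be illustrated with a minimal Python sketch. It is not ThorAxe's code: the function name `subexons`, the half-open coordinate convention, and the toy transcripts are assumptions made for illustration. The idea shown is only that every exon boundary observed in any transcript of a gene cuts the exonic regions into minimal building blocks.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end); an assumed convention


def subexons(transcripts: List[List[Interval]]) -> List[Interval]:
    """Split the exons of one gene into minimal blocks delimited by every exon boundary."""
    exons = {iv for transcript in transcripts for iv in transcript}
    # Every exon start or end observed in any transcript is a potential cut point.
    bounds = sorted({b for start, end in exons for b in (start, end)})
    pieces = []
    for lo, hi in zip(bounds, bounds[1:]):
        # Keep a segment only if at least one exon covers it (otherwise it is intronic).
        if any(start <= lo and hi <= end for start, end in exons):
            pieces.append((lo, hi))
    return pieces


# Two hypothetical transcripts: the first exon of t2 ends earlier than in t1.
t1 = [(0, 10), (20, 30)]
t2 = [(0, 6), (20, 30), (40, 50)]
print(subexons([t1, t2]))  # [(0, 6), (6, 10), (20, 30), (40, 50)]
```

In this toy case the four resulting blocks play the role of the nodes n1 to n4 of a minimal SG; the structural and induced edges described in the caption are not modeled here.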
Figure 2.
Conservation and tissue regulation of a set of documented AS events. (A) Each event is designated by the name of the gene where it occurs and its rank in ThorAxe output, the latter reflecting its relative conservation level. In the ESG, an event corresponds to a pair of subpaths, one being canonical and the other alternative. Within each species, either none of the paths are supported by the data (gray), or only one path is supported (light orange), or both paths are supported (orange and dark orange). As data, we consider the gene annotations from Ensembl and the RNA-seq evidence compiled from public databases. When both paths are supported, we highlight the cases in which they are differentially expressed in at least one tissue in dark orange. The white cells indicate species in which a one-to-one ortholog of the human query gene could not be found. (B) For each species, the percentages of events supported by both gene annotations and RNA-seq (in green), by only RNA-seq (in yellow), by only gene annotations (in blue), or unsupported (in gray) are reported. An event is considered to be supported only if both its canonical and alternative subpaths are detected.
Figure 3.
Transcript variability in the CAMK2B linker. (A) Evolutionary splicing subgraph computed by ThorAxe starting from 63 transcripts annotated in 10 species. It corresponds to the region linking the kinase and hub domains of CAMK2B. The colors of the nodes and the edges indicate conservation levels, from yellow (low) to dark purple (high). Conservation is measured as the species fraction for the nodes (proportion of species where the s-exon is present) and as the averaged transcript fraction for the edges (averaged transcript inclusion rate of the s-exon junction). For ease of visualization, we filtered out the s-exons present in only one species. The events documented in the literature are located in the gray areas. (B) On top, genomic structure of the human gene. Each gray box corresponds to a genomic exon (nomenclature taken from Sloutsky and Stratton 2020). (Below) List of human transcripts. All of them have been described in the literature, referred to as β (Bulleit et al. 1988), βM (Bayer et al. 1998), βe (Brocke et al. 1995), β′e (Brocke et al. 1995), βe (Cook et al. 2018), α (Bulleit et al. 1988), 7 (Wang et al. 2000), and 6 (Wang et al. 2000). The functional roles of some exons (Bayer et al. 1998; Khan et al. 2019) are given. (C) Percent-spliced in (PSI) computed from RNA-seq splice junctions for the two documented AS events. The two pairs of alternative subpaths depicted on top are also highlighted on A.
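The two conservation measures named in panel A can be sketched as follows. This is an illustrative reading of the caption, not the ThorAxe implementation: the function names, the representation of transcripts as lists of s-exon IDs, and the choice to average only over species that have at least one transcript are all assumptions.

```python
from typing import Dict, List, Set, Tuple


def species_fraction(species_with_sexon: Set[str], all_species: Set[str]) -> float:
    """Proportion of the input species in which the s-exon is present."""
    return len(species_with_sexon & all_species) / len(all_species)


def averaged_transcript_fraction(
    transcripts_by_species: Dict[str, List[List[str]]],
    junction: Tuple[str, str],
) -> float:
    """Per-species fraction of transcripts using the junction, averaged over species.

    Each transcript is a list of s-exon IDs in order; a junction is a pair of
    consecutive s-exons. Species without transcripts are skipped (an assumption).
    """
    fractions = []
    for paths in transcripts_by_species.values():
        if not paths:
            continue
        hits = sum(1 for path in paths if junction in zip(path, path[1:]))
        fractions.append(hits / len(paths))
    return sum(fractions) / len(fractions) if fractions else 0.0


# Hypothetical toy data: s-exon IDs "a", "b", "c" in two species.
toy = {"human": [["a", "b", "c"], ["a", "c"]], "mouse": [["a", "b", "c"]]}
print(species_fraction({"human", "mouse"}, {"human", "mouse", "frog", "cow"}))  # 0.5
print(averaged_transcript_fraction(toy, ("a", "b")))  # 0.75
```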
Figure 4.
S-exon evolutionary profiling over the whole protein-coding human genome. (A) Percentages of s-exons conserved at different evolutionary distances from human (represented by dashed vertical lines). Each curve is centered on its corresponding species. The values at the origin are the percentages of conserved (i.e., not species-specific) s-exons. Conservation is then assessed at each evolutionary distance according to the s-exons possessing at least one representative in each phylogroup. For instance, we report 73%–76% of the s-exons of frog (pink curve) as conserved among eutherians (second dashed line) in the sense that they are also conserved in at least one primate (among human, gorilla, macaque) and at least one nonprimate eutherian (among rat, mouse, boar, cow). Likewise, conservation up to mammals (68%–72% for frog) would imply at least one primate, one nonprimate eutherian, and one noneutherian mammal. See also Supplemental Figure S10 for a version of this plot focusing on genes with one-to-one orthologs in more than seven species. (B) Cumulative distributions of s-exon species fraction. On the y-axis we report the percentage of s-exons with a species fraction greater than the x-axis value. The different curves correspond to all s-exons (All), only those involved in at least one event (Any event), or only those involved in a specific type of event. (Alter-S) alternative start; (Alter-I) alternative (internal); (Alter-E) alternative end; (Del) deletion; (Insert) insertion. (C) Heatmap of the s-exon phastCons median scores versus the s-exon species fractions. Only the s-exons longer than 10 residues and belonging to genes with one-to-one orthologs in more than seven species are shown. (D) Proportions of conserved s-exons displaying very poor (negative score) to very good (score close to one) alignment quality. The MSA score of an s-exon is computed as a normalized sum of pairs. A score of 1 indicates 100% sequence identity without any gap. The proportions are given for different s-exon selections (same labels as in B).
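As an illustration of the normalized sum-of-pairs score mentioned in panel D, here is a minimal Python sketch. The default scores (match 1, mismatch −0.5, gap 0) are those quoted in the Figure 1 caption; the normalization by the number of sequence pairs times the alignment length, and the treatment of any pair involving a gap as a gap, are assumptions chosen so that a gap-free, fully identical alignment scores exactly 1. The exact formula used by ThorAxe may differ.

```python
from itertools import combinations
from typing import List


def msa_score(msa: List[str], match: float = 1.0,
              mismatch: float = -0.5, gap: float = 0.0) -> float:
    """Normalized sum-of-pairs score of an alignment given as equal-length rows."""
    if len(msa) < 2 or not msa[0]:
        raise ValueError("need at least two non-empty aligned sequences")
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all rows must have the same length")
    total = 0.0
    for column in zip(*msa):
        for a, b in combinations(column, 2):
            if a == "-" or b == "-":
                total += gap       # any pair involving a gap (assumed convention)
            elif a == b:
                total += match
            else:
                total += mismatch
    n_pairs = len(msa) * (len(msa) - 1) // 2
    # Assumed normalization: maximum attainable score with match = 1.
    return total / (n_pairs * length)


print(msa_score(["AEIGV", "AEIGV", "AEIGV"]))  # 1.0
print(msa_score(["AE", "AQ"]))                 # 0.25
```

With these settings, gaps lower the score without making it negative, whereas an alignment dominated by mismatches yields a negative score, which matches the range described in the caption.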
Figure 5.
Examples of evolutionarily conserved events with in-gene paralogy. (A,B) ESGs computed by ThorAxe (left) and the best 3D templates found by HHblits (right); PDB codes 2w49:abuv (Wu et al. 2010) and 2dfs:H (Liu et al. 2006) for TPM1 and MYO1B. On the ESGs, the colors indicate conservation levels, species fraction for the nodes and averaged transcript fraction for the edges (Supplemental Methods). The nodes in yellow are species-specific, whereas those in dark purple are present in all species. The 3D structures show complexes between the query proteins (black) and several copies of their partners (light gray). The s-exons involved in conserved events are highlighted with colored spheres. (C) S-exon consensus sequence alignments within a gene family (TPM on top, MAPK in the middle) or a gene (MYO1B, at the bottom). Each letter reported is the amino acid conserved in all sequences of the corresponding MSA (allowing one substitution). The color scheme is that of Clustal X (Thompson et al. 1997). The subgraphs show the events in which the s-exons are involved. The symbols α and β on the right indicate groups of s-exons defined across paralogous genes based on sequence similarity (Supplemental Methods). The symbols at the bottom denote highly conserved positions across the gene family: (dot) fully conserved position; (square) position conserved only within each s-exon group; (upward triangle) position conserved in the α group only; (downward triangle) position conserved in the β group only. For MYO1B, the start and canonical sequence of the CALM1-binding IQ motif are indicated. The motifs resulting from different combinations of the depicted s-exons are numbered 4, 4/5, and 4/6 in the literature (Greenberg and Ostap 2013).
Figure 6.
Alternative usage of similar s-exons. (A) Evolutionary splicing subgraphs depicting different alternative usage scenarios. The detected s-exon pairs are colored in black. (MEX) mutual exclusivity; (ALT) alternative (non-mutually exclusive) usage; (REL) one s-exon is in the canonical or alternative subpath of an event (of any type), whereas the other one serves as a “canonical anchor” for the event; (UNREL) one s-exon is in the canonical or alternative subpath of an event (of any type), whereas the other one is located outside the event in the canonical transcript. Each detected pair is assigned to only one category with the following priority rule: MEX > ALT > REL > UNREL. (B) Venn diagram of the genes containing similar pairs of s-exons. The genes shown in Figure 5 are highlighted in the corresponding subsets. (C) Cumulative distributions of s-exon conservation. On the y-axis we report the percentage of s-exon pairs with species fraction greater than the x-axis value. The solid (respectively, dashed) curve corresponds to the highest (respectively, lowest) species fraction among the two s-exons in the pair. We report values only for the MEX (blue) and ALT (red) categories. (D) Distribution of per-gene s-exon pair number within each of the four categories. For instance, the yellow rectangle at x = 50 gives the number of genes with more than 10 and up to 50 UNREL s-exon pairs.
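The priority rule in panel A (MEX > ALT > REL > UNREL) amounts to picking the highest-ranked category among those a pair qualifies for. A minimal sketch, assuming the set of candidate categories for each pair has already been determined by some upstream step (the function name and input format are hypothetical):

```python
from typing import Iterable, Optional

# Categories in decreasing priority, as stated in the caption.
PRIORITY = ("MEX", "ALT", "REL", "UNREL")


def assign_category(candidates: Iterable[str]) -> Optional[str]:
    """Return the single highest-priority category among the candidates, if any."""
    candidate_set = set(candidates)
    for category in PRIORITY:
        if category in candidate_set:
            return category
    return None


print(assign_category({"REL", "ALT"}))  # ALT
```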


