Bioinformatics. 2023 Jul 1;39(7):btad419.
doi: 10.1093/bioinformatics/btad419.

Counting pseudoalignments to novel splicing events

Luka Borozan et al. Bioinformatics.

Abstract

Motivation: Alternative splicing (AS) of introns from pre-mRNA produces diverse sets of transcripts across cell types and tissues, but is also dysregulated in many diseases. Alignment-free computational methods have greatly accelerated the quantification of mRNA transcripts from short RNA-seq reads, but they inherently rely on a catalog of known transcripts and might miss novel, disease-specific splicing events. By contrast, alignment of reads to the genome can effectively identify novel exonic segments and introns. Event-based methods then count how many reads align to predefined features. However, an alignment is more expensive to compute and constitutes a bottleneck in many AS analysis methods.

Results: Here, we propose fortuna, a method that guesses novel combinations of annotated splice sites to create transcript fragments. It then pseudoaligns reads to these fragments using kallisto and efficiently derives counts of the most elementary splicing units from kallisto's equivalence classes. These counts can be used directly for AS analysis or summarized into the larger units used by other widely applied methods. In experiments on synthetic and real data, fortuna was around 7× faster than traditional align-and-count approaches and analyzed almost 300 million reads in just 15 min when using four threads. It mapped reads containing mismatches more accurately across novel junctions and found more reads supporting aberrant splicing events in patients with autism spectrum disorder than existing methods did. We further used fortuna to identify novel, tissue-specific splicing events in Drosophila.
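
The idea of guessing novel combinations of annotated splice sites can be illustrated with a minimal Python sketch. It is not fortuna's implementation: the function name `candidate_novel_junctions`, the representation of introns as (donor, acceptor) coordinate pairs on the forward strand, and the toy coordinates are all assumptions made for illustration. The sketch pairs every annotated donor with every downstream annotated acceptor and keeps only the pairs that are not already annotated introns.

```python
from itertools import product


def candidate_novel_junctions(annotated_introns):
    """Return donor-acceptor pairs built from annotated splice sites
    that do not correspond to an annotated intron.

    annotated_introns: iterable of (donor, acceptor) positions on the
    forward strand of a single gene (hypothetical representation).
    """
    annotated = set(annotated_introns)
    donors = sorted({donor for donor, _ in annotated})
    acceptors = sorted({acceptor for _, acceptor in annotated})
    return [
        (donor, acceptor)
        for donor, acceptor in product(donors, acceptors)
        # A junction must run from a donor to a downstream acceptor
        # and must not already be part of the annotation.
        if donor < acceptor and (donor, acceptor) not in annotated
    ]


# Toy gene with two annotated introns (made-up coordinates).
introns = [(100, 200), (300, 400)]
novel = candidate_novel_junctions(introns)
print(novel)
```

In this toy gene the only unannotated combination joins the first donor to the second acceptor, which would skip the exon between the two annotated introns.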

Availability and implementation: fortuna source code is available at https://github.com/canzarlab/fortuna.

Conflict of interest statement

None declared.

Figures

Figure 1.
fortuna overview. Detailed description in the main text
Figure 2.
In this example, transcripts t1 and t2 are assumed to be annotated, while t3 and t4 are novel. t1 and t2 partition exons into subexons s1, ..., s5. Red, dark blue, and spliced green reads are considered pairwise equivalent by kallisto, while Yanagi distinguishes red and dark blue reads in distinct counts. Gray reads from a novel junction are ignored by kallisto and Yanagi. DEXSeq counts reads overlapping individual subexons, while Yanagi summarizes red and spliced green reads into a single counting bin. fortuna refines counting bins to mapping signatures, such as (2,4,5) in the case of the shaded blue read
Figure 3.
Illustration of splicing event definitions. Transcripts t1, t2, t3 imply subexons s1, ..., s8. Subexons are colored according to the event type the corresponding (novel) splice junction defines. In this example, mapping signature (1,7) defines a classical exon skipping (ES) with respect to t3 and a nonclassical ES with respect to t2. (3,5) spans an alternative donor (AD) with respect to t2, (4,6) an alternative acceptor (AA) with respect to t2, (5,8) an alternative donor-acceptor pair (AP) with respect to t2 and an alternative acceptor with respect to t3, and (1,3) a novel intron in exon (IE) with respect to t2, while the subexons s6 and s7, including the intron between them, constitute a novel intron retention (IR)
Figure 4.
Precision and recall in finding novel junctions between annotated splice sites. Results of fortuna, Whippet, STAR and STAR with two-pass mode (STAR2) are shown for the simulated dataset with 75 bp reads. Reads were split into error-free reads (upper row) and reads containing mismatches (bottom row). Results are stratified by event type (columns)
Figure 5.
Running time in minutes of fortuna and competing methods on random subsamples of an ASD sample with 291 million reads
Figure 6.
fortuna detects novel, tissue-specific events in Drosophila. (A) Line plot showing the number of novel events (left) or genes containing novel events (right) in indirect flight muscle (IFM, green), leg (orange) and brain (purple) samples dissected from Drosophila at 72 h after puparium formation. Samples were evaluated at various RPM thresholds. (B) Bar plot of the percent of events (RPM 0) utilizing a novel splice acceptor (SA, cyan), a novel splice donor (SD, purple) or annotated SA/SD (yellow). (C) Venn diagram of the overlap in novel events (top, black numbers) and genes containing events (bottom, blue numbers) between IFM (left circle, purple), leg (right circle, cyan) and brain (bottom circle, yellow) at RPM 1. (D) Clustering and heatmap of event RPM for the top 100 events in all three tissues. RT-PCR on IFM confirming novel events in bruno1 (bru1) (E) and bent (bt, Titin) (F). Annotated (A) and novel (N) isoform lengths in basepairs (bp), as well as exons joined by the novel events (cyan, leftmost boxes in N), skipped exons (red, present in A but absent in N) and primers (arrows) are illustrated. The bru1 event results in a shorter 5’-UTR on the bru1-RD mRNA isoform (coding: dark gray box, UTR: light gray boxes). The event in bt produces a shorter Projectin protein isoform lacking several Fibronectin-3 (FN3, F) and Immunoglobulin (Ig) domain repeats
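
The notion of a mapping signature from the Figure 2 caption, such as (2,4,5), can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not fortuna's code: the function name `mapping_signature`, the half-open (start, end) interval convention, and all coordinates are hypothetical. A read is given as its aligned blocks, and its signature is the tuple of 1-based indices of the subexons those blocks overlap.

```python
def mapping_signature(subexons, read_blocks):
    """Return the tuple of 1-based subexon indices overlapped by a read.

    subexons:    list of half-open (start, end) intervals, sorted by position
    read_blocks: list of half-open (start, end) aligned blocks of one read
    """
    signature = []
    for index, (sub_start, sub_end) in enumerate(subexons, start=1):
        # Two half-open intervals overlap if each starts before the other ends.
        if any(blk_start < sub_end and sub_start < blk_end
               for blk_start, blk_end in read_blocks):
            signature.append(index)
    return tuple(signature)


# Five subexons s1..s5 (made-up coordinates); s2 and s3 are separated by an intron.
subexons = [(0, 50), (50, 100), (150, 200), (200, 250), (250, 300)]

# A spliced read: one block ends in s2, the other covers s4 and reaches into s5.
read = [(80, 100), (200, 260)]
print(mapping_signature(subexons, read))
```

For this toy read the sketch yields the signature (2, 4, 5). Counting reads per signature rather than per subexon keeps the information about which subexons were joined by the same read.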
