Bioinformatics. 2023 Jul 1;39(7):btad419.

doi: 10.1093/bioinformatics/btad419.

Counting pseudoalignments to novel splicing events

Luka Borozan¹, Francisca Rojas Ringeling²,³, Shao-Yen Kao⁴, Elena Nikonova⁴, Pablo Monteagudo-Mesas², Domagoj Matijević¹, Maria L Spletter⁴,⁵, Stefan Canzar²,³,⁶

Affiliations

¹ Department of Mathematics, Josip Juraj Strossmayer University of Osijek, Osijek 31000, Croatia.
² Gene Center, Ludwig-Maximilians-Universität München, Munich 81377, Germany.
³ Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, United States.
⁴ Biomedical Center, Department of Physiological Chemistry, Ludwig-Maximilians-Universität München, Planegg-Martinsried 82152, Germany.
⁵ School of Science and Engineering, Division of Biological & Biomedical Systems, University of Missouri Kansas City, Kansas City, MO 64110, United States.
⁶ Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, United States.

PMID: 37432342
PMCID: PMC10348833
DOI: 10.1093/bioinformatics/btad419

Luka Borozan et al. Bioinformatics. 2023.

Abstract

Motivation: Alternative splicing (AS) of introns from pre-mRNA produces diverse sets of transcripts across cell types and tissues, but is also dysregulated in many diseases. Alignment-free computational methods have greatly accelerated the quantification of mRNA transcripts from short RNA-seq reads, but they inherently rely on a catalog of known transcripts and might miss novel, disease-specific splicing events. By contrast, alignment of reads to the genome can effectively identify novel exonic segments and introns. Event-based methods then count how many reads align to predefined features. However, an alignment is more expensive to compute and constitutes a bottleneck in many AS analysis methods.

Results: Here, we propose fortuna, a method that guesses novel combinations of annotated splice sites to create transcript fragments. It then pseudoaligns reads to fragments using kallisto and efficiently derives counts of the most elementary splicing units from kallisto's equivalence classes. These counts can be directly used for AS analysis or summarized to larger units as used by other widely applied methods. In experiments on synthetic and real data, fortuna was around 7× faster than traditional align and count approaches, and was able to analyze almost 300 million reads in just 15 min when using four threads. It mapped reads containing mismatches more accurately across novel junctions and found more reads supporting aberrant splicing events in patients with autism spectrum disorder than existing methods. We further used fortuna to identify novel, tissue-specific splicing events in Drosophila.
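The idea of proposing novel combinations of annotated splice sites can be illustrated with a minimal sketch. It is not fortuna's implementation: the function name `candidate_novel_junctions`, the coordinate convention, and the exhaustive pairing of every donor with every downstream acceptor are assumptions made for illustration only, and the tool itself may restrict which combinations it considers.

```python
from itertools import product

def candidate_novel_junctions(annotated_introns):
    """Pair annotated donors with annotated acceptors that are not yet joined.

    annotated_introns: iterable of (donor, acceptor) positions on one strand
    of a single gene, with donor < acceptor (hypothetical convention).
    Returns the (donor, acceptor) pairs that are absent from the annotation.
    """
    known = set(annotated_introns)
    donors = sorted({d for d, _ in known})
    acceptors = sorted({a for _, a in known})
    # A junction is a candidate if the donor lies upstream of the acceptor
    # and the pair does not already occur as an annotated intron.
    return [(d, a) for d, a in product(donors, acceptors)
            if d < a and (d, a) not in known]

# Two annotated introns; joining the first donor to the second acceptor
# would skip the exon between them.
introns = [(100, 200), (300, 400)]
novel = candidate_novel_junctions(introns)
print(novel)
```

In this toy example the only unannotated donor-acceptor pair is `(100, 400)`, which corresponds to skipping the middle exon.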

Availability and implementation: fortuna source code is available at https://github.com/canzarlab/fortuna.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
fortuna overview. Detailed description in the main text

**Figure 2.**
In this example, transcripts $t_{1}$ and $t_{2}$ are assumed to be annotated, while $t_{3}$ and $t_{4}$ are novel. $t_{1}$ and $t_{2}$ partition exons into subexons $s_{1}$–$s_{5}$. Red, dark blue, and spliced green reads are considered pairwise equivalent by kallisto, while Yanagi distinguishes red and dark blue reads in distinct counts. Gray reads from novel junction are ignored in kallisto and Yanagi. DEXSeq counts reads overlapping individual subexons, while Yanagi summarizes red and spliced green reads into a single counting bin. fortuna refines counting bins to mapping signatures, such as $(2, 4, 5)$ in the case of the shaded blue read
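The notions of subexons and mapping signatures used in this caption can be made concrete with a small sketch. It is a simplification under stated assumptions: coordinates are half-open genomic intervals, a read is given as a list of aligned blocks, and the names `subexons` and `mapping_signature` are hypothetical. fortuna itself derives signatures from pseudoalignments to transcript fragments rather than from genome alignments.

```python
def subexons(transcripts):
    """Split exons into subexons at every annotated exon boundary.

    transcripts: list of transcripts, each a list of half-open (start, end) exons.
    Returns the sorted list of half-open (start, end) subexons.
    """
    exons = [e for t in transcripts for e in t]
    cuts = sorted({p for s, e in exons for p in (s, e)})
    # Keep a segment between consecutive boundaries only if an exon covers it.
    return [(a, b) for a, b in zip(cuts, cuts[1:])
            if any(s <= a and b <= e for s, e in exons)]

def mapping_signature(read_blocks, subs):
    """Return the 1-based indices of the subexons that a read overlaps."""
    return tuple(i for i, (a, b) in enumerate(subs, start=1)
                 if any(rs < b and a < re for rs, re in read_blocks))

# Two annotated transcripts that differ in the end of their second exon.
t1 = [(0, 100), (200, 300), (400, 500)]
t2 = [(0, 100), (200, 250), (400, 500)]
subs = subexons([t1, t2])

# A read spliced along an annotated junction, and one joining the first
# and last exon directly (a junction present in neither transcript).
sig_annotated = mapping_signature([(80, 100), (200, 260)], subs)
sig_novel = mapping_signature([(90, 100), (400, 450)], subs)
print(subs, sig_annotated, sig_novel)
```

Counting reads per signature rather than per subexon keeps the information about which subexons were joined by the same read, which is what allows a signature such as `(1, 4)` here to indicate a junction absent from both transcripts.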

**Figure 3.**
Illustration of splicing event definitions. Transcripts $t_{1}, t_{2}, t_{3}$ imply subexons $s_{1}, \dots, s_{8}$. Subexons are colored according to the event type the corresponding (novel) splice junction defines. In this example, mapping signature $(1, 7)$ defines a classical exon skipping (ES) with respect to $t_{3}$, and a nonclassical ES wrt $t_{2}$. $(3, 5)$ spans an alternative donor (AD) wrt $t_{2}$, $(4, 6)$ an alternative acceptor (AA) wrt $t_{2}$, $(5, 8)$ an alternative donor-acceptor pair (AP) wrt $t_{2}$ and an alternative acceptor wrt $t_{3}$, $(1, 3)$ a novel intron in exon (IE) wrt $t_{2}$, while the subexons $s_{6}$ and $s_{7}$ including the intron between them constitute a novel intron retention (IR)

**Figure 4.**
Precision and recall in finding novel junctions between annotated splice sites. Results of fortuna, Whippet, STAR and STAR with two-pass mode (STAR2) are shown for the simulated dataset with 75 bp reads. Reads were split into error-free reads (upper row) and reads containing mismatches (bottom row). Results are stratified by event type (columns)
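For readers unfamiliar with the two metrics, the following sketch computes precision and recall over sets of junctions. Treating each junction as a `(donor, acceptor)` tuple and scoring at the junction level are assumptions made here for illustration; the caption does not specify the exact unit of evaluation, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(predicted, truth):
    """Precision and recall of predicted junctions against a ground truth.

    Both arguments are iterables of hashable junction identifiers,
    here (donor, acceptor) tuples.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Toy data: two of three predictions are correct, two of four true
# junctions are recovered.
predicted = [(100, 400), (300, 600), (500, 800)]
truth = [(100, 400), (300, 600), (700, 900), (10, 20)]
precision, recall = precision_recall(predicted, truth)
print(precision, recall)
```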

**Figure 5.**
Running time in minutes of fortuna and competing methods on random subsamples of an ASD sample with 291 million reads

**Figure 6.**
fortuna detects novel, tissue-specific events in *Drosophila*. (A) Line plot showing the number of novel events (left) or genes containing novel events (right) in indirect flight muscle (IFM, green), leg (orange) and brain (purple) samples dissected from *Drosophila* at 72 h after puparium formation. Samples were evaluated at various RPM thresholds. (B) Bar plot of the percent of events (RPM $\geq$ 0) utilizing a novel splice acceptor (SA, cyan), a novel splice donor (SD, purple) or annotated SA/SD (yellow). (C) Venn diagram of the overlap in novel events (top, black numbers) and genes containing events (bottom, blue numbers) between IFM (left circle, purple), leg (right circle, cyan) and brain (bottom circle, yellow) at RPM $\geq$ 1. (D) Clustering and heatmap of event RPM for the top 100 events in all three tissues. RT-PCR on IFM confirming novel events in *bruno1* (*bru1*) (E) and *bent* (bt, Titin) (F). Annotated (A) and novel (N) isoform lengths in basepairs (bp), as well as exons joined by the novel events (cyan, left most boxes in N), skipped exons (red, present in A but absent in N) and primers (arrows) are illustrated. The *bru1* event results in a shorter 5’-UTR on the *bru1-RD* mRNA isoform (coding: dark gray box, UTR: light gray boxes). The event in bt produces a shorter Projectin protein isoform lacking several Fibronectin-3 (FN3, F) and Immunoglobulin (Ig) domain repeats
