Genome Biol. 2014 Feb 10;15(2):R34.
doi: 10.1186/gb-2014-15-2-r34.

A multi-split mapping algorithm for circular RNA, splicing, trans-splicing and fusion detection

Steve Hoffmann et al. Genome Biol. 2014.

Abstract

Numerous high-throughput sequencing studies have focused on detecting conventionally spliced mRNAs in RNA-seq data. However, non-standard RNAs arising through gene fusion, circularization or trans-splicing are often neglected. We introduce a novel, unbiased algorithm to detect splice junctions from single-end cDNA sequences. In contrast to other methods, our approach accommodates multi-junction structures. Our method compares favorably with competing tools for conventionally spliced mRNAs and, with a gain of up to 40% of recall, systematically outperforms them on reads with multiple splits, trans-splicing and circular products. The algorithm is integrated into our mapping tool segemehl (http://www.bioinf.uni-leipzig.de/Software/segemehl/).

Figures

Figure 1
Performance of various read aligners on simulated data sets with different splice events. For simulated 454 reads (400 bp), segemehl performed significantly better in detecting conventional and ‘non-conventional’ (strand-reversing, long-range) splice junctions. segemehl was the only tool that consistently recalled more than 90% of conventional splice junctions. For ‘non-conventional’ splice events, segemehl extended its lead to 40% for recall without losing precision. Likewise, compared to three of the seven alternative tools, segemehl had a 30% increase in recall for irregularly spliced Illumina reads (100 bp). Compared to TopHat2, it had a slight increase while reporting significantly fewer false positives. At the same time, segemehl’s performance with simulated, regularly spliced Illumina reads was comparable with the other seven tools tested. gs, GSNAP; ms, MapSplice; ru, RUM; se, segemehl; sm, SpliceMap; so, SOAPsplice; st, STAR; to, TopHat2.
Figure 2
Recall and precision for short circular, long circular and long collinear transcripts. (A) For this benchmark, we tested segemehl’s performance with sequence reads that were generated from the RefSeq database. To simulate sequencing errors, we applied an Illumina error model to the short circular reads (100 bp) and a 454 error model to the long circular and collinear transcripts (0.5 to 5 kb). For short circular transcripts, segemehl achieved a recall of more than 85%, outcompeting all other tools while maintaining a high precision of 98%. Using RefSeq transcripts of length 0.5 to 5 kb, segemehl achieved a recall of more than 80% for circular and linear transcripts. Among the tools that were able to handle such long transcripts, segemehl was the only tool that was able to detect the circularization. For long collinear transcripts, GSNAP was slightly better than segemehl by 6%, at the expense of a nearly twofold increase in runtime (Additional file 1: Table S1). (B) The RefSeq TTC22 transcript is an example of a simulated circularization. The arrow indicates where the transcript has been artificially circularized. SpliceMap, RUM and STAR did not find any circular junctions (not shown). STAR and GSNAP were the only tools able to handle long reads. gs, GSNAP; ms, MapSplice; se, segemehl; so, SOAPsplice; st, STAR; to, TopHat2.
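The recall and precision values quoted in these benchmarks follow the usual definitions over sets of splice junctions. The following is a minimal sketch, assuming each junction is represented by a hypothetical tuple of donor and acceptor coordinates and that a prediction counts only if it matches a simulated junction exactly. The tuple layout and the function name are illustrative and are not taken from segemehl or from the benchmark scripts.

```python
def junction_recall_precision(predicted, truth):
    """Return (recall, precision) for predicted vs. simulated junctions.

    Each junction is a hashable tuple; the layout used here,
    (donor_chrom, donor_pos, acceptor_chrom, acceptor_pos), is an
    assumption made for illustration only.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Recall: fraction of simulated junctions that were recovered.
    recall = true_positives / len(truth) if truth else 0.0
    # Precision: fraction of reported junctions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision
```

Whether a benchmark tolerates small offsets around the true junction position is a design choice; this sketch requires exact matches.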
Figure 3
Examples of (re-)discovered splicing events from single-end split reads. (A) For Drosophila melanogaster, segemehl recovered three different previously described splice junctions linking the minus encoded exon three of MODMDG4 on chromosome 3R to exons on the opposite strand. The strand-reversing splice junctions are annotated between the plus and minus strands. The direction of the strand-reversing splice junctions, i.e. from the minus to the plus strand, was inferred from annotation and prior knowledge. This was necessary because the RNA-seq library used was not strand specific. (B) For the human melanoma transcriptome data set, segemehl identified a very large number of strand-reversing splice junctions in the premelanosome protein (PMEL) gene locus. The split reads that support these junctions split from the plus strand to the minus strand and vice versa. Since we lack additional information, a direction for these junctions cannot be given. Only a selection of strand-reversing PMEL junctions is shown here. (C) For the same data set, segemehl found two alternative transcripts linking CDK2 and RAB5B encoded on human chromosome 12. These junctions (dashed lines) are supported by split reads whose fragments map to the same strand, i.e. split reads that were not strand-reversing. Since the junctions exactly hit the annotated borders of the CDK2 and RAB5B exons, we assigned them to the minus strand. chr, chromosome; PMEL, premelanosome protein.
Figure 4
Novel and known spliced transcript isoforms identified with long single-end 454 RNA-seq split reads. (A) Transcript isoforms of the p53 gene. In addition to previously reported isoforms, (i) to (iv) [24], we identified three novel canonically spliced isoforms, (v) to (vii). Consistent with [24], the β and γ isoforms were not expressed here. Each splice junction is labeled with its read support, i.e. the number of reads that map across this junction. For better comparability with [24], the p53 gene, encoded on the minus strand of chromosome 17, is shown in the direction of transcription from left to right. The junctions marked with an asterisk have been experimentally validated. (B) Unannotated transcripts in the vicinity of the TACSTD2 and MYSM1 genes recovered from a HUVEC RNA data set [27]. segemehl revealed the exon structure of two novel transcript isoforms comprising at least four exons. One exon common to both isoforms was mapped to the TACSTD2 gene. The associated introns enclose the MYSM1 gene locus. The putative gene structure is supported by three exemplary multi-split reads (not strand-reversing). Some of the splice junctions have already been reported by ENCODE/CSHL (HUVEC polyA+ RNA-seq). The strandedness of the isoforms cannot be inferred.
Figure 5
A chain of seeds guides a local transition alignment across multiple genomic loci. High-quality seeds mapping to different genomic loci, strands or chromosomes (A) are chained. Subsequently, the order of the seeds within the chain guides a walk through the alignment cube (B). For each genomic locus, a local alignment with the read is performed. In addition to the regular Smith-Waterman recursions, the local transition alignment allows crossing between different reference loci.
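The chaining step in panel (A) can be illustrated with a small dynamic program. The sketch below is not segemehl's implementation: the Seed fields, the additive scoring scheme and the jump_penalty parameter are assumptions made for illustration. It selects the best-scoring set of seeds that do not overlap on the read, and charges a fixed penalty whenever two consecutive seeds change chromosome or strand, or run backwards on the reference (as a circular junction would).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seed:
    """A hypothetical seed: a read interval matched to a reference locus."""
    read_start: int   # inclusive position on the read
    read_end: int     # exclusive position on the read
    chrom: str
    ref_start: int
    strand: str       # "+" or "-"
    score: float


def is_collinear(a, b):
    """True if b continues a on the same chromosome and strand, in order."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return False
    if a.strand == "+":
        return b.ref_start >= a.ref_start
    return b.ref_start <= a.ref_start


def chain_seeds(seeds, jump_penalty=1.0):
    """Return the best-scoring chain of seeds, ordered along the read."""
    order = sorted(seeds, key=lambda s: (s.read_start, s.read_end))
    if not order:
        return []
    best = [s.score for s in order]   # best chain score ending at each seed
    prev = [-1] * len(order)          # predecessor index for traceback
    for j, b in enumerate(order):
        for i in range(j):
            a = order[i]
            if a.read_end > b.read_start:
                continue              # seeds overlap on the read
            penalty = 0.0 if is_collinear(a, b) else jump_penalty
            candidate = best[i] + b.score - penalty
            if candidate > best[j]:
                best[j] = candidate
                prev[j] = i
    # Trace back from the highest-scoring chain end.
    j = max(range(len(order)), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(order[j])
        j = prev[j]
    chain.reverse()
    return chain
```

In the method described in the caption, the order of the chained seeds would then guide the local transition alignment (panel B); that alignment step is not sketched here.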

References

    1. Dorn R, Reuter G, Loewendorf A. Transgene analysis proves mRNA trans-splicing at the complex mod(mdg4) locus in Drosophila. Proc Natl Acad Sci USA. 2001;98:9724–9729. doi: 10.1073/pnas.151268698.
    2. Frenkel-Morgenstern M, Lacroix V, Ezkurdia I, Levin Y, Gabashvili A, Prilusky J, del Pozo A, Tress M, Johnson R, Guigó R, Valencia A. Chimeras taking shape: potential functions of proteins encoded by chimeric RNA transcripts. Genome Res. 2012;22:1231–1242. doi: 10.1101/gr.130062.111.
    3. Jeck WR, Sorrentino JA, Wang K, Slevin MK, Burd CE, Liu J, Marzluff WF, Sharpless NE. Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA. 2013;19:141–157. doi: 10.1261/rna.035667.112.
    4. Salzman J, Gawad C, Wang PL, Lacayo N, Brown PO. Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS ONE. 2012;7:e30733. doi: 10.1371/journal.pone.0030733.
    5. Memczak S, Jens M, Elefsinioti A, Torti F, Krueger J, Rybak A, Maier L, Mackowiak SD, Gregersen LH, Munschauer M, Loewer A, Ziebold U, Landthaler M, Kocks C, le Noble F, Rajewsky N. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. 2013;495:333–338. doi: 10.1038/nature11928.
