Gigascience. 2018 Dec 1;7(12):giy131.
doi: 10.1093/gigascience/giy131.

Efficient and accurate detection of splice junctions from RNA-seq with Portcullis

Daniel Mapleson et al. Gigascience. 2018.

Abstract

Next-generation sequencing technologies enable rapid and cheap genome-wide transcriptome analysis, providing vital information about gene structure, transcript expression, and alternative splicing. Key to this is the accurate identification of exon-exon junctions from RNA sequenced (RNA-seq) reads. A number of RNA-seq aligners capable of splitting reads across these splice junctions (SJs) have been developed; however, it has been shown that while they correctly identify most genuine SJs available in a given sample, they also often produce large numbers of incorrect SJs. Here, we describe the extent of this problem using popular RNA-seq mapping tools and present a new method, called Portcullis, to rapidly filter false SJs derived from spliced alignments. We show that Portcullis distinguishes between genuine and false-positive junctions to a high degree of accuracy across different species, samples, expression levels, error profiles, and read lengths. Portcullis is portable, efficient, and, to our knowledge, currently the only SJ prediction tool that reliably scales for use with large RNA-seq datasets and large, highly fragmented genomes, while delivering accurate SJs.

Figures

Figure 1
Splice junction accuracy of STAR v2.6.0a across variations of our simulated Human dataset. (A) Scatter plot showing the effect of varying dataset size, with all datasets containing 201 bp reads. The 1.0X depth multiplier represents a dataset of ∼78 million read pairs. (B) Scatter plot showing the effect of varying read length with all datasets containing ∼30 billion base pairs.
Figure 2
Splice junction detection performance across mappers for 76 bp simulated paired reads. The Human dataset (A) contains 421,020,756 reads across 19,853 transcripts. The Arabidopsis dataset (B) contains 148,207,902 reads across 19,723 transcripts. The Drosophila dataset (C) contains 202,246,654 reads across 9,376 transcripts.
Figure 3
A five-way Venn diagram showing levels of agreement between mapping tools and the Human junction truth set with 76 bp simulated reads.
Figure 4
A scatter plot showing recall and precision results of all methods on our 101 bp ∼76 million simulated Human read dataset. Diagonal lines represent actual F1 score gradients. Arrows show the effect of processing BAM files by downstream junction filtering tools such as Portcullis or FineSplice. The purple TopHat2 entry shows the effect of TopHat2’s own rule-based filtering on the BAM file.
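
The diagonal lines in a plot like this are contours of constant F1, the harmonic mean of precision and recall. A minimal Python sketch of how the three scores could be computed for a set of predicted junctions follows. It assumes junctions are compared by exact identity of a (chromosome, start, end) tuple; the function name and the coordinates are illustrative and are not taken from the paper or from Portcullis.

```python
def junction_metrics(predicted, truth):
    """Return (precision, recall, F1) for predicted junctions against a truth set.

    Junctions are hashable tuples such as (chrom, start, end). Exact-match
    comparison is an assumption of this sketch.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: six true junctions, four predictions, three of them correct.
truth = {("chr1", s, s + 100) for s in (100, 500, 900, 1300, 1700, 2100)}
predicted = {("chr1", 100, 200), ("chr1", 500, 600), ("chr1", 900, 1000),
             ("chr1", 5000, 5100)}
precision, recall, f1 = junction_metrics(predicted, truth)
```

A filtering step moves a point on the plot by removing predictions: precision rises if mostly false junctions are removed, while recall can only stay the same or fall.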
Figure 5
Runtimes and max memory usage of all methods on our 101 bp ∼76 million simulated Human read dataset using eight threads where appropriate. For FineSplice and Portcullis, times for alignment, sorting, and indexing are factored into the results. For memory usage, we consider alignment and filtering stages only.
Figure 6
In this plot, we check junctions found via each method against the reference annotation for Human (251 bp reads), Arabidopsis (151 bp reads), and Drosophila (101 bp reads), respectively. The results are categorized into the following classes: intron match; both splice sites found; one splice site found; and no splice sites found. SOAPsplice did not finish for the Human dataset, and TrueSight failed to finish successfully on all datasets due to memory demands.
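
The four classes can be illustrated with a short Python sketch. It assumes the reference annotation has been reduced to a set of intron tuples (chromosome, start, end) and ignores strand; the function names are hypothetical and this is not how the paper's comparison was implemented.

```python
def build_reference_index(ref_introns):
    """Precompute the sets needed to classify junctions against a reference."""
    introns = set(ref_introns)
    starts = {(chrom, start) for chrom, start, _ in introns}
    ends = {(chrom, end) for chrom, _, end in introns}
    return introns, starts, ends


def classify_junction(junction, index):
    """Assign a junction to one of the four classes used in the plot."""
    introns, starts, ends = index
    chrom, start, end = junction
    if junction in introns:
        return "intron match"
    # Count how many of the two splice sites appear anywhere in the reference.
    known_sites = ((chrom, start) in starts) + ((chrom, end) in ends)
    if known_sites == 2:
        return "both splice sites found"
    if known_sites == 1:
        return "one splice site found"
    return "no splice sites found"


# Hypothetical reference with two annotated introns on one chromosome.
index = build_reference_index([("chr1", 100, 200), ("chr1", 300, 400)])
labels = [
    classify_junction(("chr1", 100, 200), index),  # annotated intron
    classify_junction(("chr1", 100, 400), index),  # known sites, novel pairing
    classify_junction(("chr1", 100, 250), index),  # only the start is known
    classify_junction(("chr1", 150, 250), index),  # neither site is known
]
```

The "both splice sites found" class covers junctions that join two annotated splice sites in a combination the annotation does not contain, such as a skipped exon.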
Figure 7
Junction counts that are supported by one through six samples for wheat RNA-seq data. Solid tint indicates that junctions were found in the reference annotation; paler tint indicates junctions were not found in the reference. Junctions that occur in all six samples are more likely to be found in the reference. Average expression per junction per sample is shown by the lines and indicates that junctions found in all samples have high expression.
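
Counting how many samples support each junction is a simple tally. The Python sketch below assumes each sample is given as a collection of junction tuples; the names and data are illustrative only.

```python
from collections import Counter


def support_histogram(samples):
    """Map 'number of supporting samples' to 'number of junctions'.

    `samples` is an iterable of per-sample junction collections. A junction
    counts at most once per sample.
    """
    support = Counter()
    for junctions in samples:
        support.update(set(junctions))
    return Counter(support.values())


# Hypothetical junction identifiers from three samples.
samples = [{"a", "b"}, {"a"}, {"a", "c"}]
histogram = support_histogram(samples)
```

Here junction "a" is seen in all three samples and "b" and "c" in one each, so the histogram has one junction at support 3 and two at support 1.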
Figure 8
A high-level view of the Portcullis pipeline. Input to Portcullis is a genome in FastA format and one or more BAM files created by an upstream RNA-seq mapping tool. The first stage ensures the alignments are correctly merged, sorted, and indexed, then all junctions found in the input are analyzed and output to disk. Next, the full set of junctions is filtered to remove likely false positives and also output to disk. The user can choose to either run the full pipeline in one go or at each stage separately.
Figure 9
Calculating the Hamming distance between the right-most region of the left anchor and the right-most region of the intron, and between the left-most region of the intron and the left-most region of the right anchor, can indicate whether the splice site may have been incorrectly triggered by a repeat region in the genome.
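
The two comparisons can be sketched in Python as follows. This assumes the genome is available as a string, the intron is given as a 0-based half-open interval [start, end), and a fixed window of `k` bases is compared on each side. The window size, coordinate convention, and function names are assumptions of this sketch, not details confirmed by the caption.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def anchor_intron_distances(genome, start, end, k):
    """Return (left anchor vs intron end, intron start vs right anchor).

    The intron occupies genome[start:end]. A distance near zero means the
    anchor sequence is repeated just inside the intron, so an unspliced
    placement of the read would fit almost as well as the spliced one.
    """
    if start - k < 0 or end + k > len(genome) or end - start < k:
        raise ValueError("window does not fit around this junction")
    left_anchor_tail = genome[start - k:start]
    intron_tail = genome[end - k:end]
    intron_head = genome[start:start + k]
    right_anchor_head = genome[end:end + k]
    return (hamming(left_anchor_tail, intron_tail),
            hamming(intron_head, right_anchor_head))


# Toy sequence: the last 4 bases of the left exon ("ACGT") recur at the
# end of the intron, mimicking a short repeat.
genome = "TTACGT" + "GTCCCCACGT" + "GGAATT"
distances = anchor_intron_distances(genome, start=6, end=16, k=4)
```

In this toy case the first distance is 0, because the end of the left anchor is identical to the end of the intron, which is the kind of signal the caption describes as a possible repeat-induced junction.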
Figure 10
An exploded view of the Portcullis filtering stage. Input is a set of junctions to filter in tab format. This pipeline first creates a model from a high confidence set of likely genuine and likely false junctions. The model is then applied to the full set of junctions and output in tab and bed format.