Bioinformatics. 2010 Jun 15;26(12):1500-5.
doi: 10.1093/bioinformatics/btq206. Epub 2010 Apr 21.

Supersplat--spliced RNA-seq alignment

Douglas W Bryant Jr et al. Bioinformatics. 2010.

Abstract

Motivation: High-throughput sequencing technologies have recently made deep interrogation of expressed transcript sequences practical, both economically and temporally. Identification of intron/exon boundaries is an essential part of genome annotation, yet remains a challenge. Here, we present supersplat, a method for unbiased splice-junction discovery through empirical RNA-seq data.

Results: Using a genomic reference and RNA-seq high-throughput sequencing datasets, supersplat empirically identifies potential splice junctions at a rate of approximately 11.4 million reads per hour. We further benchmark the performance of the algorithm by mapping Illumina RNA-seq reads to identify introns in the genome of the reference dicot plant Arabidopsis thaliana and we demonstrate the utility of supersplat for de novo empirical annotation of splice junctions using the reference monocot plant Brachypodium distachyon.

Availability: Implemented in C++, supersplat source code and binaries are freely available on the web at http://mocklerlab-tools.cgrb.oregonstate.edu/.

Figures

Fig. 1.
Supersplat indexes a reference by starting at the first base in the reference sequence and stepping through the sequence, one base at a time. At each step, b, supersplat stores every k-mer that begins at position b, where k ranges from the minimum read chunk size, c, to the maximum index chunk size (MICS), i, both of which are specified by the user. In this figure's example, c is 6 and i is 11. Supersplat starts building the index at the beginning of the reference, location 1, by storing the first six bases as a 6mer and associating that 6mer with a list of locations, which at this point contains only location 1. Supersplat then stores the first seven bases of the reference as a 7mer and associates that 7mer with a list of locations containing location 1. This continues until supersplat stores the first 11 bases of the reference as an 11mer and associates that 11mer with a list of locations containing location 1. Having reached k = i = 11, supersplat steps to the next base of the reference sequence, location 2. It now stores the six bases starting at reference location 2 as a 6mer and associates that 6mer with a list of locations containing location 2. This process repeats until supersplat has indexed the entire reference sequence.
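The indexing scheme in this caption can be sketched in a few lines of Python. This is an illustrative sketch only, not supersplat's C++ implementation: the function name build_index, the dictionary-of-lists layout, and the toy reference string are assumptions, and the caption does not say how k-mers that would run past the end of the reference are handled, so the sketch simply skips them.

```python
from collections import defaultdict


def build_index(reference: str, c: int, i: int) -> dict[str, list[int]]:
    """Map every k-mer with c <= k <= i to the 1-based locations where it starts.

    c is the minimum read chunk size and i is the maximum index chunk size,
    as in the Fig. 1 caption. Names and data layout are hypothetical.
    """
    index: dict[str, list[int]] = defaultdict(list)
    n = len(reference)
    for offset in range(n):              # offset 0 corresponds to location 1
        for k in range(c, i + 1):
            if offset + k > n:           # assumption: skip k-mers past the end
                break
            index[reference[offset:offset + k]].append(offset + 1)
    return dict(index)


# Toy reference (hypothetical), using the caption's example values c = 6, i = 11.
reference = "ACGTACGTACGTAC"
index = build_index(reference, c=6, i=11)
print(index["ACGTAC"])       # -> [1, 5, 9]
print(index["ACGTACGTACG"])  # -> [1]
```

Storing every k-mer length between c and i trades memory for lookup speed, which is consistent with the runtime and RAM trade-off described for Fig. 2.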
Fig. 2.
Increasing the maximum index size reduces the number of exhaustive genome-to-reads comparisons, resulting in shorter runtimes. The same increase raises peak RAM usage because the lookup tables grow larger.
Fig. 3.
A Venn diagram showing the comparison of supersplat predicted Brachypodium GT-AG introns against BradiV1.0 annotated GT-AG introns verified by Brachypodium ESTs. The 67 025 Brachypodium GT-AG introns (set SS) predicted by supersplat were supported by 1.55 million RNA-seq reads. The 74 786 BradiV1.0 annotated GT-AG introns (set ESTs) were verified by alignment of 2.29 million 454 reads and 128 000 Sanger reads. The 3695 introns in set HM are supersplat false negative introns that were missed by supersplat due to the minimum chunk size of 6 used in this analysis but verified as being supported by the RNA-seq data using HashMatch (Filichkin et al., 2010).
Fig. 4.
An example of filtered supersplat output displayed in GBrowse v1.69 at BrachyBase (http://www.brachybase.org). The ‘Illumina 32mer perfect match’ track represents the distribution of perfectly matching 32 nt Illumina HTS RNA-seq reads over the region. ‘HTS SuperSplat Splice Junctions’ are Illumina reads aligned using supersplat specifically to identify putative introns. The ‘TAU v1.1’ track depicts empirical transcription unit models derived from transcript data, including the splice junctions predicted by supersplat.
Fig. 5.
PPV versus minimum chunk size. As minimum chunk size is varied from 6 to 15 the precision of supersplat rapidly approaches and exceeds 90%. Here, the PPV denominator, TP + FP, ranges over 360 237 (minimum chunk size of 6) to 260 495 (minimum chunk size of 15).
Fig. 6.
PPV versus number of reads overlapping each splice junction. As the number of overlapping reads is varied from 1 to 21, the precision of supersplat rapidly approaches and exceeds 90%, reaching 97% with 21 overlapping reads. Here, the PPV denominator, TP + FP, ranges over 244 782 (single read) to 124 219 (21 overlapping reads).
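The PPV reported in Figs. 5 and 6 is the positive predictive value, TP / (TP + FP). The sketch below shows how that ratio could be computed while filtering junctions by a minimum number of overlapping reads. The function names and the toy junction list are hypothetical and are not data from the paper; they only illustrate why a stricter threshold shrinks the denominator and tends to raise precision.

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value: TP / (TP + FP)."""
    denominator = true_positives + false_positives
    if denominator == 0:
        raise ValueError("PPV is undefined when TP + FP is zero")
    return true_positives / denominator


def ppv_at_min_reads(junctions: list[tuple[int, bool]], min_reads: int) -> float:
    """PPV over the junctions supported by at least min_reads overlapping reads."""
    kept = [is_true for reads, is_true in junctions if reads >= min_reads]
    tp = sum(kept)
    return ppv(tp, len(kept) - tp)


# Hypothetical toy data: (overlapping reads, junction confirmed by annotation?)
junctions = [(1, False), (1, True), (2, False), (3, True), (5, True), (21, True)]

for threshold in (1, 3):
    print(threshold, round(ppv_at_min_reads(junctions, threshold), 3))
```

In this toy set, raising the threshold from 1 to 3 overlapping reads drops the two unconfirmed junctions and one confirmed junction, so TP + FP falls from 6 to 3 while the PPV rises.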

References

    1. Denoeud F, et al. Annotating genomes with massive-scale RNA sequencing. Genome Biol. 2008;9:R175.
    2. De Bona F, et al. Optimal spliced alignments of short sequence reads. BMC Bioinformatics. 2008;24:i174.
    3. Filichkin SA, et al. Genome wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res. 2010;20:45–58.
    4. Fox S, et al. Applications of ultra high throughput sequencing in plants. Plant Syst. Biol. 2009;553:79–108.
    5. Morgulis A, et al. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 2006;13:1028–1040.
