Supersplat--spliced RNA-seq alignment

Douglas W Bryant Jr¹, Rongkun Shen, Henry D Priest, Weng-Keen Wong, Todd C Mockler

PMID: 20410051
PMCID: PMC2881391
DOI: 10.1093/bioinformatics/btq206

Bioinformatics. 2010 Jun 15;26(12):1500-5. doi: 10.1093/bioinformatics/btq206. Epub 2010 Apr 21.

Affiliation

¹ Department of Botany and Plant Pathology and Center for Genome Research and Biocomputing, Oregon State University, Corvallis, OR 97331, USA.

Abstract

Motivation: High-throughput sequencing technologies have recently made deep interrogation of expressed transcript sequences practical, both economically and temporally. Identification of intron/exon boundaries is an essential part of genome annotation, yet remains a challenge. Here, we present supersplat, a method for unbiased splice-junction discovery through empirical RNA-seq data.

Results: Using a genomic reference and RNA-seq high-throughput sequencing datasets, supersplat empirically identifies potential splice junctions at a rate of approximately 11.4 million reads per hour. We further benchmark the performance of the algorithm by mapping Illumina RNA-seq reads to identify introns in the genome of the reference dicot plant *Arabidopsis thaliana*, and we demonstrate the utility of supersplat for de novo empirical annotation of splice junctions using the reference monocot plant *Brachypodium distachyon*.

Availability: Implemented in C++, supersplat source code and binaries are freely available on the web at http://mocklerlab-tools.cgrb.oregonstate.edu/.

Figures

**Fig. 1.**
Supersplat indexes a reference by starting at the first base in the reference sequence and stepping through the sequence one base at a time. At each position, b, supersplat stores every k-mer that begins at b, where k ranges from the minimum read chunk size, c, to the maximum index chunk size (MICS), i, both of which are specified by the user. In this figure's example, c is 6 and i is 11. Supersplat starts building the index by storing the first six bases of the reference, beginning at location 1, as a 6mer in the index, and associates that 6mer with a list of locations, which at this point contains only location 1. Supersplat then stores the first seven bases of the reference as a 7mer in the index, and associates that 7mer with a list of locations containing location 1. This continues until supersplat stores the first 11 bases of the reference as an 11mer, and associates that 11mer with a list of locations containing location 1. Having reached k = i = 11, supersplat steps to the next base of the reference sequence, location 2. Supersplat now stores the six bases starting at reference location 2 as a 6mer in the index, and associates that 6mer with a list of locations containing location 2. This process repeats until supersplat has indexed the entire reference sequence in this way.
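The indexing scheme described in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not supersplat's actual C++ implementation: the function name `build_index`, the use of a dictionary of lists, and the choice to skip k-mers that would run past the end of the reference are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(reference: str, c: int, i: int) -> Dict[str, List[int]]:
    """Map every k-mer with c <= k <= i to the 1-based locations where it starts.

    c plays the role of the minimum chunk size and i the maximum indexed
    k-mer length from the Fig. 1 caption. Names and data layout are
    hypothetical, chosen only for illustration.
    """
    index = defaultdict(list)
    n = len(reference)
    for b in range(n):                 # step through the reference one base at a time
        for k in range(c, i + 1):      # store each k-mer beginning at this base
            if b + k > n:              # assumption: drop k-mers that overrun the end
                break
            index[reference[b:b + k]].append(b + 1)  # record the 1-based location
    return dict(index)


# Toy reference sequence, invented for this example.
index = build_index("ACGTACGTACGTAC", c=6, i=11)
print(index["ACGTAC"])       # every location where this 6mer begins
print(index["ACGTACGTACG"])  # the 11mer starting at location 1
```

With c = 6 and i = 11, as in the figure, each reference position contributes up to six entries (one per k), which is consistent with the memory growth noted for larger index sizes in Fig. 2.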

**Fig. 2.**
Increasing the maximum index size reduces the number of exhaustive genome-to-reads comparisons, resulting in shorter runtimes. The same increase raises peak RAM usage because the lookup tables become larger.

**Fig. 3.**
A Venn diagram showing the comparison of supersplat predicted *Brachypodium* GT-AG introns against BradiV1.0 annotated GT-AG introns verified by *Brachypodium* ESTs. The 67 025 *Brachypodium* GT-AG introns (set SS) predicted by supersplat were supported by 1.55 million RNA-seq reads. The 74 786 BradiV1.0 annotated GT-AG introns (set ESTs) were verified by alignment of 2.29 million 454 reads and 128 000 Sanger reads. The 3695 introns in set HM are supersplat false negative introns that were missed by supersplat due to the minimum chunk size of 6 used in this analysis but verified as being supported by the RNA-seq data using HashMatch (Filichkin *et al.*, 2010).

**Fig. 4.**
An example of filtered supersplat output displayed in GBrowse v1.69 at BrachyBase (http://www.brachybase.org). The ‘Illumina 32mer perfect match’ track represents the distribution of perfectly matching 32 nt Illumina HTS RNA-seq reads over the region. ‘HTS SuperSplat Splice Junctions’ are Illumina reads aligned using supersplat specifically to identify putative introns. The ‘TAU v1.1’ track depicts empirical transcription unit models derived from transcript data, including the splice junctions predicted by supersplat.

**Fig. 5.**
PPV versus minimum chunk size. As minimum chunk size is varied from 6 to 15 the precision of supersplat rapidly approaches and exceeds 90%. Here, the PPV denominator, TP + FP, ranges over 360 237 (minimum chunk size of 6) to 260 495 (minimum chunk size of 15).

**Fig. 6.**
PPV versus number of reads overlapping each splice junction. As the number of overlapping reads is varied from 1 to 21, the precision of supersplat rapidly approaches and exceeds 90%, reaching 97% with 21 overlapping reads. Here, the PPV denominator, TP + FP, ranges over 244 782 (single read) to 124 219 (21 overlapping reads).
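For reference, the positive predictive value (PPV) plotted in Figs. 5 and 6 is TP / (TP + FP), where TP and FP are the counts of true-positive and false-positive predictions. The short sketch below shows the calculation; the function name `ppv` and the counts passed to it are hypothetical and are not taken from the paper.

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value (precision): TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("PPV is undefined when TP + FP is zero")
    return true_positives / total


# Hypothetical counts for illustration only, not results from the study.
print(ppv(90, 10))
```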


References

1. Denoeud F, et al. Annotating genomes with massive-scale RNA sequencing. Genome Biol. 2008;9:R175.
2. De Bona F, et al. Optimal spliced alignments of short sequence reads. Bioinformatics. 2008;24:i174.
3. Filichkin SA, et al. Genome wide mapping of alternative splicing in Arabidopsis thaliana. Genome Res. 2010;20:45–58.
4. Fox S, et al. Applications of ultra high throughput sequencing in plants. Plant Syst. Biol. 2009;553:79–108.
5. Morgulis A, et al. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J. Comput. Biol. 2006;13:1028–1040.
