Nucleic Acids Res. 2010 Oct;38(18):e178.
doi: 10.1093/nar/gkq622. Epub 2010 Aug 27.

MapSplice: accurate mapping of RNA-seq reads for splice junction discovery

Kai Wang et al. Nucleic Acids Res. 2010 Oct.

Abstract

The accurate mapping of reads that span splice junctions is a critical component of all analytic techniques that work with RNA-seq data. We introduce a second generation splice detection algorithm, MapSplice, whose focus is high sensitivity and specificity in the detection of splices as well as CPU and memory efficiency. MapSplice can be applied to both short (<75 bp) and long (≥75 bp) reads. MapSplice is not dependent on splice site features or intron length; consequently, it can detect novel canonical as well as non-canonical splices. MapSplice leverages the quality and diversity of read alignments of a given splice to increase accuracy. We demonstrate that MapSplice achieves higher sensitivity and specificity than TopHat and SpliceMap on a set of simulated RNA-seq data. Experimental studies also support the accuracy of the algorithm. Splice junctions derived from eight breast cancer RNA-seq datasets recapitulated the extensiveness of alternative splicing on a global level as well as the differences between molecular subtypes of breast cancer. These combined results indicate that MapSplice is a highly accurate algorithm for the alignment of RNA-seq reads to splice junctions. Software download URL: http://www.netlab.uky.edu/p/bioinfo/MapSplice.

Figures

Figure 1.
An overview of the MapSplice pipeline. The algorithm contains two phases: tag alignment (Step 1–Step 4) and splice inference (Step 5–Step 6). In the ‘tag alignment’ phase, candidate alignments of the mRNA tags to the reference genome are determined. In the ‘splice inference’ phase, splice junctions that appear in one or more tag alignments are analyzed to determine a splice significance score based on the quality and diversity of alignments that include the splice. Ambiguous candidate alignments are resolved by selecting the alignment with the overall highest quality match and highest confidence splice junctions.
Figure 2.
A portion of an mRNA transcript sampled by tag [formula] consists of the 3′ end of exon 1, all of exon 2 and the 5′ end of exon 3. [formula] is split into segments t1,…, tn each of length [formula] to identify the alignment of [formula] to the genome. Provided no exon has a length less than [formula] nucleotides, at least one of every two consecutive segments must have an exonic alignment. In this example with [formula], segments t1 and t3 have exonic alignment. Segment t2 has spliced alignment; the splice junction [formula] can be easily discovered using the double-anchor search method starting from t1 and t3. The spliced alignment for t4 is discovered by searching downstream in the genome for an occurrence of the suffix [formula]-mer of t4. When such an occurrence is found, the double-anchor search method is used to evaluate a possible splice junction [formula] between [formula] and the [formula]-mer occurrence.
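The segmentation and double-anchor idea in this caption can be illustrated with a minimal sketch. It is not MapSplice's implementation: the function names `split_tag` and `double_anchor_split`, the segment length `k`, the toy genome and the restriction to exact matching are all assumptions made for illustration.

```python
def split_tag(tag: str, k: int) -> list[str]:
    """Split a tag into consecutive segments of length k (the last may be shorter)."""
    return [tag[i:i + k] for i in range(0, len(tag), k)]


def double_anchor_split(genome: str, left_end: int, right_start: int,
                        segment: str) -> list[tuple[int, int]]:
    """Toy double-anchor search with exact matching (an assumption).

    left_end    : genome index just past the exonic alignment of the previous segment
    right_start : genome index where the exonic alignment of the next segment begins
    Returns every (donor, acceptor) pair such that segment[:p] matches the genome
    immediately after the left anchor and segment[p:] matches the genome
    immediately before the right anchor.
    """
    m = len(segment)
    hits = []
    for p in range(m + 1):
        donor = left_end + p
        acceptor = right_start - (m - p)
        if acceptor < donor:
            continue  # the two halves would overlap on the genome
        if (genome[left_end:donor] == segment[:p]
                and genome[acceptor:right_start] == segment[p:]):
            hits.append((donor, acceptor))
    return hits


# Toy genome: exon (0-9), intron (10-19), exon (20-29).
genome = "AAAACCCCGA" "GTATATATAG" "TTTTCCAAGG"
tag = "AACCCCGATTTTCCA"          # sampled from the spliced transcript
t1, t2, t3 = split_tag(tag, 5)   # t1 and t3 are exonic, t2 spans the junction
left_end = genome.find(t1) + len(t1)
right_start = genome.find(t3)
print(double_anchor_split(genome, left_end, right_start, t2))  # [(10, 20)]
```

More than one split point can satisfy both anchors when the bases flanking the intron repeat, so the function returns a list rather than a single junction.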
Figure 3.
ROC curves for junction classification. A synthetic data set of 20M 100 bp tags was generated from transcripts selected from the ASTD database. 10K true-positive junctions and 10K false-positive junctions were selected as training data sets. Five different metrics were evaluated. They include (i) alignment quality; (ii) anchor significance; (iii) entropy; (iv) coverage; and (v) combination of metrics (i–iii). The red cross in each curve marks the point with best balance of sensitivity and specificity.
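Of the metrics listed in this caption, entropy is the one aimed at the diversity of alignments. One plausible formulation, assumed here rather than taken from the paper, is the Shannon entropy of the offsets at which supporting tags cross a junction. The function name `position_entropy` and its input format are hypothetical.

```python
import math
from collections import Counter


def position_entropy(offsets: list[int]) -> float:
    """Shannon entropy (in bits) of the offsets at which tags cross a junction.

    Many tags piled up at one offset give 0 bits; tags spread evenly over
    many offsets give a higher value.
    """
    n = len(offsets)
    if n == 0:
        return 0.0
    counts = Counter(offsets)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(position_entropy([5, 5, 5, 5]))   # all tags at the same offset
print(position_entropy([1, 2, 3, 4]))   # tags spread over four offsets
```

Under this assumption, a junction supported only by identical alignments scores low even when its raw coverage is high. A diversity metric is meant to capture exactly that distinction.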
Figure 4.
Sensitivity and specificity of splice inference in synthetic data sets with different characteristics. The sensitivity is the fraction of true junctions discovered among the true junctions sampled in the synthetic data. The specificity is the fraction of true junctions within the reported junctions. Since the depth of sampling is essential for the junction to be discovered, we plot the sensitivity and specificity as a function of the coverage threshold. (A) and (B) The sensitivity and specificity for perfect tags and tags seeded with sequencing errors. (C) and (D) Sensitivity and specificity compared at different tag lengths (50 bp, 75 bp and 100 bp). (E) and (F) Sensitivity and specificity compared at two different depths of sampling (10M and 20M tags, respectively).
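The two definitions in this caption can be written out as a short sketch. Note that specificity as defined here is the fraction of reported junctions that are true. How the coverage threshold is applied is an assumption: the sketch filters reported junctions by their number of supporting tags. The function name `junction_metrics` and the data layout are hypothetical.

```python
def junction_metrics(true_junctions: set, reported: dict,
                     min_coverage: int = 1) -> tuple[float, float]:
    """Return (sensitivity, specificity) using the caption's definitions.

    true_junctions : junctions sampled in the synthetic data
    reported       : mapping from reported junction to number of supporting tags
    min_coverage   : reported junctions below this support are ignored (assumption)
    """
    kept = {j for j, cov in reported.items() if cov >= min_coverage}
    found = kept & true_junctions
    sensitivity = len(found) / len(true_junctions) if true_junctions else 0.0
    specificity = len(found) / len(kept) if kept else 0.0
    return sensitivity, specificity


# Junctions are (donor, acceptor) coordinate pairs in this toy example.
truth = {(100, 200), (300, 400), (500, 600), (700, 800)}
calls = {(100, 200): 5, (300, 400): 1, (500, 600): 3, (900, 950): 1}
print(junction_metrics(truth, calls, min_coverage=1))  # (0.75, 0.75)
print(junction_metrics(truth, calls, min_coverage=2))  # (0.5, 1.0)
```

Raising the threshold in this toy case removes the one false junction but also drops a weakly supported true junction, which is the trade-off the plots in this figure trace.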
Figure 5.
Fraction of tags containing a true junction recovered (i.e. aligned to include the junction) as a function of junction coverage (defined by exponential bins). (A) TopHat recovers about 63% of the tags while (B) MapSplice recovers an average of 84% of the tags at each junction. The whiskers in the box plot with a recovery ratio >1 at very low coverage are due to false positives or repeats in rare cases.
Figure 6.
Correlation of exon skipping ratio detected by MapSplice and Taqman. Each point represents the exon skipping ratio measured in either the MCF-7 (black) or SUM-102 (blue) cell lines.
Figure 7.
Examples of alternative exon skipping events. The second exon in NUMB shows differential alternative splicing between two cancer subtypes. The exon skipping ratios in basal samples are ∼70% while in luminal samples they are <50%.
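The captions report exon skipping ratios without stating the formula. A common convention, assumed here and not confirmed by the source, divides the number of tags on the skipping junction by that number plus the mean number of tags on the two inclusion junctions. The function name `skipping_ratio` and its arguments are hypothetical.

```python
from typing import Optional


def skipping_ratio(skip_tags: int, incl_up_tags: int,
                   incl_down_tags: int) -> Optional[float]:
    """Exon skipping ratio under an assumed junction-count convention.

    skip_tags      : tags spanning the junction that joins the flanking exons
    incl_up_tags   : tags spanning the upstream-exon / cassette-exon junction
    incl_down_tags : tags spanning the cassette-exon / downstream-exon junction
    Returns None when no junction has any support.
    """
    inclusion = (incl_up_tags + incl_down_tags) / 2
    total = skip_tags + inclusion
    if total == 0:
        return None
    return skip_tags / total


print(skipping_ratio(7, 3, 3))   # mostly skipped
print(skipping_ratio(2, 6, 10))  # mostly included
```

Averaging the two inclusion junctions keeps an included exon from being counted twice relative to the single skipping junction. Other normalizations are possible.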
Figure 8.
Clustering of tumor subtypes with skipping ratios of alternative exon skipping events. One hundred twenty-nine alternative exon skipping events with minimum junction support of at least three for each sample were selected. (A) Heatmap (red to blue scale) of skipping ratios, where each row corresponds to one distinct exon skipping event and each column represents a single sample. We performed hierarchical clustering on both the rows and columns. The dendrograms are shown on the left and top of the heatmap, respectively. (B) We applied principal component analysis (PCA) on the correlation matrix of the eight samples. The scatter plot shows the relative position of the eight samples in the 2D space formed by the first principal component and the second principal component. The plot shows good separation between two cancer subtypes along the second principal component. (C) We applied an ANOVA test on the skipping ratio matrix in (A). We selected 12 events that significantly differentiate between the two tumor subtypes at a P-value threshold of 0.001. The matrix of their skipping ratios is shown in the heatmap. Both rows and columns were clustered. (D) A scatter plot of the eight samples along the first and second principal components generated from the PCA of the correlation distance matrix of the eight samples based on the 11 selected events.
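Panels (B) and (D) start from a correlation (distance) matrix between samples. The sketch below shows one way such a matrix could be built from a skipping ratio matrix. It assumes Pearson correlation and a distance of 1 − r; neither choice is stated in the caption. The sample names and values are invented for illustration.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length vectors (fails on constant vectors)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance_matrix(samples: dict) -> dict:
    """Pairwise distance 1 - r between samples.

    samples maps a sample name to its vector of skipping ratios,
    one value per exon skipping event, in the same event order.
    """
    names = list(samples)
    return {a: {b: 1.0 - pearson(samples[a], samples[b]) for b in names}
            for a in names}


# Hypothetical skipping ratios for three events in three samples.
ratios = {
    "basal_1":   [0.1, 0.5, 0.9],
    "basal_2":   [0.2, 0.6, 1.0],
    "luminal_1": [0.9, 0.5, 0.1],
}
dist = correlation_distance_matrix(ratios)
print(round(dist["basal_1"]["basal_2"], 6))    # close to 0: same trend
print(round(dist["basal_1"]["luminal_1"], 6))  # close to 2: opposite trend
```

A matrix of this form could then be passed to a hierarchical clustering or PCA routine. Those steps are omitted here to keep the sketch dependency-free.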
