Nucleic Acids Res. 2010 Oct;38(18):e178.

doi: 10.1093/nar/gkq622. Epub 2010 Aug 27.

MapSplice: accurate mapping of RNA-seq reads for splice junction discovery

Kai Wang¹, Darshan Singh, Zheng Zeng, Stephen J Coleman, Yan Huang, Gleb L Savich, Xiaping He, Piotr Mieczkowski, Sara A Grimm, Charles M Perou, James N MacLeod, Derek Y Chiang, Jan F Prins, Jinze Liu

PMID: 20802226
PMCID: PMC2952873
DOI: 10.1093/nar/gkq622

Kai Wang et al. Nucleic Acids Res. 2010 Oct.

Affiliation

¹ Department of Computer Science, University of Kentucky, Lexington, KY 40506, USA.

Abstract

The accurate mapping of reads that span splice junctions is a critical component of all analytic techniques that work with RNA-seq data. We introduce a second-generation splice detection algorithm, MapSplice, whose focus is high sensitivity and specificity in the detection of splices as well as CPU and memory efficiency. MapSplice can be applied to both short (<75 bp) and long (≥75 bp) reads. MapSplice is not dependent on splice site features or intron length; consequently, it can detect novel canonical as well as non-canonical splices. MapSplice leverages the quality and diversity of read alignments of a given splice to increase accuracy. We demonstrate that MapSplice achieves higher sensitivity and specificity than TopHat and SpliceMap on a set of simulated RNA-seq data. Experimental studies also support the accuracy of the algorithm. Splice junctions derived from eight breast cancer RNA-seq datasets recapitulated the extensiveness of alternative splicing on a global level as well as the differences between molecular subtypes of breast cancer. These combined results indicate that MapSplice is a highly accurate algorithm for the alignment of RNA-seq reads to splice junctions. Software download URL: http://www.netlab.uky.edu/p/bioinfo/MapSplice.

Figures

**Figure 1.**
An overview of the MapSplice pipeline. The algorithm contains two phases: tag alignment (Step 1–Step 4) and splice inference (Step 5–Step 6). In the ‘tag alignment’ phase, candidate alignments of the mRNA tags to the reference genome are determined. In the ‘splice inference’ phase, splice junctions that appear in one or more tag alignments are analyzed to determine a splice significance score based on the quality and diversity of alignments that include the splice. Ambiguous candidate alignments are resolved by selecting the alignment with the overall highest quality match and highest confidence splice junctions.

**Figure 2.**
A portion of an mRNA transcript sampled by tag *T* consists of the 3′ end of exon 1, all of exon 2 and the 5′ end of exon 3. *T* is split into segments t₁,…, *t_n*, each of length *k*, to identify the alignment of *T* to the genome. Provided no exon has a length less than 2*k* nucleotides, at least one of every two consecutive segments must have an exonic alignment. In this example with *n* = 4, segments t₁ and t₃ have exonic alignment. Segment t₂ has spliced alignment; the splice junction can be easily discovered using the double-anchor search method starting from t₁ and t₃. The spliced alignment for t₄ is discovered by searching downstream in the genome for an occurrence of a short suffix of t₄. When such an occurrence is found, the double-anchor search method is used to evaluate a possible splice junction between t₃ and the suffix occurrence.
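The segment-and-anchor idea in this caption can be illustrated with a short Python sketch. It is not MapSplice's implementation: it assumes exact matching on a toy genome string, takes the genome positions of the two exonic neighbours as already known, and uses the hypothetical names `split_tag` and `double_anchor_search` (the segment length is called `k` in the code).

```python
def split_tag(tag, k):
    """Split a tag into consecutive segments of length k (a shorter tail is dropped)."""
    return [tag[i:i + k] for i in range(0, len(tag) - k + 1, k)]


def double_anchor_search(genome, left_end, right_start, segment):
    """Locate a splice inside `segment`, which sits between two anchored neighbours.

    left_end    -- genome index just past the exonic segment on the left
    right_start -- genome index where the exonic segment on the right begins
    Returns (donor, acceptor) such that
    segment == genome[left_end:donor] + genome[acceptor:right_start],
    i.e. the intron occupies genome[donor:acceptor]; None if no exact split exists.
    """
    k = len(segment)
    for i in range(1, k):  # i bases before the junction, k - i bases after it
        donor = left_end + i
        acceptor = right_start - (k - i)
        if acceptor <= donor:
            continue  # no room for an intron between the two parts
        if (genome[left_end:donor] == segment[:i]
                and genome[acceptor:right_start] == segment[i:]):
            return donor, acceptor
    return None


# Toy genome (hypothetical sequences): two exons separated by a 17-base intron.
exon1 = "ACGTACGTAC"
intron = "GTAAG" + "T" * 9 + "CAG"
exon2 = "TTGACCATGG"
genome = exon1 + intron + exon2

tag = exon1[4:] + exon2            # 16-base tag spanning the junction
t1, t2, t3, t4 = split_tag(tag, 4)

# t1 ends at genome index 8 and t3 starts at index 29 (both exonic);
# t2 must therefore contain the junction.
junction = double_anchor_search(genome, 8, 29, t2)
print(junction)
```

Because both flanking positions are fixed, the search only has to try the possible split points inside one segment, which is why the caption calls the junction easy to discover in this case.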

**Figure 3.**
ROC curves for junction classification. A synthetic data set of 20M 100 bp tags was generated from transcripts selected from the ASTD database. 10K true-positive junctions and 10K false-positive junctions were selected as training data sets. Five different metrics were evaluated. They include (i) alignment quality; (ii) anchor significance; (iii) entropy; (iv) coverage; and (v) combination of metrics (i–iii). The red cross in each curve marks the point with best balance of sensitivity and specificity.
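Of the metrics listed, entropy is the least self-explanatory. One plausible reading, sketched below, is the Shannon entropy of where the junction falls within the tags that support it: a junction supported by tags at many different offsets scores higher than one supported by a stack of identical alignments. The caption does not give the formula, so this definition and the function name `junction_entropy` are assumptions.

```python
import math
from collections import Counter


def junction_entropy(offsets):
    """Shannon entropy (in bits) of junction offsets within supporting tags.

    offsets -- one integer per supporting tag: the position in that tag
               at which the splice junction occurs.
    """
    counts = Counter(offsets)
    total = len(offsets)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Four tags that all place the junction at the same offset: no diversity.
stacked = junction_entropy([10, 10, 10, 10])
# Four tags that place the junction at four different offsets: maximal diversity.
spread = junction_entropy([5, 10, 15, 20])
print(stacked, spread)
```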

**Figure 4.**
Sensitivity and specificity of splice inference in synthetic data sets with different characteristics. The sensitivity is the fraction of true junctions discovered among the true junctions sampled in the synthetic data. The specificity is the fraction of true junctions within the reported junctions. Since the depth of sampling is essential for the junction to be discovered, we plot the sensitivity and specificity as a function of the coverage threshold. (A) and (B) The sensitivity and specificity for perfect tags and tags seeded with sequencing errors. (C) and (D) Sensitivity and specificity compared at different tag lengths (50 bp, 75 bp and 100 bp). (E) and (F) Sensitivity and specificity compared at two different depths of sampling (10M and 20M tags, respectively).
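The two definitions in this caption translate directly into set arithmetic. The sketch below assumes junctions are identified by (donor, acceptor) coordinate pairs and that the coverage threshold is applied both to the sampled true junctions and to the reported junctions; the caption does not spell out how the threshold is applied, so that detail, the data, and the function name are illustrative assumptions.

```python
def sensitivity_specificity(true_cov, reported_cov, threshold):
    """Compute (sensitivity, specificity) at a coverage threshold.

    true_cov     -- dict: true junction -> number of sampled tags covering it
    reported_cov -- dict: reported junction -> number of supporting tags
    Either value is None when its denominator is zero.
    """
    true_set = {j for j, c in true_cov.items() if c >= threshold}
    reported_set = {j for j, c in reported_cov.items() if c >= threshold}

    # Sensitivity: true junctions discovered / true junctions sampled.
    discovered = true_set & set(reported_cov)
    sensitivity = len(discovered) / len(true_set) if true_set else None

    # Specificity (as defined in the caption): true junctions / reported junctions.
    correct = reported_set & set(true_cov)
    specificity = len(correct) / len(reported_set) if reported_set else None
    return sensitivity, specificity


# Hypothetical junctions keyed by (donor, acceptor) coordinates.
true_cov = {(100, 200): 12, (300, 400): 3, (500, 600): 1}
reported_cov = {(100, 200): 10, (300, 400): 2, (700, 800): 1}

for threshold in (1, 2):
    print(threshold, sensitivity_specificity(true_cov, reported_cov, threshold))
```

Note that the caption's "specificity" is the fraction of reported junctions that are true, which many texts call precision.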

**Figure 5.**
Fraction of tags containing a true junction recovered (i.e. aligned to include the junction) as a function of junction coverage (defined by exponential bins). (A) TopHat recovers about 63% of the tags, while (B) MapSplice recovers an average of 84% of the tags at each junction. The whiskers in the box plot with a recovery ratio >1 at very low coverage are due to false positives or repeats in rare cases.

**Figure 6.**
Correlation of exon skipping ratio detected by MapSplice and TaqMan. Each point represents the exon skipping ratio measured in either the MCF-7 (black) or SUM-102 (blue) cell lines.

**Figure 7.**
Examples of alternative exon skipping events. The second exon in NUMB shows differential alternative splicing between two cancer subtypes. The exon skipping ratios in basal samples are ∼70% while in luminal samples they are <50%.
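The captions report exon skipping ratios without defining them. A common way to compute such a ratio from junction counts is sketched below, as an assumption rather than the paper's stated formula: tags on the exon-skipping junction divided by the sum of those tags and the average number of tags on the two inclusion junctions. The function name and the counts are hypothetical.

```python
def exon_skipping_ratio(skip_tags, upstream_incl_tags, downstream_incl_tags):
    """Fraction of junction evidence that supports skipping the exon.

    skip_tags            -- tags spanning the junction that skips the exon
    upstream_incl_tags   -- tags spanning the junction into the exon
    downstream_incl_tags -- tags spanning the junction out of the exon
    Returns None when there is no junction evidence at all.
    """
    # Inclusion is supported by two junctions, so average them to keep
    # the two outcomes on the same scale.
    inclusion = (upstream_incl_tags + downstream_incl_tags) / 2
    total = skip_tags + inclusion
    if total == 0:
        return None
    return skip_tags / total


# Hypothetical counts chosen to resemble the two regimes in the caption.
basal_like = exon_skipping_ratio(70, 30, 30)
luminal_like = exon_skipping_ratio(20, 35, 45)
print(basal_like, luminal_like)
```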

**Figure 8.**
Clustering of tumor subtypes with skipping ratios of alternative exon skipping events. One hundred twenty-nine alternative exon skipping events with junction support of at least three in each sample were selected. (A) Heatmap (red to blue scale) of skipping ratios, where each row corresponds to one distinct exon skipping event and each column represents a single sample. We performed hierarchical clustering on both the rows and columns. The dendrograms are shown on the left and top of the heatmap, respectively. (B) We applied principal component analysis (PCA) on the correlation matrix of the eight samples. The scatter plot shows the relative position of the eight samples in the 2D space formed by the first and second principal components. The plot shows good separation between the two cancer subtypes along the second principal component. (C) We applied an ANOVA test on the skipping ratio matrix in (A). We selected 12 events that significantly differentiate between the two tumor subtypes with a *P*-value threshold of 0.001. The matrix of their skipping ratios is shown in the heatmap. Both rows and columns were clustered. (D) A scatter plot of the eight samples along the first and second principal components generated from the PCA of the correlation distance matrix of the eight samples based on the 11 selected events.
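The ANOVA step in panel (C) can be illustrated for a single exon skipping event. The sketch below computes the one-way ANOVA F statistic from the skipping ratios of two groups of samples using only the standard library. The skipping ratios and the four-versus-four split are hypothetical, and converting F to a *P*-value requires the F distribution, which is not shown here.

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]

    # Variation explained by group membership.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Residual variation inside each group.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical skipping ratios of one event in two subtypes.
basal = [0.70, 0.72, 0.68, 0.74]
luminal = [0.40, 0.45, 0.35, 0.42]
f_stat = one_way_anova_f([basal, luminal])
print(round(f_stat, 1))
```

A large F means the difference between subtype means is large relative to the spread within each subtype; events would then be ranked or filtered by the corresponding *P*-value.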
