The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote
- PMID: 23558742
- PMCID: PMC3664803
- DOI: 10.1093/nar/gkt214
Abstract
Read alignment is an ongoing challenge for the analysis of data from sequencing technologies. This article proposes an elegantly simple multi-seed strategy, called seed-and-vote, for mapping reads to a reference genome. The new strategy chooses the mapped genomic location for the read directly from the seeds. It uses a relatively large number of short seeds (called subreads) extracted from each read and allows all the seeds to vote on the optimal location. When the read length is <160 bp, overlapping subreads are used. More conventional alignment algorithms are then used to fill in detailed mismatch and indel information between the subreads that make up the winning voting block. The strategy is fast because the overall genomic location has already been chosen before the detailed alignment is done. It is sensitive because no individual subread is required to map exactly, nor are individual subreads constrained to map close by other subreads. It is accurate because the final location must be supported by several different subreads. The strategy extends easily to find exon junctions, by locating reads that contain sets of subreads mapping to different exons of the same gene. It scales up efficiently for longer reads.
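The voting step described in the abstract can be illustrated with a minimal sketch. It is not the Subread implementation: the function names, the dictionary-based k-mer index, the evenly spaced subread offsets and the tie-breaking rule are all assumptions made for illustration. It also votes on an exact offset only, so unlike the published method it does not tolerate indels between subreads.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs.
    (Hypothetical helper; a plain dict stands in for a real hash index.)"""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def seed_and_vote(read, index, k, n_subreads):
    """Return (best_location, votes), or (None, 0) if no subread hits.

    best_location is the candidate reference position of the read's
    first base that collected the most subread votes.
    """
    last = len(read) - k
    if last < 0 or n_subreads < 1:
        return None, 0
    if n_subreads == 1:
        offsets = [0]
    else:
        # Evenly spaced start offsets; subreads overlap when the read is short.
        offsets = sorted({(i * last) // (n_subreads - 1) for i in range(n_subreads)})
    votes = defaultdict(int)
    for off in offsets:
        # Each exact hit of a subread votes for one read start location.
        for pos in index.get(read[off:off + k], ()):
            votes[pos - off] += 1
    if not votes:
        return None, 0
    # Assumed tie-break: most votes first, then the leftmost location.
    best = max(votes, key=lambda loc: (votes[loc], -loc))
    return best, votes[best]
```

In this sketch a read with a single mismatch still maps, because only the subreads covering the mismatch lose their vote while the remaining subreads agree on the location.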
Figures
) of the two mapped regions in the reference is equal to the size (L) of the read region that is spanned by the subreads that vote for the best two mapping locations, the determined splicing points will be recorded in the putative exon–exon junction table. Anchor locations of each read in the genome and in the read are also recorded, which gives the mapping location to which the read is best mapped and the location of the leftmost base of the set of extracted subreads that vote for that location, respectively. Anchor locations will be used for retrieving putative splicing points and for the validation performed by the second scan. The first scan is applied to all the reads, and two tables are produced on completion. These two tables include chromosomal locations of putative splicing points found for each exon–exon junction and anchor information for each read, respectively. The input to the second scan includes these two tables and also the read data. For each read, the second scan uses its anchor location to search for the putative splicing points falling within the read from the junction table output from the first scan and then examines all mapping possibilities (including mapping the read as an exonic read) to eventually determine how the read should be mapped. The similarity between the read sequence and the mapped regions when it is mapped as a junction read has to be greater than that from being mapped as an exonic read (i.e.
), if it is called a junction read. The cyan dashed line indicates the mapping location of the first base or the last base of the read when it is assumed that the read does not contain junctions. Putative splicing points are removed from the final results if they have no supporting reads after the second scan is completed. The final output from this two-scan procedure is a table of validated exon–exon junctions with the number of supporting reads included, and also the complete mapping results for each read, including CIGAR strings, which describe how each base in each read is mapped.
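The final filtering step of the two-scan procedure, in which putative splicing points without supporting reads are dropped, might be sketched as follows. The data layout is hypothetical: a junction is represented as a (chromosome, donor position, acceptor position) tuple, and read_calls maps each read identifier to the junctions the second scan assigned to it (an empty list for a read mapped as exonic). None of these names come from the article.

```python
from collections import Counter

def validate_junctions(putative_junctions, read_calls):
    """Return {junction: supporting_read_count} for junctions with support.

    putative_junctions: iterable of junction tuples from the first scan.
    read_calls: dict of read id -> list of junctions called in the second scan.
    """
    support = Counter()
    for junctions in read_calls.values():
        # Count each read at most once per junction.
        for junction in set(junctions):
            support[junction] += 1
    # Keep only putative junctions that at least one read supports.
    return {j: support[j] for j in putative_junctions if support[j] > 0}
```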