The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

Yang Liao¹, Gordon K Smyth, Wei Shi

Affiliations

PMID: 23558742
PMCID: PMC3664803
DOI: 10.1093/nar/gkt214

The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

Yang Liao et al. Nucleic Acids Res. 2013.

. 2013 May 1;41(10):e108.

doi: 10.1093/nar/gkt214. Epub 2013 Apr 4.

Authors

Yang Liao¹, Gordon K Smyth, Wei Shi

Affiliation

¹ Division of Bioinformatics, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia.

PMID: 23558742
PMCID: PMC3664803
DOI: 10.1093/nar/gkt214

Abstract

Read alignment is an ongoing challenge for the analysis of data from sequencing technologies. This article proposes an elegantly simple multi-seed strategy, called seed-and-vote, for mapping reads to a reference genome. The new strategy chooses the mapped genomic location for the read directly from the seeds. It uses a relatively large number of short seeds (called subreads) extracted from each read and allows all the seeds to vote on the optimal location. When the read length is <160 bp, overlapping subreads are used. More conventional alignment algorithms are then used to fill in detailed mismatch and indel information between the subreads that make up the winning voting block. The strategy is fast because the overall genomic location has already been chosen before the detailed alignment is done. It is sensitive because no individual subread is required to map exactly, nor are individual subreads constrained to map close by other subreads. It is accurate because the final location must be supported by several different subreads. The strategy extends easily to find exon junctions, by locating reads that contain sets of subreads mapping to different exons of the same gene. It scales up efficiently for longer reads.

PubMed Disclaimer

Figures

**Figure 1.**
Seed-and-vote mapping paradigm. (A) Schematic of the proposed mapping paradigm. Subreads (or seeds) are short continuous sequences extracted from each read. Substrings in green are uninformative subreads, and they are excluded from voting. Little red bars denote mismatched bases. Mapping location of the read is determined by the largest consensus set. The thin solid arrows point to the mapping location of each subread included in the largest consensus set. Mapping location of the read, as indicated by the black up-pointing triangle, is voted for by all the subreads in the largest consensus set. The dashed arrows indicate other mapping locations for the subreads, and these locations were disregarded due to insufficient number of votes. (B) Using an artificial example to illustrate the paradigm. Six subreads are extracted from the artificial read. Each square bracket denotes an extracted subread, which contains five continuous bases, and the number embedded in the blue cycle indicates the subread number. Base sequence of each subread is encoded into a string of 0’s and 1’s (each base is encoded into a 2-bit binary number). Encoded value for each subread is used as its key in the hash table. The key’s value gives the chromosomal location/s in the genome to which the corresponding subread is perfectly matched (no mismatches allowed). Four candidate mapping locations are found for this artificial read, which receive 2, 5, 1 and 2 votes (number of consensus subreads), respectively. The location that receives the largest number of votes, in this case the location with five votes, is selected as the final mapping location for this artificial read. (C) Indel detection performed under the seed-and-vote paradigm. (C1) shows the mapping results of subreads when there are no indels found in the reads (assuming no mismatches exist in the read for simplicity). (C2) and (C3) show respectively the schematic for detecting an insertion (Ins) and a deletion (Del) in the situation where insertion or deletion is found in the read and flanking subreads are found at both sides of insertion or deletion. (C4) gives the schematic for detecting indels when they occur at the locations close to the end of the reads where flanking subreads can be found at only one side. In (C2) and (C3), chromosomal locations pointed to by red arrows are the true mapping locations of subreads 8, 9 and 10, respectively, and chromosomal locations pointed to by dotted black arrows indicate the chromosomal locations to which they will be mapped if no indels exist before them. d is the indel length, equal to the difference between the location pointed to by the red arrow and the location pointed to by the dotted black arrow from the same subread. Regions encompassed by the dotted green lines are found to contain indels [(C2) and (C3)] or are candidate regions for searching indels (C4). Bases in these regions are not covered by subreads that have made successful votes, and their mapping locations will be determined by aligning to the corresponding regions (within the dotted green lines) in the reference genome. In (C4), a 4 bp window is moved along the uncovered bases to look for potential indels. When three or more bases in the window are found to be mismatches, the indel detection process is triggered for the search of indels.

**Figure 2.**
Schematic for detecting exon–exon junctions under seed-and-vote paradigm. A two-scan procedure is used to detect exon–exon junctions and to determine the mapping of each read. Three artificial reads are used to illustrate this procedure (Read 1, Read 2 and Read 50). In the first scan, a set of subreads are extracted from each read and mapped to the reference genome. The best two mapping locations from each read, which receives the two largest numbers of votes, are selected for further consideration. If donor and receptor sites are found between these two locations and total size () of the two mapped regions in the reference is equal to the size (L) of the read region that is spanned by the subreads that vote for the best two mapping locations, the determined splicing points will be recorded in the putative exon–exon junction table. Anchor locations of each read in the genome and in the read are also recorded, which gives the mapping location to which the read is best mapped and the location of the leftmost base of the set of extracted subreads that vote for that location, respectively. Anchor locations will be used for retrieving putative splicing points and for the validation performed by the second scan. The first scan is applied to all the reads, and two tables are produced on completion. These two tables include chromosomal locations of putative splicing points found for each exon–exon junction and anchor information for each read, respectively. The input to the second scan includes these two tables and also the read data. For each read, the second scan uses its anchor location to search for the putative splicing points falling within the read from the junction table output from the first scan and then examines all mapping possibilities (including mapping the read as an exonic read) to eventually determine how the read should be mapped. The similarity between the read sequence and the mapped regions when it is mapped as a junction read has to be greater than that from being mapped as an exonic read (i.e. ), if it is called a junction read. The cyan dashed line indicates the mapping location of the first base or the last base of the read when it is assumed that the read does not contain junctions. Putative splicing points are removed from the final results if they are found to not have any supporting reads after the second scan is completed. The final output from this two-scan procedure is a table of validated exon–exon junctions with the number of supporting reads included, and also the complete mapping results for each read including CIGAR strings, which describes how each base in each read is mapped.

formula image — **Figure 2.**
Schematic for detecting exon–exon junctions under seed-and-vote paradigm. A two-scan procedure is used to detect exon–exon junctions and to determine the mapping of each read. Three artificial reads are used to illustrate this procedure (Read 1, Read 2 and Read 50). In the first scan, a set of subreads are extracted from each read and mapped to the reference genome. The best two mapping locations from each read, which receives the two largest numbers of votes, are selected for further consideration. If donor and receptor sites are found between these two locations and total size () of the two mapped regions in the reference is equal to the size (L) of the read region that is spanned by the subreads that vote for the best two mapping locations, the determined splicing points will be recorded in the putative exon–exon junction table. Anchor locations of each read in the genome and in the read are also recorded, which gives the mapping location to which the read is best mapped and the location of the leftmost base of the set of extracted subreads that vote for that location, respectively. Anchor locations will be used for retrieving putative splicing points and for the validation performed by the second scan. The first scan is applied to all the reads, and two tables are produced on completion. These two tables include chromosomal locations of putative splicing points found for each exon–exon junction and anchor information for each read, respectively. The input to the second scan includes these two tables and also the read data. For each read, the second scan uses its anchor location to search for the putative splicing points falling within the read from the junction table output from the first scan and then examines all mapping possibilities (including mapping the read as an exonic read) to eventually determine how the read should be mapped. The similarity between the read sequence and the mapped regions when it is mapped as a junction read has to be greater than that from being mapped as an exonic read (i.e. ), if it is called a junction read. The cyan dashed line indicates the mapping location of the first base or the last base of the read when it is assumed that the read does not contain junctions. Putative splicing points are removed from the final results if they are found to not have any supporting reads after the second scan is completed. The final output from this two-scan procedure is a table of validated exon–exon junctions with the number of supporting reads included, and also the complete mapping results for each read including CIGAR strings, which describes how each base in each read is mapped.

**Figure 3.**
Performance of aligners in detecting deletions of different sizes. The horizontal axis gives the cumulative deletion sizes. For each size, the data sets with equal or smaller deletion sizes were combined and used for calculating the recall and accuracy for that size for each aligner. Aligners were run at their best possible settings for the detection of deletions with different sizes.

**Figure 4.**
Recall and accuracy of aligners with respect to MQSs. (a) and (c) give the cumulative number of correctly mapped reads and incorrectly mapped reads from high to low mapping quality. (b) and (d) show the cumulative accuracy and error fractions from high to low mapping quality. (a) and (b) use the indel data set included in Table 4, and (c) and (d) use the Mason data set included in Table 5. Each point in each plot corresponds to an MQS given by an aligner. Subread was run with default setting in (a) and (b) and with -f 100 in (c) and (d).

See this image and copyright information in PMC

References

1. Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. - PMC - PubMed
1. Marco-Sola S, Sammeth M, Guig R, Ribeca P. The GEM mapper: fast, accurate and versatile alignment by filtration. Nat. Methods. 2012;9:1185–1188. - PubMed
1. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. - PMC - PubMed
1. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. - PMC - PubMed
1. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
- scite Smart Citations
Molecular Biology Databases
- NIAID Data Ecosystem - Find datasets on Infectious and Immune-mediated Diseases

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

Affiliation

The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources

Molecular Biology Databases