BMC Bioinformatics. 2013 Jun 7;14:184.

doi: 10.1186/1471-2105-14-184.

Benchmarking short sequence mapping tools

Ayat Hatem¹, Doruk Bozdağ, Amanda E Toland, Ümit V Çatalyürek

PMID: 23758764
PMCID: PMC3694458
DOI: 10.1186/1471-2105-14-184

Ayat Hatem et al. BMC Bioinformatics. 2013.

Affiliation

¹ Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA.

Abstract

Background: Next-generation sequencing instruments generate millions of short sequences in a single run. Aligning these reads to a reference genome is time-consuming and demands fast and accurate alignment tools. However, the currently proposed tools make different compromises between the accuracy and the speed of mapping. Moreover, many important aspects are overlooked when the performance of a newly developed tool is compared with the state of the art. Therefore, there is a need for an objective evaluation method that covers all of these aspects. In this work, we introduce a benchmarking suite to extensively analyze sequencing tools with respect to various aspects and provide an objective comparison.

Results: We applied our benchmarking tests to nine well-known mapping tools, namely Bowtie, Bowtie2, BWA, SOAP2, MAQ, RMAP, GSNAP, Novoalign, and mrsFAST (mrFAST), using synthetic data and real RNA-Seq data. MAQ and RMAP are based on building hash tables for the reads, whereas the remaining tools are based on indexing the reference genome. The benchmarking tests reveal the strengths and weaknesses of each tool. The results show that no single tool outperforms all others in all metrics. However, Bowtie maintained the best throughput for most of the tests, while BWA performed better for longer read lengths. The benchmarking tests are not restricted to the mentioned tools and can be applied to others.

Conclusion: The mapping process is still a hard problem that is affected by many factors. In this work, we provided a benchmarking suite that reveals and evaluates the different factors affecting the mapping process. Still, no tool outperforms all of the others in all the tests. Therefore, end users should clearly specify their needs in order to choose the tool that provides the best results.

Figures

**Figure 1**
**Evaluation criteria.** An example showing how the different evaluation criteria work. In the upper part of the figure, the sequence in blue is the original genomic position from which the simulated read was extracted. After sequencing errors are applied, the read no longer matches the original location exactly (3 mismatches). In the lower part of the figure, three possible alignment locations for the read are shown with their mapping quality scores (MQ). The naïve criterion would consider only alignment (1) as the correct alignment. For the Ruffalo et al. [32] criterion, if the threshold used is 30, then (1) is *correctly mapped* while (2) and (3) are *incorrectly mapped-strict*. On the other hand, if the threshold is 40, then (3) is considered *incorrectly mapped-relaxed*. The Holtgrewe et al. [31] criterion in oracle mode would detect (1) and (2) and consider them *correctly mapped*, while (3) would be considered *incorrectly mapped*.
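The threshold-based labels in this caption can be illustrated with a minimal sketch. It encodes one possible reading of the criterion, not the definitions from Ruffalo et al. themselves. The positions, the MQ values, the "below threshold" label, and the rule that an incorrect alignment counts as relaxed only when no correct alignment reaches the threshold are all assumptions chosen so that the outcome matches the example described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    label: str      # e.g. "(1)", "(2)", "(3)" as in the figure
    position: int   # reported start coordinate (hypothetical values below)
    mapq: int       # mapping quality score (MQ)


def classify(alignments, true_position, threshold):
    """Label each alignment under an assumed threshold-based rule."""
    # Does any alignment at the true position reach the threshold?
    confident_correct = any(
        a.position == true_position and a.mapq >= threshold for a in alignments
    )
    result = {}
    for a in alignments:
        if a.mapq < threshold:
            result[a.label] = "below threshold"  # assumed label, not from the paper
        elif a.position == true_position:
            result[a.label] = "correctly mapped"
        elif confident_correct:
            result[a.label] = "incorrectly mapped-strict"
        else:
            result[a.label] = "incorrectly mapped-relaxed"
    return result


# Hypothetical positions and MQ values; the figure's actual scores are not given here.
candidates = [
    Alignment("(1)", position=1000, mapq=35),
    Alignment("(2)", position=5200, mapq=30),
    Alignment("(3)", position=9100, mapq=45),
]

at_30 = classify(candidates, true_position=1000, threshold=30)
at_40 = classify(candidates, true_position=1000, threshold=40)
print(at_30)
print(at_40)
```

With these made-up scores, raising the threshold from 30 to 40 drops (1) and (2) below the cutoff, which is why (3) changes category in this sketch.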

**Figure 2**
**Default options effect using wgsim.** Mapping 1 million reads of length 125 extracted from the Human genome using wgsim. Each tool was allowed to use its own default options. BWA-ND refers to BWA's results when using Bowtie's default options: 2 mismatches in the seed, 3 mismatches in the whole read, and no gapped alignment.

**Figure 3**
**Default options effect using ART.** Mapping 1 million reads of length 100 extracted from the Human genome using ART. Each tool was allowed to use its own default options.

**Figure 4**
**Quality threshold vs. number of mismatches.** Mapping 1 million reads of length 125 extracted using wgsim from the Human genome while allowing up to 7 mismatches and a quality threshold of 140. The *error* is 0.6% for SOAP2 and MAQ and 0.45% for GSNAP.

**Figure 5**
**Effect of changing the number of mismatches using a synthetic data set extracted using wgsim.** Comparing the different tools while changing the total mismatches from 2 to 7. T-mms stands for the maximum allowed mismatches. A data set of 1 million reads of length 125 extracted from the Human genome using wgsim was used in this experiment.

**Figure 6**
**Effect of changing the number of mismatches using a synthetic data set extracted using ART.** Comparing the different tools while changing the total mismatches from 2 to 7. T-mms stands for the maximum allowed mismatches. A data set of 1 million reads of length 100 extracted from the Human genome using ART was used in this experiment.

**Figure 7**
**Effect of changing the number of mismatches using a real data set.** Comparing the different tools while changing the total mismatches from 2 to 7. T-mms stands for the maximum allowed mismatches. A real mRNA data set of 1 million reads of length 51 bps extracted from the Spretus mouse strain and mapped against the mouse genome version mm9 was used in this experiment.

**Figure 8**
**Effect of changing the seed length using a synthetic data set.** The effect of changing the seed length on the BWT based tools. The tools were used to map 16 million reads of length 70 bps on the Human genome. SOAP2 does not support seed length < 28.

**Figure 9**
**Effect of changing the seed length using a real data set.** The effect of changing the seed length on the BWT based tools. The tools were used to map real mRNA data set of 1 million reads of length 51 bps extracted from the Spretus mouse strain on the mouse genome version mm9. SOAP2 does not support seed length < 28.

**Figure 10**
**Effect of changing the read length using a synthetic data set extracted using wgsim.** The effect of changing the read length from 36 to 500. The reads were extracted from the Human genome. Because RMAP and MAQ are slower than the other tools, 1 million reads were used to test MAQ and RMAP while 16 million reads were used for the remaining ones.

**Figure 11**
**Effect of changing the read length using an ART-generated data set.** The effect of mapping 1 million reads extracted by ART from the mouse genome version mm9 while changing the read length from 36 to 100.

**Figure 12**
**Effect of using paired-end data using a wgsim synthetic data set.** The effect of mapping paired-end reads of length 70 to the Human genome. 1 million reads were used to test RMAP and MAQ while 16 million reads were used to test the other tools. SE and PE refer to single end and paired end, respectively. *Error* is only provided for PE due to exceeding the allowed insert size. mrsFAST is used for the ungapped alignment and mrFAST is used for the gapped one.

**Figure 13**
**Effect of changing the genome type using a wgsim-generated synthetic data set.** 16 million reads of length 70 bps were generated from the Human, Zebrafish, Lancelet, Chimpanzee, A. mellifera, and C. elegans genomes using wgsim for this test. 1 million reads were used for MAQ and RMAP.

**Figure 14**
**Effect of changing the genome type using an ART-generated synthetic data set.** 1 million reads of length 70 bps were generated from the Human, Zebrafish, Lancelet, Chimpanzee, A. mellifera, and C. elegans genomes using ART.

**Figure 15**
**Effect of enabling gapped alignment using a real data set.** mRNA data set of 1 million reads extracted from the Spretus mouse strain is used in this experiment and mapped on the mouse genome version mm9.

**Figure 16**
**Speedup when using multithreading and multiprocessing.** 16 million reads of length 125 were mapped to the Human genome while using multithreading (the upper figure) or multiprocessing (the lower figure).
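For readers unfamiliar with the metric, speedup is conventionally the single-worker run time divided by the run time with several threads or processes. The sketch below assumes that standard definition, which this caption does not spell out; the function names and the timings are hypothetical and are not measurements from the paper.

```python
def speedup(serial_seconds, parallel_seconds):
    """Conventional speedup: run time with one worker over run time with p workers."""
    if serial_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("run times must be positive")
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds, parallel_seconds, workers):
    """Speedup per worker; 1.0 would mean perfect linear scaling."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(serial_seconds, parallel_seconds) / workers


# Hypothetical wall-clock times in seconds, keyed by number of threads or processes.
timings = {1: 1200.0, 2: 640.0, 4: 350.0, 8: 200.0}

for workers, seconds in timings.items():
    s = speedup(timings[1], seconds)
    e = efficiency(timings[1], seconds, workers)
    print(f"{workers} workers: speedup {s:.2f}, efficiency {e:.2f}")
```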

Grants and funding

R01 CA133461/CA/NCI NIH HHS/United States
