Incorporating sequence quality data into alignment improves DNA read mapping

Martin C Frith¹, Raymond Wan, Paul Horton

PMID: 20110255
PMCID: PMC2853142
DOI: 10.1093/nar/gkq010

Nucleic Acids Res. 2010 Apr;38(7):e100. doi: 10.1093/nar/gkq010. Epub 2010 Jan 27.

Affiliation

¹ Computational Biology Research Center, Institute for Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, Japan. martin@cbrc.jp

Abstract

New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the proportion of correctly mapped reads from 49% to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
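To make the idea of modelling both error sources concrete, here is a minimal sketch of one way a per-base quality value could enter an alignment score. It is not taken from the paper. It assumes the standard Phred convention (error probability 10^(-Q/10)), a uniform background base frequency of 0.25, and that both real substitutions and sequencer errors replace a base with one of the other three bases uniformly. The function names and the default substitution rate are hypothetical.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality value Q to an error probability 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def base_probs(sub_rate, err_prob):
    """Return (P(read base == ref base), P(read base == one specific other base)).

    Assumed model: the true base differs from the reference with
    probability sub_rate, and the sequencer misreads the true base with
    probability err_prob; both pick uniformly among the 3 other bases.
    """
    p_match = (1 - sub_rate) * (1 - err_prob) + sub_rate * err_prob / 3
    p_mismatch = ((1 - sub_rate) * err_prob / 3        # no substitution, misread
                  + sub_rate * (1 - err_prob) / 3      # substitution, read correctly
                  + 2 * sub_rate * err_prob / 9)       # substitution and misread
    return p_match, p_mismatch

def column_score(read_base, ref_base, q, sub_rate=0.01):
    """Log-odds score (in bits) of one aligned column versus a 0.25 background."""
    e = phred_to_error_prob(q)
    p_match, p_mismatch = base_probs(sub_rate, e)
    p = p_match if read_base == ref_base else p_mismatch
    return math.log2(p / 0.25)
```

Under these assumptions, a mismatch at a low-quality position is penalized less than one at a high-quality position, and a match at a low-quality position earns less. Setting `sub_rate` to 0 models sequencer errors only, and setting the error probability to 0 models real differences only.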

Figures

**Figure 1.**
Estimated error rates for two DNA short-read data sets. (A) Error rates for a set of 36-bp reads from the Solexa 1G Genome Analyzer (the first 100 000 reads of SRR001981). (B) Error rates for a set of 51-bp reads from the Illumina Genome Analyzer II (the first 100 000 reads of SRR016157). For both panels, the error rates were obtained from FASTQ files in the NCBI Short Read Archive.

**Figure 2.**
Mapping accuracy for 100 000 simulated 36-bp reads. The reads differ from the genome by a certain rate of ‘real’ substitutions (0.2, 0.5, 1 or 2%) plus sequencer errors. Each line shows the relationship between the number of correctly and incorrectly mapped reads as the alignment score cutoff is varied. Circles indicate a score cutoff of 150. Dotted lines show the accuracy when we model the substitutions but not the sequencer errors. Dashed lines show the accuracy when we model the sequencer errors but not the substitutions. Solid lines show the accuracy when we model both.

**Figure 3.**
Mapping accuracy for 100 000 simulated 51-bp reads. See legend of Figure 2. Circles indicate a score cutoff of 180.

**Figure 4.**
Mapping accuracy for 100 000 simulated 36-bp reads using a mapping procedure that is guaranteed to find all matches with up to two substitutions. This is identical to Figure 2, except that a different mapping algorithm was used here.

**Figure 5.**
Estimated mapping accuracy for 100 000 real 36-bp reads from *D. melanogaster*, mapped to the *D. simulans* genome. Circles indicate a score cutoff of 150. The dotted line shows the mapping accuracy when we model the sequencer errors but not the real differences. The solid line shows the accuracy when we model both. The dashed red line shows the accuracy when we model both but forbid insertions and deletions. Correctness was estimated by mapping the reads to the *D. melanogaster* genome (modeling sequencer errors only), and using the UCSC *D. melanogaster* / *D. simulans* pairwise genome alignment to cross-reference the mappings.
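The curves in Figures 2 to 5 relate the number of correctly mapped reads to the number of incorrectly mapped reads as the alignment score cutoff is varied. A minimal sketch of how such a curve could be tabulated is shown here. It assumes each read has a single reported mapping with a score and a known (or estimated) correctness flag. The function name and the example data are hypothetical, not from the paper.

```python
def accuracy_curve(mappings):
    """Tabulate (cutoff, n_correct, n_incorrect) for each distinct score.

    mappings: iterable of (score, is_correct) pairs, one per mapped read.
    A read counts as reported at a cutoff if its score >= cutoff.
    """
    ordered = sorted(mappings, key=lambda m: m[0], reverse=True)
    curve = []
    correct = incorrect = 0
    i = 0
    while i < len(ordered):
        cutoff = ordered[i][0]
        # Consume every read that shares this score before emitting a point.
        while i < len(ordered) and ordered[i][0] == cutoff:
            if ordered[i][1]:
                correct += 1
            else:
                incorrect += 1
            i += 1
        curve.append((cutoff, correct, incorrect))
    return curve

# Hypothetical example data: (alignment score, mapped to the true locus?)
example = [(200, True), (180, True), (180, False), (150, True), (120, False)]
points = accuracy_curve(example)
```

Each tuple in `points` corresponds to one point on a curve; lowering the cutoff moves along the curve toward more reported reads of both kinds.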
