Nucleic Acids Res. 2010 Apr;38(7):e100.
doi: 10.1093/nar/gkq010. Epub 2010 Jan 27.

Incorporating sequence quality data into alignment improves DNA read mapping

Martin C Frith et al. Nucleic Acids Res. 2010 Apr.

Abstract

New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the proportion of correctly mapped reads from 49% to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct or are highly polymorphic.
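The idea of modeling both sequencer errors and real sequence differences can be illustrated with a small sketch. This is not the scoring scheme from the paper; it is a simplified model that assumes uniform substitutions (each wrong base equally likely), a uniform background base frequency of 0.25, and hypothetical function names. It computes a log-odds score for observing a read base opposite a genome base, given the read base's error probability and an assumed rate of real differences.

```python
import math

BASES = "ACGT"

def observed_given_genome(genome_base, read_base, error_prob, divergence):
    """P(observed read base | genome base), summing over the unknown true base.

    Assumptions (illustrative only): real differences and sequencer errors
    are both uniform over the three alternative bases.
    """
    total = 0.0
    for true_base in BASES:
        # Probability that the sequenced molecule really carries true_base.
        p_true = 1.0 - divergence if true_base == genome_base else divergence / 3.0
        # Probability that the sequencer reports read_base for true_base.
        p_obs = 1.0 - error_prob if true_base == read_base else error_prob / 3.0
        total += p_true * p_obs
    return total

def log_odds_score(genome_base, read_base, error_prob, divergence=0.002):
    """Log2 odds of the aligned pair versus a uniform random background."""
    p = observed_given_genome(genome_base, read_base, error_prob, divergence)
    return math.log2(p / 0.25)

if __name__ == "__main__":
    for e in (0.001, 0.1, 0.75):
        print(e, log_odds_score("A", "A", e), log_odds_score("A", "C", e))
```

Under this toy model, a mismatch at a low-quality position is penalized far less than a mismatch at a high-quality position, and a base with error probability 0.75 carries no information, so its score is zero whether or not it matches.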

Figures

Figure 1.
Estimated error rates for two DNA short-read data sets. (A) Error rates for a set of 36-bp reads from the Solexa 1G Genome Analyzer (the first 100 000 reads of SRR001981). (B) Error rates for a set of 51-bp reads from the Illumina Genome Analyzer II (the first 100 000 reads of SRR016157). For both panels, the error rates were obtained from FASTQ files in the NCBI Short Read Archive.
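As a rough illustration of how per-position error rates can be derived from FASTQ quality strings, the sketch below converts quality characters to error probabilities and averages them by read position. It assumes Phred-scaled qualities with an ASCII offset of 33; older Solexa-scaled files use a different encoding and would need a different conversion. The function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

def quality_string_to_error_probs(quality, offset=33):
    """Decode one FASTQ quality line (assumed Phred+33) into error probabilities."""
    return [phred_to_error_prob(ord(ch) - offset) for ch in quality]

def mean_error_by_position(quality_strings, offset=33):
    """Average the error probability at each read position over many reads."""
    totals, counts = [], []
    for quality in quality_strings:
        for i, p in enumerate(quality_string_to_error_probs(quality, offset)):
            if i == len(totals):
                # First read long enough to reach this position.
                totals.append(0.0)
                counts.append(0)
            totals[i] += p
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

if __name__ == "__main__":
    # Two toy 2-base reads: 'I' encodes Q40 and '+' encodes Q10 under Phred+33.
    print(mean_error_by_position(["I+", "II"]))
```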
Figure 2.
Mapping accuracy for 100 000 simulated 36-bp reads. The reads differ from the genome by a certain rate of ‘real’ substitutions (0.2, 0.5, 1 or 2%) plus sequencer errors. Each line shows the relationship between the number of correctly and incorrectly mapped reads as the alignment score cutoff is varied. Circles indicate a score cutoff of 150. Dotted lines show the accuracy when we model the substitutions but not the sequencer errors. Dashed lines show the accuracy when we model the sequencer errors but not the substitutions. Solid lines show the accuracy when we model both.
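The relationship described in this legend, correct versus incorrect mappings as the score cutoff varies, can be traced with a few lines of code. The sketch below is illustrative only: it assumes each read has a single best mapping with an alignment score and a known correctness flag (as is available for simulated reads), and the function name and input layout are hypothetical.

```python
def accuracy_curve(mappings, cutoffs):
    """For each cutoff, count correct and incorrect mappings scoring >= cutoff.

    mappings: iterable of (alignment_score, is_correct) pairs, one per read.
    Returns a list of (cutoff, n_correct, n_incorrect) tuples.
    """
    mappings = list(mappings)
    points = []
    for cutoff in cutoffs:
        kept = [ok for score, ok in mappings if score >= cutoff]
        n_correct = sum(1 for ok in kept if ok)
        points.append((cutoff, n_correct, len(kept) - n_correct))
    return points

if __name__ == "__main__":
    toy = [(200, True), (160, True), (150, False), (120, False), (120, True)]
    for point in accuracy_curve(toy, cutoffs=[150, 100]):
        print(point)
```

Lowering the cutoff keeps more reads, which raises both counts; plotting the incorrect count against the correct count gives one line of the kind shown in the figure.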
Figure 3.
Mapping accuracy for 100 000 simulated 51-bp reads. See legend of Figure 2. Circles indicate a score cutoff of 180.
Figure 4.
Mapping accuracy for 100 000 simulated 36-bp reads using a mapping procedure that is guaranteed to find all matches with up to two substitutions. This is identical to Figure 2, except that a different mapping algorithm was used here.
Figure 5.
Estimated mapping accuracy for 100 000 real 36-bp reads from D. melanogaster, mapped to the D. simulans genome. Circles indicate a score cutoff of 150. The dotted line shows the mapping accuracy when we model the sequencer errors but not the real differences. The solid line shows the accuracy when we model both. The dashed red line shows the accuracy when we model both but forbid insertions and deletions. Correctness was estimated by mapping the reads to the D. melanogaster genome (modeling sequencer errors only), and using the UCSC D. melanogaster / D. simulans pairwise genome alignment to cross-reference the mappings.

