BMC Genomics. 2012 May 10:13:178.
doi: 10.1186/1471-2164-13-178.

Improving ancient DNA read mapping against modern reference genomes

Mikkel Schubert et al. BMC Genomics. 2012.

Abstract

Background: Next-Generation Sequencing has revolutionized our approach to ancient DNA (aDNA) research by providing complete genomic sequences of ancient individuals and extinct species. However, the recovery of genetic material from long-dead organisms is still complicated by a number of issues, including post-mortem DNA damage and high levels of environmental contamination. Together with error profiles specific to the sequencing platform used, these factors could limit our ability to map sequencing reads against modern reference genomes and therefore to identify endogenous ancient reads, reducing the efficiency of aDNA shotgun sequencing.

Results: In this study, we compare different computational methods for improving the accuracy and sensitivity of aDNA sequence identification, based on shotgun sequencing reads recovered from Pleistocene horse extracts using Illumina GAIIx and Helicos Heliscope platforms. We show that the Burrows-Wheeler Aligner (BWA), which was developed for mapping undamaged sequencing reads from platforms with low rates of indel-type sequencing errors, can be employed at acceptable run-times by modifying default parameters in a platform-specific manner. We also examine whether trimming likely damaged positions at read ends can increase the recovery of genuine aDNA fragments, and whether human contamination can be accurately identified using a previously suggested strategy based on best hit filtering. We show that combining our different mapping and filtering approaches can increase the number of high-quality endogenous hits recovered by up to 33%.

Conclusions: We have shown that Illumina and Helicos sequences recovered from aDNA extracts could not be aligned to modern reference genomes with the same efficiency unless mapping parameters are optimized for the specific types of errors generated by these platforms and by post-mortem DNA damage. Our findings have important implications for future aDNA research, as we define mapping guidelines that improve our ability to identify genuine aDNA sequences, which in turn could improve the genotyping accuracy of ancient specimens. Our framework provides a significant improvement to the standard procedures used for characterizing ancient genomes, a task challenged by contamination and often low amounts of DNA material.

Figures

Figure 1
Exploring the effects of different sets of mapping parameters on BWA performance and runtime. Sequencing reads recovered from the sample showing an infinite radiocarbon date were aligned with BWA using different combinations of mapping parameters. Reads were considered high-quality when they mapped uniquely to the EquCab2 genome but not to the human genome (assembly hg19) and showed mapping qualities of at least 25. For Illumina, positive hits were filtered for PCR duplicates (see Methods). Performance and runtime are estimated relative to the standard default parameters. Left: Helicos tSMS reads. Right: Illumina reads.
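The high-quality criterion stated in this caption can be sketched as a small predicate. This is a minimal illustration, not the authors' pipeline: the `Hit` record and the function name are hypothetical, and PCR duplicate removal for Illumina reads is left out.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MAPQ = 25  # mapping-quality threshold given in the caption


@dataclass
class Hit:
    """Hypothetical summary of one read's alignment to one reference."""
    mapq: int      # mapping quality reported by the aligner
    unique: bool   # True if the read maps to a single location


def is_high_quality(horse: Optional[Hit], human: Optional[Hit],
                    min_mapq: int = MIN_MAPQ) -> bool:
    """Strict filter: unique EquCab2 hit, no hg19 hit, MAPQ >= min_mapq."""
    if horse is None or not horse.unique:
        return False
    if human is not None:  # any hit against the human genome disqualifies
        return False
    return horse.mapq >= min_mapq
```

A read with a unique horse hit at mapping quality 30 and no human hit passes; the same read with any human hit is rejected.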
Figure 2
Nucleotide misincorporation patterns observed with standard and optimized BWA mapping parameters. Patterns observed for the reads recovered with default or optimized BWA parameters are shown in the left and middle columns, respectively. Patterns observed for the fraction of high-quality hits identified only with the optimized set of parameters are shown on the right. For both Illumina and Helicos sequencing data, the seed was disabled in the optimized set of mapping parameters (-l 1024). For Helicos tSMS reads, we further increased the maximum number of gap opens to 2 (-o 2) as well as the edit distance (-n 0.03) and allowed for indels at read termini (-i 0). Red: C → T. Blue: G → A. Pink: Insertions. Green: Deletions. Orange: Clipped bases. Grey: Other misincorporations.
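The platform-specific settings listed in this caption can be collected into a command line as follows. This is a sketch that assumes the flags are passed to `bwa aln`; the helper name and the file names in the usage note are hypothetical, and the function only builds the argument list without running anything.

```python
from typing import List


def bwa_aln_args(platform: str, reference: str, reads: str) -> List[str]:
    """Build a `bwa aln` argument list using the optimized flags from the caption."""
    # Seed disabled for both platforms (-l 1024).
    args = ["bwa", "aln", "-l", "1024"]
    if platform == "helicos":
        # Helicos tSMS: two gap opens, -n 0.03, indels allowed at read ends.
        args += ["-o", "2", "-n", "0.03", "-i", "0"]
    elif platform != "illumina":
        raise ValueError(f"unknown platform: {platform!r}")
    return args + [reference, reads]
```

For example, `bwa_aln_args("helicos", "EquCab2.fasta", "reads.fastq")` yields a list that could be handed to `subprocess.run`, with both file names standing in for real paths.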
Figure 3
Exploring the effects of different sets of mapping parameters on BWA performance and runtime. Helicos sequencing reads recovered from the sample showing a finite radiocarbon date (13,389 ± 52 BP) were aligned with BWA using different combinations of mapping parameters. Reads were considered high-quality when they mapped uniquely to the EquCab2 genome but not to the human genome (assembly hg19) and showed mapping qualities of at least 25. Performance and runtime are estimated relative to the standard default parameters.
Figure 4
Nucleotide misincorporation patterns observed with standard and optimized BWA mapping parameters for reads mapping to both the horse and the chicken genomes. Patterns observed for alignments against the horse genome (EquCab2) using the reads recovered with default or optimized BWA parameters are shown on the left and on the right, respectively. For both Illumina and Helicos sequencing data, the seed was disabled in the optimized set of mapping parameters (-l 1024). For Helicos tSMS reads, we further increased the maximum number of gap opens to 2 (-o 2) as well as the edit distance (-n 0.03) and allowed for indels at read termini (-i 0). Reads were considered when they mapped uniquely to both the EquCab2 and galGal3 genomes but not to the human genome (assembly hg19) and showed mapping qualities of at least 25. Red: C → T. Blue: G → A. Pink: Insertions. Green: Deletions. Orange: Clipped bases. Grey: Other misincorporations.
Figure 5
Nucleotide misincorporation patterns observed after trimming the first or the first two bases of sequencing reads. Patterns are shown for the reads recovered after trimming the first base (left) or the first two bases (right) of the sequencing reads, provided that, in the absence of trimming, one or two successive nucleotide misincorporations would have been observed as a result of post-mortem cytosine deamination (see Methods). Red: C → T. Blue: G → A. Pink: Insertions. Green: Deletions. Orange: Clipped bases. Grey: Other misincorporations.
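One possible reading of this conditional trimming is sketched below. It is an assumption, since the exact procedure is in the Methods and not reproduced here. The hypothetical helper removes up to two leading bases, but only while each one appears as a C → T mismatch against the aligned reference; it assumes an ungapped alignment at the read start.

```python
def trim_deaminated_start(read: str, ref: str, max_trim: int = 2) -> str:
    """Drop leading bases that look like C -> T deamination (ref C, read T).

    `read` and `ref` are the read and the reference segment it aligns to,
    position by position. Trimming stops at the first position that is not
    a C -> T mismatch, so only successive misincorporations are removed.
    """
    n = 0
    for read_base, ref_base in zip(read[:max_trim], ref[:max_trim]):
        if ref_base == "C" and read_base == "T":
            n += 1
        else:
            break
    return read[n:]
```

Setting `max_trim=1` corresponds to trimming the first base only, and the default of 2 to trimming the first two bases.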
Figure 6
Nucleotide misincorporation patterns observed following different filtering procedures for human sequences. Helicos and Illumina sequencing reads recovered from the sample showing an infinite radiocarbon date were aligned with BWA using different combinations of mapping parameters. In a first mapping procedure (Panel A), reads were considered high-quality when they mapped uniquely to the EquCab2 genome but not to the human genome (assembly hg19). In a second mapping procedure (Panel B), reads were considered high-quality when they mapped uniquely to the EquCab2 genome, as long as no hit was observed against the human genome (assembly hg19) or the edit distance to the horse genome was lower than the edit distance to the human genome. In both procedures, high-quality reads had mapping qualities of at least 25. Nucleotide misincorporation patterns were plotted following mapping with the optimized set of BWA parameters for different subsets of reads. Panel A: alignments against the horse reference genome, excluding any read that also maps against the human reference genome (first column); alignments against the horse reference genome, for reads that also map against the human reference genome (second column); alignments against the human reference genome, for reads that also map against the horse reference genome (third column); alignments against the human reference genome, excluding any read that also maps against the horse reference genome (last column).
Panel B: reads showing hits to the horse reference genome only (first column); reads showing hits to the horse and the human reference genomes that were removed by the filtering procedure presented in Panel A (second column); reads showing hits to the horse and the human reference genomes but a lower edit distance to the horse genome (third column); reads showing hits to the horse and the human reference genomes but a lower or equal edit distance to the human genome (last column). Red: C → T. Blue: G → A. Pink: Insertions. Green: Deletions. Orange: Clipped bases. Grey: Other misincorporations.
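The Panel B rule, which keeps a horse hit when there is no human hit or when the horse alignment has the lower edit distance, can be sketched as follows. The `Hit` record and function name are hypothetical and are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """Hypothetical summary of one read's alignment to one reference."""
    mapq: int            # mapping quality
    edit_distance: int   # mismatches plus gaps against the reference
    unique: bool = True  # True if the read maps to a single location


def passes_best_hit(horse: Optional[Hit], human: Optional[Hit],
                    min_mapq: int = 25) -> bool:
    """Best-hit filter: keep horse hits that beat any human alternative."""
    if horse is None or not horse.unique or horse.mapq < min_mapq:
        return False
    if human is None:
        return True
    # Ties go to the human genome, matching the last column of Panel B.
    return horse.edit_distance < human.edit_distance
```

Compared with the strict filter of Panel A, this rule retains reads that also hit hg19 as long as they align strictly better to EquCab2.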
Figure 7
Divergence estimates based on different mapping and filtering procedures. Illumina reads recovered from the sample showing an infinite radiocarbon date were aligned with BWA using the default (top) or the recommended modified (bottom) combination of mapping parameters. High-quality hits were further filtered according to a strict criterion (no hit on the human genome) or a best hit criterion (high-quality horse reads are discarded if they show a better alternative hit on the human genome). Divergence to the modern reference genome and GC → AT misincorporation rates were calculated and are reported with black and red lines, respectively, as a function of base quality scores. Reads were either considered full length (0 nt) or masked for 5 nucleotides at both ends (5 nt). Further trimming (10 nucleotides) was performed and showed similar results (data not shown). BQ: Base-Quality.
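The kind of calculation summarized in this caption can be sketched as a per-base-quality mismatch rate with optional end masking. This is a simplified illustration under stated assumptions: alignments are ungapped, divergence is taken as the fraction of mismatching positions, and the function name and input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Alignment = Tuple[str, str, List[int]]  # (read, reference segment, base qualities)


def divergence_by_quality(alignments: Iterable[Alignment],
                          mask: int = 0) -> Dict[int, float]:
    """Mismatch rate per base-quality score, ignoring `mask` bases at each end."""
    mismatches: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for read, ref, quals in alignments:
        # Positions inside the masked ends are skipped entirely.
        for i in range(mask, len(read) - mask):
            q = quals[i]
            totals[q] += 1
            if read[i] != ref[i]:
                mismatches[q] += 1
    return {q: mismatches[q] / totals[q] for q in totals}
```

Calling it with `mask=0` and `mask=5` mirrors the full-length and 5 nt settings in the caption; restricting the mismatch test to G → A and C → T changes would give the corresponding misincorporation rate.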
