Brief Bioinform. 2024 Nov 22;26(1):bbae646.
doi: 10.1093/bib/bbae646.

Filtering out the noise: metagenomic classifiers optimize ancient DNA mapping

Shyamsundar Ravishankar et al. Brief Bioinform.

Abstract

Contamination with exogenous DNA presents a significant challenge in ancient DNA (aDNA) studies of single organisms. Failure to address contamination from microbes, reagents, and present-day sources can impact the interpretation of results. Although field and laboratory protocols exist to limit contamination, there is still a need to accurately distinguish between endogenous and exogenous data computationally. Here, we propose a workflow to reduce exogenous contamination based on a metagenomic classifier. Unlike previous methods, which removed contaminating reads by relying exclusively on the mapping specificity of DNA sequencing reads to a single reference genome, our approach uses Kraken2-based filtering before mapping to the reference genome. Using both simulated and empirical shotgun aDNA data, we show that this workflow is a simple and efficient method that can be used in a wide range of computational environments, including personal machines. We propose strategies for building the specific databases used to profile sequencing data; these strategies take into account the available computational resources and prior knowledge about the target taxa and likely contaminants. Our workflow significantly reduces the overall computational resources required during the mapping process and reduces the total runtime by up to ~94%. The most significant impacts are observed in samples with low endogenous DNA content. Importantly, contaminants that would otherwise map to the reference are filtered out by our strategy, reducing false positive alignments. We also show that our method results in a negligible loss of endogenous data, with no measurable impact on downstream population genetics analyses.

Keywords: Kraken2; ancient DNA; contamination; filtering; metagenomic classifiers.

Figures

Graphical Abstract
Figure 1
(A) Workflow showing the types of simulated data, the different methods applied to the simulated data, and the metrics collected. The best-performing method is then applied to empirical data. (B) Ancient human reads classified at the order Primates or at a lower taxonomic rank are considered endogenous and are retained (red); likewise, ancient dog reads classified at the order Carnivora or lower are retained (purple). Unclassified reads (grey) are reads that could not be assigned a taxonomy.
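The retention rule in panel (B) can be illustrated with a minimal Python sketch. It assumes Kraken2's default per-read output (tab-separated: C/U status, read ID, numeric taxid, read length, k-mer mapping) and walks a child-to-parent taxonomy map to test whether a read sits at or below a chosen order. The names `TOY_PARENTS`, `in_clade`, and `label_read` are hypothetical, and the parent map is a toy with intermediate ranks omitted; it is not the authors' implementation.

```python
# Toy parent map (child taxid -> parent taxid). ASSUMPTION: intermediate
# ranks are skipped for brevity; a real run would load the full taxonomy
# that ships with the Kraken2 database.
TOY_PARENTS = {
    9606: 9443,    # Homo sapiens -> Primates (intermediate ranks skipped)
    9443: 1,       # Primates -> root
    9615: 9612,    # Canis lupus familiaris -> Canis lupus
    9612: 33554,   # Canis lupus -> Carnivora (intermediate ranks skipped)
    33554: 1,      # Carnivora -> root
    562: 2,        # Escherichia coli -> Bacteria (intermediate ranks skipped)
    2: 1,          # Bacteria -> root
}


def in_clade(taxid, clade_root, parents):
    """Return True if taxid equals clade_root or descends from it."""
    seen = set()
    while taxid not in seen:
        if taxid == clade_root:
            return True
        seen.add(taxid)
        parent = parents.get(taxid)
        if parent is None or parent == taxid:
            return False  # reached the root (or an unknown taxid)
        taxid = parent
    return False  # cycle guard for malformed maps


def label_read(kraken_line, clade_root, parents):
    """Label one Kraken2 per-read output line relative to a target clade."""
    fields = kraken_line.rstrip("\n").split("\t")
    status, read_id, taxid = fields[0], fields[1], int(fields[2])
    if status == "U":
        return read_id, "unclassified"
    if in_clade(taxid, clade_root, parents):
        return read_id, "endogenous"
    return read_id, "other"


# Hypothetical example lines for an ancient human sample (Primates = 9443).
example = [
    "C\tread1\t9606\t60\t9606:26",
    "C\tread2\t562\t55\t562:21",
    "U\tread3\t0\t48\t0:14",
]
labels = dict(label_read(line, 9443, TOY_PARENTS) for line in example)
```

Whether unclassified reads are passed on to the mapper alongside the endogenous ones is a separate choice left to the caller.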
Figure 2
Impact of database choice on Kraken2 classification of (A) 20 million ancient dog reads, and (B) 20 million ancient human reads. The x-axis represents the k-mer length of the databases (DB; see Table 1 for descriptions) represented in the facets. The y-axis shows the proportion of reads classified as a particular taxonomy (colours).
Figure 3
Precision and recall (A) and f-measure (B) of the six methods—bwa mapping to a single (colour: blue & shape: small circle) and composite dog and human reference (competitive mapping; colour: purple & shape: large circle), mapping only reads classified as Carnivora and unclassified reads by the ‘k2_custom’ database to a single (negative filtering; colour: light red & shape: triangle) and composite reference (negative filtering w/ competitive mapping; colour: dark red & shape: upside-down triangle), mapping only reads classified by ‘k2_canis_lupus_kmer29’ database to a single (positive filtering; colour: light green & shape: square) and composite reference (positive filtering w/ competitive mapping; colour: dark green & shape: diamond)—for the simulated ancient dog genome. Reads were filtered with MapQ >20 postmapping.
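For readers unfamiliar with the metrics in this caption, the sketch below computes precision, recall, and f-measure from read IDs using the standard definitions. The exact definitions used in the paper are not given in this excerpt, so this is an assumption, and the function name and read IDs are hypothetical. In a simulation the true origin of every read is known, which is what makes the comparison possible.

```python
def precision_recall_f(mapped_ids, endogenous_ids):
    """Score a set of retained (mapped) reads against the known endogenous set.

    ASSUMPTION: standard definitions, with a true positive being an
    endogenous read that survives mapping and filtering.
    """
    mapped = set(mapped_ids)
    truth = set(endogenous_ids)
    tp = len(mapped & truth)   # endogenous reads that were retained
    fp = len(mapped - truth)   # contaminant reads that were retained
    fn = len(truth - mapped)   # endogenous reads that were lost
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical example: four simulated dog reads, one human contaminant maps.
truth = ["dog1", "dog2", "dog3", "dog4"]
mapped = ["dog1", "dog2", "dog3", "human1"]
precision, recall, f_measure = precision_recall_f(mapped, truth)
```

In this toy case one contaminant is retained and one endogenous read is lost, so precision and recall are both penalized by a single read.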
Figure 4
(A) Kraken2 + bwa aln filtering and mapping speed, normalized to bwa aln mapping only speed; (B) coverage difference between mapping only and filtering before mapping; and (C) observed f4 statistics of the configuration f4(SampleFilt (blue), SampleBWA (brown); Basenji01, Coyote01California) from pseudo-haploid genotypes. Multiple points indicate replicate pseudo-haploid calls to account for variability introduced by random pseudo-haploidization. The points are coloured by |Z| score. |Z| values below 3 are on a green-to-yellow gradient. |Z| values above 3 are denoted with red. Filtering using only the reference genome (left panel) led to samples CANIS-ALAS-016 and AL2744 being significantly biased (Z > 3) towards the reference. Adding Canid variation from the 722 g project (right panel) shows a nonsignificant deviation (|Z| < 3) from 0 for all samples.
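As a rough illustration of the test in panel (C), the sketch below computes f4(A, B; C, D) as the mean over sites of (a - b)(c - d) and derives a Z score from a delete-one-block jackknife. It assumes allele values coded as 0 or 1 per site (as for pseudo-haploid calls), no missing data, and equal-sized blocks of consecutive sites. Dedicated population genetics tools typically use genomic-distance blocks and weighting, so this is a simplified stand-in, and the function names are hypothetical.

```python
from math import sqrt


def f4_products(a, b, c, d):
    """Per-site products (a - b) * (c - d) for four aligned allele vectors."""
    return [(ai - bi) * (ci - di) for ai, bi, ci, di in zip(a, b, c, d)]


def f4_with_z(a, b, c, d, n_blocks):
    """Return (f4 estimate, jackknife standard error, Z score).

    ASSUMPTION: sites are split into n_blocks equal-sized consecutive
    blocks; any leftover sites at the end are dropped.
    """
    products = f4_products(a, b, c, d)
    size = len(products) // n_blocks
    if n_blocks < 2 or size == 0:
        raise ValueError("need at least 2 blocks with at least 1 site each")
    products = products[: size * n_blocks]
    n = len(products)
    total = sum(products)
    estimate = total / n

    # Leave-one-block-out estimates.
    loo = []
    for j in range(n_blocks):
        block_sum = sum(products[j * size:(j + 1) * size])
        loo.append((total - block_sum) / (n - size))

    mean_loo = sum(loo) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((x - mean_loo) ** 2 for x in loo)
    se = sqrt(variance)
    z = estimate / se if se > 0 else float("nan")
    return estimate, se, z


# Hypothetical toy data: A and B stand for the filtered and unfiltered
# versions of one sample, C and D for the two comparison genomes.
a = [1, 0, 0, 0]
b = [0, 0, 0, 0]
c = [1, 0, 0, 0]
d = [0, 0, 0, 0]
estimate, se, z = f4_with_z(a, b, c, d, n_blocks=2)
```

If filtering introduces no bias, A and B carry the same alleles and the statistic is 0. The caption treats |Z| below 3 as a nonsignificant deviation from 0.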
