Bioinformatics. 2024 Jul 1;40(7):btae436.
doi: 10.1093/bioinformatics/btae436.

Unravelling reference bias in ancient DNA datasets

Stephanie Dolenz et al. Bioinformatics. 2024.

Abstract

Motivation: The alignment of sequencing reads is a critical step in the characterization of ancient genomes. However, reference bias and spurious mappings pose a significant challenge, particularly as cutting-edge wet lab methods generate datasets that push the boundaries of alignment tools. Reference bias occurs when reference alleles are favoured over alternative alleles during mapping, whereas spurious mappings stem from either contamination or when endogenous reads fail to align to their correct position. Previous work has shown that these phenomena are correlated with read length but a more thorough investigation of reference bias and spurious mappings for ancient DNA has been lacking. Here, we use a range of empirical and simulated palaeogenomic datasets to investigate the impacts of mapping tools, quality thresholds, and reference genome on mismatch rates across read lengths.
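As a rough illustration of the definition above, reference bias can be summarised as the fraction of aligned reads carrying the reference allele at sites known to be heterozygous, where roughly 0.5 would be expected without bias. The sketch below is not taken from the paper; the function name `reference_allele_fraction` and the toy read counts are hypothetical.

```python
from typing import Iterable, Tuple


def reference_allele_fraction(site_counts: Iterable[Tuple[int, int]]) -> float:
    """Pooled fraction of reads supporting the reference allele.

    Each item is (ref_reads, alt_reads) at one known heterozygous site.
    A value well above 0.5 is consistent with reference bias.
    """
    ref_total = 0
    alt_total = 0
    for ref_reads, alt_reads in site_counts:
        ref_total += ref_reads
        alt_total += alt_reads
    covered = ref_total + alt_total
    if covered == 0:
        raise ValueError("no reads cover the supplied sites")
    return ref_total / covered


# Hypothetical counts at three heterozygous sites (illustration only).
toy_counts = [(6, 4), (5, 5), (7, 3)]
fraction = reference_allele_fraction(toy_counts)
print(f"reference allele fraction: {fraction:.2f}")
```

Pooling counts across sites, rather than averaging per-site fractions, keeps low-coverage sites from dominating the estimate; either choice is an assumption of this sketch.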

Results: For these analyses, we introduce AMBER, a new bioinformatics tool for assessing the quality of ancient DNA mapping directly from BAM-files and informing on reference bias, read length cut-offs and reference selection. AMBER rapidly and simultaneously computes the sequence read mapping bias in the form of the mismatch rates per read length, cytosine deamination profiles at both CpG and non-CpG sites, fragment length distributions, and genomic breadth and depth of coverage. Using AMBER, we find that mapping algorithms and quality threshold choices dictate reference bias and rates of spurious alignment at different read lengths in a predictable manner, suggesting that optimized mapping parameters for each read length will be a key step in alleviating reference bias and spurious mappings.
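The central statistic described above, the mismatch rate per read length at a given mapping quality threshold, can be sketched in a few lines. This is a minimal illustration and not AMBER's implementation: it assumes that read length, mismatch count, and mapping quality have already been extracted from a BAM file, and the function name `mismatch_rate_by_length` and the toy records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (read_length, mismatches_to_reference, mapping_quality); assumed to be
# extracted from aligned reads beforehand by a BAM parser of your choice.
Read = Tuple[int, int, int]


def mismatch_rate_by_length(reads: Iterable[Read], min_mapq: int = 1) -> Dict[int, float]:
    """Mismatches per aligned base for each read length, after MQ filtering."""
    mismatches = defaultdict(int)
    bases = defaultdict(int)
    for length, n_mismatch, mapq in reads:
        if mapq < min_mapq or length <= 0:
            continue  # drop reads below the mapping quality threshold
        mismatches[length] += n_mismatch
        bases[length] += length
    return {length: mismatches[length] / bases[length] for length in sorted(bases)}


# Hypothetical reads (illustration only).
toy_reads = [(30, 3, 37), (30, 0, 37), (30, 5, 0), (60, 1, 37), (60, 2, 20)]
rates = mismatch_rate_by_length(toy_reads, min_mapq=1)
for length, rate in rates.items():
    print(length, round(rate, 4))
```

Calling the function again with `min_mapq=30` on the same reads shows how a stricter threshold changes the per-length rates, which is the kind of comparison the abstract describes across MQ ≥1, ≥20, ≥25, and ≥30.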

Availability and implementation: AMBER is available for noncommercial use on GitHub (https://github.com/tvandervalk/AMBER.git). Scripts used to generate and analyse simulated datasets are available on GitHub (https://github.com/sdolenz/refbias_scripts).

Conflict of interest statement

None declared.

Figures

Figure 1.
Assessing minimum read length thresholds in ancient genomic datasets aligned using BWA-aln. (a) AMBER plots at MQ ≥1 for four empirical datasets (steppe mammoth, maize, human, and horse) and a simulated dataset (10% endogenous elephant with 1% sequence divergence) showing a secondary peak at 20–30 bp; (b) mismatch and fragment length distribution plots for a simulated dataset of 10% endogenous elephant with 1% sequence divergence at four MQ thresholds; (c) the proportions of simulated endogenous elephant with 1% sequence divergence, human, and bacterial data aligned using BWA-aln to the Asian elephant genome for two endogenous DNA contents: 10% (MQ ≥1, ≥30) and 1% (MQ ≥1). Aligned ultrashort reads (≤35 bp) are dominated by bacteria, with this trend enhanced at a lower endogenous DNA content and only marginally reduced at MQ ≥30. For all comparisons, see Supplementary Data S3. In (c), zero values are not plotted.
Figure 2.
The impact of three mapping algorithms (BWA-aln, Bowtie2, BWA-mem) on ancient genomic datasets. (a) AMBER plots for the American mastodon empirical dataset with MQ ≥1; (b) mismatch and fragment length distribution plots for a simulated dataset of 100% endogenous elephant with 2% sequence divergence at MQ ≥1; (c) the counts of mismapped reads for each read length for the three aligners at MQ ≥1, ≥20, ≥25, or ≥30. For all comparisons, see Supplementary Data S4. Bowtie2 exhibits the greatest reference bias for read lengths typical of ancient DNA (30–80 bp), whereas BWA-aln shows reference bias for read lengths >120 bp. BWA-mem does not exhibit this latter bias, but maximizes reference bias for alignments ≤40 bp. There were 100 000 available reads per length bin in all simulated datasets.
Figure 3.
The impact of filtering using different MQ thresholds (≥1, ≥20, ≥25, or ≥30) on ancient genomic datasets. (a) AMBER plots for the steppe mammoth empirical dataset mapped with Bowtie2; (b) mismatch and fragment length distribution plots for a simulated dataset of 100% endogenous Asian elephant reads with 1% sequence divergence mapped with Bowtie2 at varying MQ thresholds; (c) Bowtie2, BWA-aln and BWA-mem, and using MQ ≥25. Higher MQ score thresholds differentially impact the various aligners, with the greatest impact on Bowtie2-mapped data. There were 100 000 available reads per length bin in all simulated datasets. The vertical dashed lines on panel (a) indicate the average depth of coverage achieved for each MQ threshold considered
Figure 4.
The impact of differing sample-reference edit distances (sequence divergence) on ancient genomic datasets. (a) AMBER plots for a non-USER-treated Siberian unicorn, a USER-treated American mastodon, and non-USER-treated maize and human mapped with BWA-aln and MQ ≥1; (b) mismatch and fragment length distribution plots for simulated datasets of 100% endogenous elephant reads with 1%–6% sequence divergence mapped with BWA-aln and MQ ≥1. Datasets with higher sequence divergence, especially ≥3%, are greatly impacted by reference bias. Comparisons of up to 15% sequence divergence and a USER/non-USER comparison can be found in Supplementary Texts S5 and S6. There were 100 000 available reads per length bin in all simulated datasets. The vertical dashed lines on panel (a) indicate the average depth of coverage achieved in each species.
