Bioinformatics. 2024 Jul 1;40(7):btae436.

doi: 10.1093/bioinformatics/btae436.

Unravelling reference bias in ancient DNA datasets

Stephanie Dolenz^{1,2}, Tom van der Valk^{1,3,4}, Chenyu Jin^{1,3,5}, Jonas Oppenheimer⁶, Muhammad Bilal Sharif^{1,5}, Ludovic Orlando⁷, Beth Shapiro^{8,9}, Love Dalén^{1,3,5}, Peter D Heintzman^{1,2}

Affiliations

¹ Centre for Palaeogenetics, Svante Arrhenius väg 20C, Stockholm, SE-106 91, Sweden.
² Department of Geological Sciences, Stockholm University, Stockholm, SE-106 91, Sweden.
³ Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, SE-114 18, Sweden.
⁴ Science for Life Laboratory, Stockholm, SE-171 65, Sweden.
⁵ Department of Zoology, Stockholm University, Stockholm, SE-106 91, Sweden.
⁶ Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, 95064, United States.
⁷ Centre for Anthropobiology and Genomics of Toulouse (CAGT, CNRS UMR5288), University Paul Sabatier, Faculté de Santé, Toulouse, 31000, France.
⁸ Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, 95064, United States.
⁹ Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA, 95064, United States.

PMID: 38960861
PMCID: PMC11254355
DOI: 10.1093/bioinformatics/btae436

Stephanie Dolenz et al. Bioinformatics. 2024.

Abstract

Motivation: The alignment of sequencing reads is a critical step in the characterization of ancient genomes. However, reference bias and spurious mappings pose a significant challenge, particularly as cutting-edge wet lab methods generate datasets that push the boundaries of alignment tools. Reference bias occurs when reference alleles are favoured over alternative alleles during mapping, whereas spurious mappings stem from either contamination or when endogenous reads fail to align to their correct position. Previous work has shown that these phenomena are correlated with read length but a more thorough investigation of reference bias and spurious mappings for ancient DNA has been lacking. Here, we use a range of empirical and simulated palaeogenomic datasets to investigate the impacts of mapping tools, quality thresholds, and reference genome on mismatch rates across read lengths.
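One common way to make the definition of reference bias concrete is to look at sites known to be heterozygous, where roughly half of the aligned reads should carry each allele if mapping were unbiased. The sketch below pools read counts over such sites and reports the reference allele fraction. The function name and the counts are hypothetical and purely illustrative; this is not a measure taken from the paper.

```python
from typing import Iterable, Tuple


def reference_allele_fraction(site_counts: Iterable[Tuple[int, int]]) -> float:
    """Pooled fraction of aligned reads carrying the reference allele.

    Each item is (reads_with_reference_allele, reads_with_alternative_allele)
    at one heterozygous site. A value clearly above 0.5 is consistent with
    reference alleles being favoured during mapping.
    """
    ref_total = 0
    alt_total = 0
    for ref_reads, alt_reads in site_counts:
        ref_total += ref_reads
        alt_total += alt_reads
    total = ref_total + alt_total
    if total == 0:
        raise ValueError("no reads observed at the supplied sites")
    return ref_total / total


# Hypothetical counts at three heterozygous sites (illustrative only).
counts = [(6, 4), (7, 3), (5, 5)]
fraction = reference_allele_fraction(counts)
print(f"reference allele fraction: {fraction:.2f}")
```

In this toy input, 18 of 30 reads carry the reference allele, so the pooled fraction sits above the 0.5 expected without bias.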

Results: For these analyses, we introduce AMBER, a new bioinformatics tool for assessing the quality of ancient DNA mapping directly from BAM-files and informing on reference bias, read length cut-offs and reference selection. AMBER rapidly and simultaneously computes the sequence read mapping bias in the form of the mismatch rates per read length, cytosine deamination profiles at both CpG and non-CpG sites, fragment length distributions, and genomic breadth and depth of coverage. Using AMBER, we find that mapping algorithms and quality threshold choices dictate reference bias and rates of spurious alignment at different read lengths in a predictable manner, suggesting that optimized mapping parameters for each read length will be a key step in alleviating reference bias and spurious mappings.
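The idea of a mismatch rate per read length can be sketched in a few lines. The example below is not AMBER's implementation; it assumes alignments are available as SAM-formatted text lines and uses the `NM` tag (edit distance to the reference, which also counts indels) as a stand-in for mismatches. The helper `make_record` and the records it builds are hypothetical test data.

```python
from collections import defaultdict
from typing import Dict, Iterable


def mismatch_rate_by_length(sam_lines: Iterable[str], min_mapq: int = 1) -> Dict[int, float]:
    """Return {read_length: NM edit distance per aligned base} for mapped reads."""
    mismatches = defaultdict(int)
    bases = defaultdict(int)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag, mapq, seq = int(fields[1]), int(fields[4]), fields[9]
        # Skip unmapped reads (FLAG bit 0x4), low MAPQ, and missing sequences.
        if flag & 0x4 or mapq < min_mapq or seq == "*":
            continue
        nm = None
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])
                break
        if nm is None:  # no edit distance recorded for this read
            continue
        mismatches[len(seq)] += nm
        bases[len(seq)] += len(seq)
    return {length: mismatches[length] / bases[length] for length in sorted(bases)}


def make_record(name: str, flag: int, mapq: int, length: int, nm: int) -> str:
    """Build a minimal, hypothetical SAM record for illustration."""
    return "\t".join([name, str(flag), "chr1", "100", str(mapq), f"{length}M",
                      "*", "0", "0", "A" * length, "*", f"NM:i:{nm}"])


records = [
    "@HD\tVN:1.6",
    make_record("r1", 0, 37, 30, 3),
    make_record("r2", 0, 37, 30, 0),
    make_record("r3", 0, 30, 60, 1),
    make_record("r4", 4, 0, 30, 0),   # unmapped, ignored
    make_record("r5", 0, 0, 30, 5),   # MAPQ 0, ignored when min_mapq=1
]
rates = mismatch_rate_by_length(records, min_mapq=1)
for length, rate in rates.items():
    print(length, round(rate, 4))
```

Raising `min_mapq` mirrors the mapping quality thresholds compared in the study; rerunning the function at several thresholds shows how the per-length rate shifts as low-confidence alignments are excluded.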

Availability and implementation: AMBER is available for noncommercial use on GitHub (https://github.com/tvandervalk/AMBER.git). Scripts used to generate and analyse simulated datasets are available on GitHub (https://github.com/sdolenz/refbias_scripts).

Conflict of interest statement

None declared.

Figures

**Figure 1.**
Assessing minimum read length thresholds in ancient genomic datasets aligned using *BWA-aln*. (a) AMBER plots at MQ ≥1 for four empirical datasets (steppe mammoth, maize, human, and horse) and a simulated dataset (10% endogenous elephant with 1% sequence divergence) showing a secondary peak at 20–30 bp; (b) mismatch and fragment length distribution plots for a simulated dataset of 10% endogenous elephant with 1% sequence divergence at four MQ thresholds; (c) the proportions of simulated endogenous elephant with 1% sequence divergence, human, and bacterial data aligned using *BWA-aln* to the Asian elephant genome for two endogenous DNA contents: 10% (MQ ≥1, ≥30) and 1% (MQ ≥1). Aligned ultrashort reads (≤35 bp) are dominated by bacteria, with this trend enhanced at a lower endogenous DNA content and only marginally reduced at MQ ≥30. For all comparisons, see Supplementary Data S3. In (c), zero values are not plotted.

**Figure 2.**
The impact of three mapping algorithms (*BWA-aln*, *Bowtie2*, *BWA-mem*) on ancient genomic datasets. (a) AMBER plots for the American mastodon empirical dataset with MQ ≥1; (b) mismatch and fragment length distribution plots for a simulated dataset of 100% endogenous elephant with 2% sequence divergence at MQ ≥1; (c) the counts of mismapped reads for each read length for the three aligners at MQ ≥1, ≥20, ≥25, or ≥30. For all comparisons, see Supplementary Data S4. *Bowtie2* exhibits the greatest reference bias for read lengths typical of ancient DNA (30–80 bp), whereas *BWA-aln* shows reference bias for read lengths >120 bp. *BWA-mem* does not exhibit this latter bias, but maximizes reference bias for alignments ≤40 bp. There were 100 000 available reads per length bin in all simulated datasets.

**Figure 3.**
The impact of filtering using different MQ thresholds (≥1, ≥20, ≥25, or ≥30) on ancient genomic datasets. (a) AMBER plots for the steppe mammoth empirical dataset mapped with *Bowtie2*; (b) mismatch and fragment length distribution plots for a simulated dataset of 100% endogenous Asian elephant reads with 1% sequence divergence mapped with *Bowtie2* at varying MQ thresholds; (c) *Bowtie2*, *BWA-aln* and *BWA-mem*, and using MQ ≥25. Higher MQ score thresholds differentially impact the various aligners, with the greatest impact on *Bowtie2*-mapped data. There were 100 000 available reads per length bin in all simulated datasets. The vertical dashed lines on panel (a) indicate the average depth of coverage achieved for each MQ threshold considered.

**Figure 4.**
The impact of ancient genomic datasets with differing sample-reference edit distances (sequence divergence). (a) AMBER plots for a non-USER-treated Siberian unicorn, a USER-treated American mastodon, and non-USER-treated maize and human mapped with *BWA-aln* and MQ ≥1; (b) mismatch and fragment length distribution plots for simulated datasets of 100% endogenous elephant reads with 1%–6% sequence divergence mapped with *BWA-aln* and MQ ≥1. Higher sequence divergences, especially those ≥3%, are greatly impacted by reference bias. Comparisons of up to 15% sequence divergence and a USER/non-USER comparison can be found in Supplementary Texts S5 and S6. There were 100 000 available reads per length bin in all simulated datasets. The vertical dashed lines on panel (a) indicate the average depth of coverage achieved in each species.

Grants and funding

2021.0048/Knut and Alice Wallenberg Foundation
