Genome Biol. 2025 Oct 27;26(1):372.
doi: 10.1186/s13059-025-03839-5.

CarpeDeam: a de novo metagenome assembler for heavily damaged ancient datasets

Louis Kraft et al. Genome Biol.

Abstract

De novo assembly of ancient metagenomic datasets is a challenging task. Ultra-short fragment size and characteristic postmortem damage patterns of sequenced ancient DNA molecules leave current tools ill-equipped for ideal assembly. We present CarpeDeam, a novel damage-aware de novo assembler designed specifically for ancient metagenomic samples. Utilizing maximum-likelihood frameworks that integrate sample-specific damage patterns, CarpeDeam demonstrates improved recovery of longer continuous sequences and protein sequences in many simulated and empirical datasets compared to existing assemblers. As a pioneering ancient metagenome assembler, CarpeDeam opens the door for new opportunities in functional and taxonomic analyses of ancient microbial communities.

Keywords: Ancient DNA; De novo assembly; Metagenomics; Microbes; Proteins.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not relevant for the current study. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
A Impact of aDNA damage and fragment length on metagenomic assembly. The plots show the summed length of all contigs longer than 500 bp after assembling a simulated toy dataset with either MEGAHIT [58] or metaSPAdes [59]. Each bar plot corresponds to a different combination of fragment length and damage rate in the simulated data. Although empirical data are inherently more complex in their variation in fragment lengths and damage patterns, this simplified dataset demonstrates how common assemblers are limited by aDNA damage. B Damage patterns used for the simulation of aDNA fragments. Two levels of deamination rates were used for the simulations: moderate damage and mild damage, relative to the rates found in ancient microbial studies [, , , –57]. The blue traces represent C-to-T substitution rates, and the red traces represent G-to-A substitution rates.
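The damage patterns in panel B can be read as per-position substitution probabilities. The following is a minimal sketch of how such a profile could be applied to a fragment, not the simulator used in the study: the function name `apply_deamination` and the example rates are hypothetical, and the only assumption taken from the caption is that damage consists of C-to-T and G-to-A substitutions.

```python
import random

def apply_deamination(fragment, c_to_t, g_to_a, rng):
    """Apply position-specific C-to-T and G-to-A substitutions to one fragment.

    c_to_t[i] and g_to_a[i] are the (hypothetical) substitution probabilities
    at position i; rng is a random.Random instance so runs are reproducible.
    """
    if not (len(fragment) == len(c_to_t) == len(g_to_a)):
        raise ValueError("fragment and rate profiles must have equal length")
    damaged = []
    for base, ct, ga in zip(fragment.upper(), c_to_t, g_to_a):
        if base == "C" and rng.random() < ct:
            damaged.append("T")   # deaminated cytosine is read as thymine
        elif base == "G" and rng.random() < ga:
            damaged.append("A")   # complementary-strand damage is read as G-to-A
        else:
            damaged.append(base)
    return "".join(damaged)

# Illustrative profile: certain C-to-T damage at the first position,
# certain G-to-A damage at the last position, none in between.
print(apply_deamination("CCGG", [1.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 0.0, 1.0], random.Random(0)))  # TCGA
```

With rates of exactly 0 or 1 the outcome is deterministic; intermediate rates yield a different damaged copy per draw, which is why a seeded generator is passed in.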
Fig. 2
CarpeDeam’s main workflow. The input consists of aDNA sequences (FASTQ format) that have been trimmed and, for paired-end data, merged. In an iterative process, the fragments are corrected and extended into long contigs. In PHASE 1, the fragments are grouped into clusters whose members share at least one k-mer and have an overlap sequence identity of 99% in RYmer space. PHASE 2 corrects deaminated bases: the center sequence of each cluster (always the longest sequence) is assigned the most likely base at each position, given the evidence from overlapping sequences in the cluster and the user-provided damage patterns. In PHASE 3, the center sequence of each cluster is extended by the candidate sequence from the cluster that is most likely to be the correct extension. PHASE 3 is divided into two steps. In the first step, only aDNA fragments (non-extended sequences) are considered for extension, because the provided damage patterns are valid only for non-extended sequences. In the second step, only contigs (sequences that have already been extended at least once) are used for extension, applying a modified Bayesian extension model from the native PenguiN assembler.
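Two ideas in this workflow lend themselves to a small illustration. In RY space, purines (A, G) collapse to R and pyrimidines (C, T) to Y, so C-to-T and G-to-A deamination no longer count as mismatches. The per-position correction can then be viewed as choosing the base that maximizes the likelihood of the observed column. The sketch below is an assumption-laden toy, not CarpeDeam's implementation: the function names, the uniform sequencing-error term `error`, and the simple "damage, then error" model are all hypothetical.

```python
import math

# Collapse purines (A, G) to R and pyrimidines (C, T) to Y.
_RY_TABLE = str.maketrans("ACGT", "RYRY")

def to_ry(seq):
    return seq.upper().translate(_RY_TABLE)

def ry_identity(a, b):
    """Fraction of matching positions of two aligned, equal-length sequences in RY space."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(to_ry(a), to_ry(b))) / len(a)

def _obs_prob(obs, true, ct, ga, error):
    """P(observed base | true base) under damage followed by uniform error."""
    if true == "C":
        damaged = {"C": 1.0 - ct, "T": ct}
    elif true == "G":
        damaged = {"G": 1.0 - ga, "A": ga}
    else:
        damaged = {true: 1.0}
    return sum(p * ((1.0 - error) if obs == d else error / 3.0)
               for d, p in damaged.items())

def most_likely_base(column, error=0.01):
    """Maximum-likelihood base for one alignment column.

    column: list of (observed_base, c_to_t_rate, g_to_a_rate), where the rates
    are the damage rates at the observed base's position in its own fragment.
    """
    if not column or not 0.0 < error < 1.0:
        raise ValueError("need a non-empty column and 0 < error < 1")
    best, best_ll = None, -math.inf
    for true in "ACGT":
        ll = sum(math.log(_obs_prob(obs, true, ct, ga, error))
                 for obs, ct, ga in column)
        if ll > best_ll:
            best, best_ll = true, ll
    return best

print(ry_identity("ACGT", "ATGT"))  # 1.0: the C/T difference vanishes in RY space
print(most_likely_base([("T", 0.3, 0.0), ("C", 0.0, 0.0), ("C", 0.0, 0.0)]))  # C
```

In the second call, a T observed at a position with a high C-to-T rate is outvoted by two undamaged C observations, so the column is called as C rather than T.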
Fig. 3
Performance evaluation of the assemblers CarpeDeam (safe and unsafe modes), MEGAHIT, metaSPAdes, and PenguiN across nine simulated datasets. Results are presented for datasets in the moderate-damage, short-fragment-length category, simulated for three environments (gut, dental calculus, and bone) and three coverage levels (3×, 5×, and 10×). The metrics shown are largest alignment (row 1), misassemblies per contig (row 2), genome fraction (row 3), and NA50 (row 4). Each bar represents the performance of one assembler for a specific metric, coverage level, and environment.
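For readers unfamiliar with the NA50 metric in row 4: it is the N50 statistic computed over aligned blocks (contigs broken at misassembly breakpoints) rather than over whole contigs. The following minimal sketch shows the underlying N50 calculation; the function name `n50` is illustrative, and the input is assumed to be a list of block lengths that has already been produced by an evaluation tool.

```python
def n50(lengths):
    """Return the length L such that blocks of length >= L cover at least
    half of the total length. Returns 0 for an empty input."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Blocks of 6 and 5 bp together cover 11 of 20 bp, so the N50 is 5.
print(n50([2, 3, 4, 5, 6]))  # 5
```

Passing aligned-block lengths instead of raw contig lengths gives the NA50 reading of the same statistic.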
Fig. 4
Analysis of mapped fractions of base pairs, genomic features, and RNA recovery across different assemblers and coverage levels for the gut dataset. A Distribution of mapping categories (mapped non-duplicated, mapped duplicated variant, mapped duplicated redundant representative, mapped duplicated redundant, unmapped non-duplicated, and unmapped duplicated base pairs) for assemblies of the gut dataset at 10×, 5×, and 3× coverage levels. B Types of genomic features recovered in contigs based on Prokka annotations, for different assemblers (gut dataset). C Recovery of rRNAs and tRNAs with sequence identities above and below 98% for MEGAHIT and CarpeDeam (safe and unsafe modes) across different coverage levels (gut dataset; moderate damage, short fragment length distribution)
Fig. 5
Evaluation of predicted protein sequences. Results are presented for CarpeDeam (safe and unsafe modes), MEGAHIT, metaSPAdes, and PenguiN across datasets with moderate damage and short fragment lengths in three simulated environments (bone, dental calculus, and gut) at varying coverage levels (3×, 5×, and 10×). The figure shows the number of predicted ORFs with significant similarity to reference proteins, filtered using thresholds of 90% coverage of the reference protein and 90% sequence similarity.
Fig. 6
A Heatmap of unique UniRef100 protein hits for ORFs predicted from contigs assembled by CarpeDeam, MEGAHIT, PenguiN, and metaSPAdes across empirical samples (grouped by sample site). Hits were filtered using thresholds of e-12 for E-value, 35% for identity, and 100 residues for alignment length. B Venn diagrams of species-level taxonomic assignments from translated contigs queried against the Genome Taxonomy Database for the datasets GDN001.A0101 and EMN001.A0101. C Recovered genome fraction for highly damaged taxa, as reported by metaQUAST.
Fig. 7
A Detection of 16S rRNA genes. Shown are the numbers of unique hits in the SILVA database from assembled contigs that were annotated with Prokka and filtered by sequence identity thresholds (90% to 100%) and a minimum coverage of 80%. B Identification of BGC protoclusters in the OAK003 sample using antiSMASH.

References

    1. Margaryan A, Lawson DJ, Sikora M, Racimo F, Rasmussen S, Moltke I, et al. Population genomics of the Viking world. Nature. 2020;585(7825):390–6. 10.1038/s41586-020-2688-8.
    2. Kjær KH, Winther Pedersen M, De Sanctis B, De Cahsan B, Korneliussen TS, Michelsen CS, et al. A 2-million-year-old ecosystem in Greenland uncovered by environmental DNA. Nature. 2022;612(7939):283–91. 10.1038/s41586-022-05453-y.
    3. Fernandez-Guerra A, Borrel G, Delmont TO, Elberling B, Eren AM, Gribaldo S, et al. A 2-million-year-old microbial and viral communities from the Kap København Formation in North Greenland. Cold Spring Harbor Laboratory; 2023. 10.1101/2023.06.10.544454.
    4. Vogel NA, Rubin JD, Swartz M, Vlieghe J, Sackett PW, Pedersen AG, et al. Euka: robust tetrapodic and arthropodic taxa detection from modern and ancient environmental DNA using pangenomic reference graphs. Methods Ecol Evol. 2023;14(11):2717–27. 10.1111/2041-210x.14214.
    5. Prüfer K, Stenzel U, Hofreiter M, Pääbo S, Kelso J, Green RE. Computational challenges in the analysis of ancient DNA. Genome Biol. 2010;11(5):R47. 10.1186/gb-2010-11-5-r47.
