[Preprint]. 2024 Jan 30:2024.01.25.577285.
doi: 10.1101/2024.01.25.577285.

Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs

Kristen D Curry et al. bioRxiv.

Abstract

Bacterial genome dynamics are vital for understanding the mechanisms underlying microbial adaptation, growth, and their broader impact on host phenotype. Structural variants (SVs), genomic alterations of 10 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) and instead uses a single metagenome coassembly graph constructed from all samples in a series. The log fold change in graph coverage between subsequent samples is then calculated to call SVs that are thriving or declining throughout the series. We show that rhea outperforms existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes; the advantage is particularly noticeable as the simulated reads diverge from reference genomes and strain diversity increases. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between subsequent time and temperature samples, suggesting host advantage. Our approach leverages raw read patterns rather than references or MAGs to include all sequencing reads in the analysis, and thus provides versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial genome dynamics.

Keywords: gene transfer; long-read sequencing; metagenome; microbiome; structural variants.

Conflict of interest statement

Competing interests: No competing interest is declared.

Figures

Fig. 1.
(a) To use rhea, microbiome series data must first be collected and long whole-genome sequencing reads generated. Then, within rhea, a coassembly graph of all reads in the series is created with metaFlye. Reads from each sample are then separately aligned to the coassembly graph with minigraph. Rhea evaluates the log fold change in coverage between series steps for SV-specific patterns in the assembly graph to detect structural variants between steps. (b) Assembly graph patterns detected by rhea, which indicate potential insertions, deletions, complex indels, and tandem duplicates. Insertions and deletions are detected by observing a triangle where one node has a significantly higher (insertion) or lower (deletion) log fold change. Complex indels are noted by a square with one or two outliers; in the case of two outliers, the two outliers must be on opposing sides of the median and not have an edge between them. Tandem duplicates are detected when the log fold change in coverage of a self-loop edge is greater than 1.
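The tandem-duplicate rule in panel (b) can be illustrated with a minimal sketch. This is not rhea's implementation: the log base (2 here), the pseudocount, and the names `log_fold_change` and `call_tandem_duplicates` are assumptions made for illustration, and the coverage values are invented.

```python
import math


def log_fold_change(cov_before, cov_after, pseudocount=1.0):
    """Log2 ratio of coverage at the later step over the earlier step.

    The base-2 logarithm and the pseudocount (which avoids log(0)) are
    assumptions; the caption does not specify either.
    """
    return math.log2((cov_after + pseudocount) / (cov_before + pseudocount))


def call_tandem_duplicates(self_loop_cov, threshold=1.0):
    """Return (node, log fold change) for self-loop edges above the threshold.

    self_loop_cov maps a hypothetical node ID to a pair:
    (self-loop edge coverage at step t, coverage at step t+1).
    """
    calls = []
    for node, (before, after) in self_loop_cov.items():
        lfc = log_fold_change(before, after)
        if lfc > threshold:
            calls.append((node, lfc))
    return calls


# Invented coverages for two self-loop edges between two series steps.
example = {"n1": (3.0, 15.0), "n2": (10.0, 12.0)}
calls = call_tandem_duplicates(example)
print(calls)  # only n1 exceeds the threshold of 1
```

The triangle and square patterns would additionally need a definition of "significantly higher or lower", which the caption does not give, so they are left out of this sketch.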
Fig. 2.
(a) HGT simulation process completed in the HgtSIM publication (35). One gene is randomly selected from each of the 10 Alphaproteobacteria species, mutated with rate m, then inserted into each Betaproteobacteria. Mutation rates m = 0 and m = 30 are included in this study. (b) Simulated relative abundances for time points T0 and T1. T0 is a simulation of the 20 reference genomes in equal abundance; T1 is simulated from the 10 original Alphaproteobacteria species and the 10 mutated Betaproteobacteria species in varying abundances. (c) Precision, recall, and F1-score for insertions detected by MetaCHIP (36) and rhea for the mock community with mutation rates 0 and 30. Time point T1 is used for MetaCHIP results; the change from T0 to T1 is used for rhea.
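For reference, the three metrics in panel (c) follow the standard definitions, sketched below. The function name `precision_recall_f1` and the counts are illustrative assumptions; how a called insertion is matched to a simulated one is not described on this page and is not modeled here.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts, not results from the preprint.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```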
Fig. 3.
(a) Relative abundance of long reads for two simulated time points (T0, T1) for our ZymoBIOMICS community. Each of the 10 microbes was randomly given 20 indels, 10 tandem duplications, and 10 long complex indels to create a variant strain (15). T0 contains only the original references (R); T1 introduces the variants (V), where half the species have variants in equal abundance to their original reference [Escherichia coli (EC), Lactobacillus fermentum (LF), Pseudomonas aeruginosa (PA), Salmonella enterica (SE), Cryptococcus neoformans (CN)], and half the species are dominated by their variants [Bacillus subtilis (BS), Enterococcus faecalis (EF), Listeria monocytogenes (LM), Staphylococcus aureus (SA), Saccharomyces cerevisiae (SC)]. (b) Complete recall, precision, and F1-score for each of the SV types (Ins: insertion, Del: deletion, CI: complex indel, TD: tandem duplication) for both workflows (bar plots) and recall on a subset of 5 species (table). For the MAG workflow, MAGs were curated for T0 and T1 separately. Then, Mum & Co called SVs between T0 and T1 MAGs of matching taxonomic classification. The 5 species selected for the table are the 5 species with a classified MAG at both time points. The top portion (BS, SA) shows the species where the variant dominates in T1, whereas the bottom portion (EC, PA, SE) shows the species where both the variant and the original reference are present in T1. The better recall is in bold for each comparison.
Fig. 4.
(a) SV counts detected by rhea for pairs of subsequent samples throughout cheese ripening (C1–4), for the entire community and for the extracted Halomonas subgraph only. (b) Previously established MGE contigs for 3 selected time points, described as either with (green) or without (red) a Halomonas host by viralAssociationPipeline (vAP) per the original publication’s findings. Grey boxes signify the MGE contigs that had a BLAST hit of > 5% query coverage to our Halomonas subgraph. (c) Visualization, generated with rhea and Bandage, of the log fold change in coverage for the Halomonas subgraph. Left shows the complete Halomonas subgraph between weeks 4 and 9 (C3), selected because it shows a general decrease in abundance yet an increase in abundance for several subsequences. Right zooms in on a small portion of the subgraph containing an interesting evolutionary pattern, where the log fold change in coverage graph is shown for each pair of subsequent time points (C1–4). The graph node marked with a * is the node containing the predicted type I restriction-modification system.

References

    1. Abante J., Wang P.L., Salzman J.: DIVE: A reference-free statistical approach to diversity-generating and mobile genetic element discovery. Genome Biology 24(1), 240 (Oct 2023). doi: 10.1186/s13059-023-03038-0
    2. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J.: Basic local alignment search tool. Journal of Molecular Biology 215(3), 403–410 (Oct 1990). doi: 10.1016/S0022-2836(05)80360-2
    3. Balaji A., Liu Y., Nute M.G., Hu B., Kappell A.D., Lesassier D.S., Godbold G.D., Ternus K., Treangen T.: SeqScreen-Nano: A computational platform for streaming, in-field characterization of microbial pathogens. In: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. pp. 1–10. BCB ’23, Association for Computing Machinery, New York, NY, USA (Oct 2023). doi: 10.1145/3584371.3612960
    4. Benoit G., Raguideau S., James R., Phillippy A.M., Chikhi R., Quince C.: High-quality metagenome assembly from long accurate reads with metaMDBG. Nature Biotechnology pp. 1–6 (Jan 2024). doi: 10.1038/s41587-023-01983-6
    5. Bhaya D., Grossman A.R., Steunou A.S., Khuri N., Cohan F.M., Hamamura N., Melendrez M.C., Bateson M.M., Ward D.M., Heidelberg J.F.: Population level functional diversity in a microbial community revealed by comparative genomic and metagenomic analyses. The ISME Journal 1(8), 703–713 (Dec 2007). doi: 10.1038/ismej.2007.46
