[Preprint]. 2025 Sep 18:2025.02.12.637985.
doi: 10.1101/2025.02.12.637985.

Synthetic community Hi-C benchmarking provides a baseline for virus-host inferences

Rokaiya Nurani Shatadru et al. bioRxiv.

Abstract

Microbiomes influence diverse ecosystems, and viruses increasingly appear to impose key constraints. While viromics has expanded genomic catalogs, host identification for these viruses remains challenging due to the limitations in scaling cultivation-based approaches and the uncertain reliability and relatively low resolution of in silico predictions, particularly for understudied viral taxa. To address this, Hi-C proximity ligation uses sequenced, cross-linked virus and host genomic fragments to infer virus-host linkages and has now been applied in at least ten studies. However, its accuracy remains unknown. Here we assess Hi-C performance in recovering virus-host interactions using synthetic communities (SynComs) composed of four marine bacterial strains and nine phages with known interactions and then apply optimized bioinformatic protocols to natural soil samples. In SynComs, standard Hi-C sample preparations and analyses showed poor normalized contact score performance (26% specificity, 100% sensitivity, incorrect matches up to class level) that could be dramatically improved by Z-score filtering (Z ≥ 0.5, 99% specificity), though at reduced sensitivity (62% down from 100%). A detection limit was also established: reproducibility was poor below a minimal phage abundance of 10⁵ PFU/mL. Applying optimized bioinformatic protocols to natural soil samples, we compared virus-host linkages inferred from proximity-ligated Hi-C sequencing with predictions generated by in silico homology-based and machine learning-based bioinformatic approaches. Prior to Z-score thresholding, agreement was relatively high at the phylum to family levels (72%), but not at the genus (43%) or species (15%) levels. Z-score thresholding reduced sensitivity (only 34% of predictions were retained), with only modest improvements in congruence with bioinformatic methods (48% or 18% at genus or species levels, respectively). Regardless, this led to 79 genus-level-congruent virus-host linkages and 293 new ones revealed by Hi-C alone, providing many new virus-host interactions to explore in already well-studied climate-critical soils. Overall, these findings provide empirical benchmarks and methodological guidelines to improve the accuracy and reliability of Hi-C for virus-host linkage studies in complex microbial communities.
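
The sensitivity and specificity figures quoted in the abstract can be read as rates over all virus-host pairs in a synthetic community, scored against the known plaque-assay outcomes. The following is a minimal sketch under that assumption; the function name, the pair-set representation, and the toy identifiers are hypothetical and are not taken from the preprint.

```python
def sensitivity_specificity(known, predicted, viruses, hosts):
    """Score predicted linkages against known infections over every virus-host pair.

    known, predicted: sets of (virus, host) tuples (hypothetical representation).
    """
    tp = fp = tn = fn = 0
    for v in viruses:
        for h in hosts:
            truth = (v, h) in known      # pair infects according to plaque assay
            call = (v, h) in predicted   # pair reported as linked by Hi-C
            if truth and call:
                tp += 1
            elif truth:
                fn += 1
            elif call:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy community with made-up identifiers: 3 viruses x 2 hosts, 2 true infections.
viruses = ["v1", "v2", "v3"]
hosts = ["h1", "h2"]
known = {("v1", "h1"), ("v2", "h2")}
predicted = {("v1", "h1"), ("v2", "h2"), ("v3", "h1"), ("v1", "h2")}
sens, spec = sensitivity_specificity(known, predicted, viruses, hosts)
```

In this toy case every true infection is recovered but two of the four non-infecting pairs are also called, which is the same qualitative pattern (high sensitivity, low specificity) the abstract reports for unfiltered contact scores.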

Keywords: Genomics; Hi-C; Virus-Host Interactions.

Conflict of interest statement

Competing Interests None

Figures

Fig 1. Synthetic communities and experimental schema used to assess Hi-C virus-host linkages.
A. Synthetic communities were built from four bacterial strains (CBA = Cellulophaga baltica; PSA = Pseudoalteromonas) and 9 phages (listed fully in S1 Table) that were experimentally evaluated for infection in pairwise combinations via traditional plaque assays. Black boxes denote that the virus successfully plaques on the bacterial strain, whereas white (missing) boxes denote a negative, non-plaquing interaction. B. Schematic representation of the Hi-C experiment used to test virus-host relationships. After generating the synthetic community with the organisms mentioned above, Hi-C libraries were prepared and sequenced. Subsequently, bioinformatic analyses were performed to determine whether the expected virus-host linkages known from pairwise isolate-based experiments (denoted with black boxes) were observed.
Fig 2. Hi-C linkages from SynCom-1.
A. Contact scores (left) and corresponding Z-scores (right) calculated for each replicate of SynCom-1, categorized by host strains. The contact score represents the number of Hi-C linkages between a virus and a host genome, normalized for the number of restriction sites, genome length, and coverage. Z-scores were calculated from the contact scores within each sample to enable comparison across samples. The black dots indicate correct virus-host linkages, the grey dots indicate incorrect virus-host linkages. The red vertical dotted line is drawn at Z-score = 0.5. B. Virus-host linkages determined from a non-zero contact score (left) or using a filtering approach (i.e., requiring a Z-score generated from non-score normalized Hi-C scores above 0.5; right). The black boxes denote true positives, the grey boxes denote false positives, and the striped boxes denote false negatives.
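
The within-sample Z-score filter described in this caption can be illustrated roughly as follows. This is a sketch, not the authors' pipeline: it assumes the conventional Z-score (value minus sample mean, divided by the standard deviation), uses the population standard deviation, includes every listed pair in the mean, and keeps pairs at or above the 0.5 cutoff as stated in the abstract. The function name and the toy scores are hypothetical.

```python
from statistics import mean, pstdev


def zscore_filter(contact_scores, threshold=0.5):
    """Return {pair: z} for pairs whose within-sample Z-score is >= threshold.

    contact_scores: {(virus, host): normalized contact score} for ONE sample.
    Assumption: population standard deviation; the preprint may differ.
    """
    values = list(contact_scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All scores identical: no pair stands out, so nothing passes.
        return {}
    z = {pair: (score - mu) / sigma for pair, score in contact_scores.items()}
    return {pair: zs for pair, zs in z.items() if zs >= threshold}


# Toy sample with made-up scores: one strong linkage among weak background contacts.
sample = {
    ("phiA", "H1"): 10.0,
    ("phiA", "H2"): 1.0,
    ("phiB", "H1"): 1.0,
    ("phiB", "H2"): 0.0,
}
kept = zscore_filter(sample)
```

Because the Z-score is computed per sample, the same raw contact score can pass in one library and fail in another, which is why the caption describes it as enabling comparison across samples.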
Fig 3. Cryopreservation experiment to assess impact on SynCom-1 Hi-C linkages.
A. All Z-scores (left) and virus-host linkages (right) for each replicate of SynCom-1 cryopreserved with DMSO and categorized by host strains. Black and gray dots indicate correct or incorrect virus-host linkages, respectively, while black, gray, and striped boxes indicate true and false positives, and false negatives, respectively. The red vertical dotted line is drawn at Z-score = 0.5. B. Same data type as A, but for betaine-preserved samples. C. Average sensitivity (gray bar) and specificity (black bar) rates calculated for SynCom-1 treated with and without cryoprotective agents.
Fig 4. Detection limit experiment to evaluate Hi-C linkages in varied concentration SynCom-2 and SynCom-3.
A. All Z-scores (left) and virus-host linkages (right) for each replicate of SynCom-2, categorized by host strains. All figure elements are the same as described in Fig 3. B. Same data type as A, but for SynCom-3. C. Average sensitivity and specificity rates calculated for SynCom-1, SynCom-2, and SynCom-3 without cryoprotective agents. The grey bar represents sensitivity, and the black bar represents specificity.
Fig 5. Comparison of virus-host prediction from Hi-C and in silico tools.
A. Euler plot showing the overlap of viruses with host predictions obtained from the experimental Hi-C linkage approach, or one of two in silico tools (iPHoP and VirMatcher) that use different probabilistic models to aggregate output of various sequence-based features to create host prediction scores. B. Comparison of virus-host predictions across all samples between Hi-C and iPHoP, shown with and without applying a Z-score filter for the Hi-C linkages. Black bars indicate congruent predictions identified from both tools and grey bars indicate non-congruent predictions. Note: Although many viruses had multiple predicted hosts from each tool, only the top-scoring prediction for each virus was considered in this comparison.
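
The comparison in panel B (top-scoring host per virus, then agreement at a chosen taxonomic rank) could be sketched as below. The data layout, function names, and placeholder lineages are assumptions made for illustration. This does not use or reproduce the output formats of iPHoP or VirMatcher, and treating two hosts as congruent only when their lineages match at every rank down to the chosen one is likewise an assumption.

```python
RANKS = ("phylum", "class", "order", "family", "genus", "species")


def top_hit(predictions):
    """Keep the highest-scoring host lineage per virus (first one wins on ties).

    predictions: iterable of (virus, lineage, score), lineage ordered as RANKS.
    """
    best = {}
    for virus, lineage, score in predictions:
        if virus not in best or score > best[virus][1]:
            best[virus] = (lineage, score)
    return {virus: lineage for virus, (lineage, _) in best.items()}


def congruence(hic, insilico, rank):
    """Fraction of shared viruses whose top hosts agree down to `rank`."""
    depth = RANKS.index(rank) + 1
    shared = hic.keys() & insilico.keys()
    if not shared:
        return float("nan")
    agree = sum(hic[v][:depth] == insilico[v][:depth] for v in shared)
    return agree / len(shared)


# Placeholder lineages (P = phylum ... S = species); not real taxa.
hic = top_hit([
    ("v1", ("P1", "C1", "O1", "F1", "G1", "S1"), 3.0),
    ("v1", ("P2", "C2", "O2", "F2", "G2", "S2"), 1.0),  # lower score, dropped
    ("v2", ("P1", "C1", "O1", "F2", "G3", "S4"), 2.0),
])
insilico = top_hit([
    ("v1", ("P1", "C1", "O1", "F1", "G1", "S2"), 0.9),
    ("v2", ("P1", "C1", "O1", "F1", "G1", "S1"), 0.8),
])
genus_agreement = congruence(hic, insilico, "genus")
```

Agreement can only stay flat or fall as the rank becomes finer, which matches the direction of the trend reported in the abstract (higher congruence at phylum to family than at genus or species).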

