Nat Microbiol. 2021 Jul;6(7):960-970.
doi: 10.1038/s41564-021-00928-6. Epub 2021 Jun 24.

Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome

Stephen Nayfach et al. Nat Microbiol. 2021 Jul.

Abstract

Bacteriophages have important roles in the ecology of the human gut microbiome but are under-represented in reference databases. To address this problem, we assembled the Metagenomic Gut Virus catalogue that comprises 189,680 viral genomes from 11,810 publicly available human stool metagenomes. Over 75% of genomes represent double-stranded DNA phages that infect members of the Bacteroidia and Clostridia classes. Based on sequence clustering we identified 54,118 candidate viral species, 92% of which were not found in existing databases. The Metagenomic Gut Virus catalogue improves detection of viruses in stool metagenomes and accounts for nearly 40% of CRISPR spacers found in human gut Bacteria and Archaea. We also produced a catalogue of 459,375 viral protein clusters to explore the functional potential of the gut virome. This revealed tens of thousands of diversity-generating retroelements, which use error-prone reverse transcription to mutate target genes and may be involved in the molecular arms race between phages and their bacterial hosts.

Conflict of interest statement

P.H. is a co-founder of Microba Life Sciences, which is a microbial genomics company developing microbiome-based diagnostics and therapeutics and offers metagenomic gut microbiome reports. All other authors declare no competing interests.

Figures

Fig. 1. Thousands of high-quality viral genomes recovered from human gut metagenomes.
a, Overview of viral discovery effort and formation of the MGV catalogue. b, Genomic signatures of predicted viral and non-viral metagenomic contigs longer than 20 kb. Displayed data is for 1,000 randomly selected contigs from each category. c, Distribution of estimated genome completeness and classification of MGVs into quality tiers (complete, n = 26,030; >90% complete, n = 53,220; 50–90% complete, n = 110,430; <50% complete, n = 2,620,162; completeness not determined, n = 671,842). d, Metadata and annotations for 189,680 genomes with >50% completeness. For box plots, the middle line denotes the median, the box denotes the interquartile range and the whiskers denote 1.5× the interquartile range.
Fig. 2. Viral connections to human gut Bacteria and Archaea.
a, Bar plots indicating the number of CRISPR spacers across 286,997 human gut Bacteria and Archaea, with the number of genomes indicated in parentheses. Each row indicates one host class containing at least 20 genomes and 100 spacers. The majority of CRISPR spacers are derived from Clostridia and Bacteroidia, reflecting their abundance in the human gut. b, Percentage of CRISPR spacers matching viral genomes with a maximum of one mismatch. c, Host genomes containing a CRISPR-spacer array, and those with a CRISPR-spacer array match to a viral genome. d, Genomes linked to a virus using a combination of approaches as indicated. e, Distribution of known viral families that are associated with each host class. Each host class is infected by a distinct repertoire of viral families.
Fig. 3. Genome clustering and comparison with existing databases.
The 189,680 genomes from the MGV catalogue were compared with human gut virus genomes >50% complete from three databases: IMG/VR (n = 6,895), HuVirDB (n = 9,626) and GVD (n = 4,494). a, Viral genomes were clustered into vOTUs at approximately species, genus and family levels. b, Accumulation curves for vOTUs from the MGV catalogue. c, Percentage of reads from 1,257 unfiltered stool metagenomes, percentage of reads from 585 viral stool metagenomes and percentage of CRISPR spacers from 286,997 UHGG genomes mapped to viral genomes from various databases.
Fig. 4. Phylogenomics of intestinal Caudovirales.
A phylogenetic tree was constructed from 25,528 species-level genomes derived from the MGV and other databases (IMG/VR, HuVirDB and GVD). a, Phylogeny of intestinal Caudovirales. Tree was plotted using iToL and to improve visualization only one genome per genus-level vOTU is displayed. Branch colour indicates whether a lineage is represented by a previously published study (black) or is unique to the MGV catalogue (green). Outer rings display metadata for each vOTU. b, PD was calculated by taking the sum of branch lengths represented by species-level viral genomes. c,d, MGVs from the current study result in a large gain in PD, which is consistent across (c) viral families and (d) viruses infecting different host groups.
Fig. 5. Functional landscape of intestinal phages.
a, Protein-coding viral genes were identified across all MGVs and compared with profile HMMs from five databases. b, Forty-five per cent of genes fail to match any HMM, 30% match an HMM of unknown function and 25% match an HMM of known function. c, The 11,837,198 genes were clustered at 30% AAI using MMseqs2 into 459,375 protein clusters. d, Size distribution of protein clusters. e, An accumulation curve of protein clusters has not reached an asymptote. f, Functional annotations for the largest 75 protein clusters. Reverse transcriptases are highlighted in red. g, Prediction of DGRs based on the combination of the reverse transcriptase gene (PF00078) and TR–VR pair identified using DGRscan. A large fraction of MGVs contain the DGR system. h, DGR prevalence across different categories of viruses. DGRs are most common in lysogenic, dsDNA viruses from the Myoviridae family.
Extended Data Fig. 1. Impact of assembly methods on viral recovery from gut metagenomes.
The MGV catalogue was formed using metagenomic viral contigs identified from three studies that performed large-scale assembly of human stool metagenomes. The CIBIO and MGnify studies used MetaSPAdes for metagenomic assembly while the JGI study used MEGAHIT. To explore the effect of assembler on virus identification, we compared viral contigs identified from a common set of 752 stool samples which were assembled by all three studies and were each represented by a single SRA run accession. a, The number of vOTUs represented by viral contigs (>50% completeness) from each of the three studies. A similar number of vOTUs were identified from metagenomic contigs assembled by each study. b, The number of viral contigs at different quality levels identified from each of the three studies. A greater number of complete and high-quality viral genomes are recovered from the MEGAHIT assemblies.
Extended Data Fig. 2. Diversity of jumbo phages identified in the MGV dataset.
The tree includes MGV sequences alongside a reference set of metagenome-assembled jumbo phages published by Al-Shayeb et al. Branches leading to MGV sequences, or clades composed exclusively of MGV sequences, are highlighted in red. Nodes with support < 50% were collapsed, and nodes with support ≥ 80% are indicated with a grey circle on the corresponding branch. Outer rings indicate the genome quality and continent of origin for MGV sequences. When sequences from different continents were 100% identical and only one sequence was included in the tree, the different continents of origin are indicated with stacked coloured squares. For box plots, the middle line denotes the median, the box denotes the interquartile range (IQR), and the whiskers denote 1.5× the IQR.
Extended Data Fig. 3. Strain level phylogeography of prevalent human gut phages.
Core-genome SNP phylogenies were constructed for individual species-level vOTUs with at least 100 genomes. The figure shows three distinct vOTUs displaying a strong signature of phylogeography. For each tree, viral genomes are displayed as tips with colours indicating the geographic origin of the metagenomic sample.
Extended Data Fig. 4. Antibiotic resistance genes identified from 11.8 million viral proteins.
a-b, Viral genes with putative beta-lactamase domains identified based on hits to the Pfam and KEGG databases, respectively. c-e, Resistance genes (including beta-lactamases) identified using Resfinder, AMRfinder, or the Resistance Gene Identifier (RGI), respectively. f, Overlap of resistance genes identified by Resfinder, AMRfinder, and RGI. Most viral proteins identified with putative beta-lactamase domains are not confirmed as antibiotic resistance genes.
Extended Data Fig. 5. Comparison of viral contigs from the MGV and GPD catalogues.
a, The number of viral contigs with at least 50% completeness from the MGV and GPD catalogues. The GPD catalogue contains 142,809 viral contigs when including those with <50% completeness. Contigs from each catalogue were clustered at 95% ANI over 85% of the length of the shorter sequence to form species-level vOTUs. b, MGV and GPD catalogues were clustered together using the longest contig from each vOTU. c, The histograms show the similarity between contigs from the MGV (n = 54,118) and GPD (n = 46,480) catalogues. d, Similarity to the GPD catalogue for MGV contigs from different viral families: Siphoviridae (n = 22,513), Podoviridae (n = 5,075), Myoviridae (n = 2,560), crAss-like (n = 948), Caudovirales other (n = 19,633), Microviridae (n = 2,133), CRESS DNA (n = 115), other (n = 1,141).
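
The species-level criterion in the Extended Data Fig. 5 caption (95% ANI over 85% of the length of the shorter sequence) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name, its parameters, and the assumption that ANI and aligned length have already been computed by some alignment tool are all hypothetical and used only for illustration.

```python
def same_species_votu(ani, aligned_bp, len_a, len_b, min_ani=95.0, min_af=0.85):
    """Check whether two contigs meet the species-level thresholds.

    Hypothetical helper: ani is the average nucleotide identity in percent,
    aligned_bp is the number of aligned bases, and len_a / len_b are the
    contig lengths in bases. All inputs are assumed to be precomputed.
    """
    shorter = min(len_a, len_b)
    if shorter <= 0:
        raise ValueError("contig lengths must be positive")
    # Aligned fraction is measured relative to the shorter sequence.
    aligned_fraction = aligned_bp / shorter
    return ani >= min_ani and aligned_fraction >= min_af


# Example with made-up numbers: 96.2% ANI, 36 kb aligned of a 40 kb contig.
print(same_species_votu(96.2, 36_000, 40_000, 52_000))
```

Measuring the aligned fraction against the shorter sequence lets a partial genome still be grouped with a complete genome of the same species, which matters because many contigs in the catalogue are 50 to 90% complete.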

