Bioinformatics. 2021 Nov 18;37(22):4202-4208.

doi: 10.1093/bioinformatics/btab451.

efam: an expanded, metaproteome-supported HMM profile database of viral protein families

Ahmed A Zayed¹,², Dominik Lücking³, Mohamed Mohssen¹,²,⁴, Dylan Cronin¹, Ben Bolduc¹, Ann C Gregory⁵,⁶, Katherine R Hargreaves¹,⁷, Paul D Piehowski⁸, Richard A White III⁹,¹⁰,¹¹,¹², Eric L Huang⁸, Joshua N Adkins⁸, Simon Roux¹³, Cristina Moraru¹⁴, Matthew B Sullivan¹,²,⁴,¹⁵

Affiliations

¹ Department of Microbiology, The Ohio State University, Columbus, OH 43210, USA.
² Center of Microbiome Science, The Ohio State University, Columbus, OH 43210, USA.
³ Max-Planck-Institut fuer Marine Mikrobiologie, Bremen 28359, Germany.
⁴ The Interdisciplinary Biophysics Graduate Program, The Ohio State University, Columbus, OH 43210, USA.
⁵ Department of Microbiology and Immunology, Rega Institute, KU Leuven, Leuven 3000, Belgium.
⁶ VIB-KU Leuven Center for Microbiology, Leuven, Belgium.
⁷ Department of Life Sciences, Manchester Metropolitan University, Manchester M1 5GD, UK.
⁸ Earth and Biological Sciences Directorate, PNNL, Richland, WA 99354, USA.
⁹ Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte 9201 University City Boulevard, Charlotte, NC 28223, USA.
¹⁰ Department of Bioinformatics and Genomics, The University of North Carolina at Charlotte 150 Research Campus Drive, Kannapolis, NC 28081, USA.
¹¹ Australian Centre for Astrobiology, University of New South Wales, Sydney, NSW 2052, Australia.
¹² RAW Molecular Systems (RAW), INC, Concord, NC 28025, USA.
¹³ DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
¹⁴ The Institute for Chemistry and Biology of the Marine Environment (ICBM), University of Oldenburg, Oldenburg 26111, Germany.
¹⁵ Department of Civil, Environmental and Geodetic Engineering, The Ohio State University, Columbus, OH 43210, USA.

PMID: 34132786
PMCID: PMC9502166
DOI: 10.1093/bioinformatics/btab451

Ahmed A Zayed et al. Bioinformatics. 2021.

Abstract

Motivation: Viruses infect, reprogram and kill microbes, leading to profound ecosystem consequences, from elemental cycling in oceans and soils to microbiome-modulated diseases in plants and animals. Although metagenomic datasets are increasingly available, identifying viruses in them is challenging due to poor representation and annotation of viral sequences in databases.

Results: Here, we establish efam, an expanded collection of Hidden Markov Model (HMM) profiles that represent viral protein families conservatively identified from the Global Ocean Virome 2.0 dataset. This resulted in 240 311 HMM profiles, each with at least 2 protein sequences, making efam >7-fold larger than the next largest, pan-ecosystem viral HMM profile database. Adjusting the criteria for viral contig confidence from 'conservative' to 'eXtremely Conservative' resulted in 37 841 HMM profiles in our efam-XC database. To assess the value of this resource, we integrated efam-XC into VirSorter viral discovery software to discover viruses from less-studied, ecologically distinct oxygen minimum zone (OMZ) marine habitats. This expanded database led to an increase in viruses recovered from every tested OMZ virome by ∼24% on average (up to ∼42%) and especially improved the recovery of often-missed shorter contigs (<5 kb). Additionally, to help elucidate lesser-known viral protein functions, we annotated the profiles using multiple databases from the DRAM pipeline and virion-associated metaproteomic data, which doubled the number of annotations obtainable by standard, single-database annotation approaches. Together, these marine resources (efam and efam-XC) are provided as searchable, compressed HMM databases that will be updated bi-annually to help maximize viral sequence discovery and study from any ecosystem.

Availability and implementation: The resources are available on the iVirus platform at (doi.org/10.25739/9vze-4143).
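As a rough illustration of how a pressed HMM database of this kind might be queried, the sketch below assembles an HMMER `hmmscan` command line. The file names (`efam.hmm`, `contig_proteins.faa`, `hits.tbl`), the E-value cutoff and the thread count are hypothetical placeholders, not values taken from the paper or from the iVirus release.

```python
import subprocess


def build_hmmscan_command(hmm_db, proteins_faa, tblout, evalue=1e-5, cpus=4):
    """Build an hmmscan argument list (hmmscan [options] <hmmdb> <seqfile>).

    All paths and thresholds are illustrative assumptions.
    """
    return [
        "hmmscan",
        "--tblout", tblout,      # per-target tabular summary of hits
        "-E", str(evalue),       # report hits at or below this E-value
        "--cpu", str(cpus),      # number of worker threads
        hmm_db,                  # pressed HMM database (hypothetical file name)
        proteins_faa,            # predicted proteins in FASTA format
    ]


def run_hmmscan(cmd):
    """Run the command; requires HMMER to be installed and the files to exist."""
    subprocess.run(cmd, check=True)


cmd = build_hmmscan_command("efam.hmm", "contig_proteins.faa", "hits.tbl")
print(" ".join(cmd))
```

The command is only built and printed here; calling `run_hmmscan(cmd)` would execute it on a machine where HMMER and the input files are available.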

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Computational workflow used to construct efam and efam-XC. The pipeline shows the major steps (rectangles) followed to generate efam and efam-XC. Where applicable, the software used in each step is shown in parentheses. Viral sequences from GOV2.0 were re-analyzed by three different viral prediction tools, and extremely conservative subsets of these predictions were used downstream after decontamination by CheckV (i.e. removal of prokaryotic genes from any potential prophage contig). Open reading frames (ORFs) on each viral contig were predicted, and the protein sequences that were 95% locally similar to bacterial or archaeal proteins were removed. The remaining proteins were then dereplicated and clustered, and the sequences within each cluster were aligned in a multiple sequence alignment. Finally, an HMM profile was built from each alignment and the profiles were pressed into a searchable HMM database. Statistics for each step in the generation of efam and efam-XC are shown in the bottom right corner of each box.
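The dereplication and cluster-size steps of this workflow can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: dereplication is reduced to exact-sequence matching, the cluster labels are hypothetical inputs standing in for the output of a clustering tool, and the minimum size of 2 follows the abstract's statement that each profile has at least 2 protein sequences.

```python
from collections import defaultdict


def dereplicate(proteins):
    """Keep the first ID seen for each distinct sequence (exact matches only)."""
    first_id_for_seq = {}
    for pid, seq in proteins.items():
        first_id_for_seq.setdefault(seq, pid)
    return {pid: seq for seq, pid in first_id_for_seq.items()}


def clusters_for_profiles(assignments, min_size=2):
    """Group protein IDs by cluster label and drop clusters below min_size."""
    groups = defaultdict(list)
    for pid, cluster in assignments.items():
        groups[cluster].append(pid)
    return {c: sorted(members) for c, members in groups.items()
            if len(members) >= min_size}


# Toy input: p2 is an exact duplicate of p1 and is dropped.
proteins = {"p1": "MKT", "p2": "MKT", "p3": "MAL", "p4": "MSV", "p5": "MQQ"}
unique = dereplicate(proteins)

# Hypothetical cluster labels for the surviving proteins.
assignments = {"p1": "c1", "p3": "c1", "p4": "c2", "p5": "c3"}
kept = clusters_for_profiles(
    {pid: c for pid, c in assignments.items() if pid in unique}
)
print(kept)
```

Only cluster `c1` survives in this toy case; each surviving cluster would then be aligned and turned into one HMM profile.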

**Fig. 2.**
Stringency levels used for selecting the viral contigs contributing to efam and efam-XC. The Venn diagram shows the extent of agreement between VirSorter, DeepVirFinder and MARVEL at the highest stringency level of each program. Contigs identified at the highest stringency by at least two programs were used to construct efam, while contigs identified at the highest stringency by all three programs were used to construct efam-XC.
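The selection rule in this caption amounts to counting how many tools call each contig at their highest stringency. A minimal sketch, assuming each tool's high-stringency calls are already available as a set of contig IDs (the IDs below are made up):

```python
from collections import Counter


def select_contigs(calls_by_tool, min_tools):
    """Return contigs called by at least `min_tools` of the given tools."""
    counts = Counter(
        contig for calls in calls_by_tool.values() for contig in set(calls)
    )
    return {contig for contig, n in counts.items() if n >= min_tools}


# Hypothetical high-stringency calls per tool.
calls = {
    "VirSorter": {"c1", "c2", "c3"},
    "DeepVirFinder": {"c1", "c2", "c4"},
    "MARVEL": {"c1", "c5"},
}

efam_contigs = select_contigs(calls, min_tools=2)     # at least two tools agree
efam_xc_contigs = select_contigs(calls, min_tools=3)  # all three tools agree
print(sorted(efam_contigs), sorted(efam_xc_contigs))
```

By construction the efam-XC set is always a subset of the efam set, which matches the much smaller profile count reported for efam-XC.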

**Fig. 3.**
Comparison of viral HMM database sizes and clustering algorithms. Number of clusters (HMM profiles) in efam, efam-XC and currently available public databases. SFams, another HMM profile database (Sharpton *et al.*, 2012), was excluded from our comparisons because it did not include any viral genomes in its construction. (Inset) Clustering structure produced by ClusterOne and MCL for efam (right) and efam-XC (left). The number of clusters on the x-axes was capped at 200 000 (efam) and 20 000 (efam-XC) for visibility. ClusterOne generally produced longer tails (more clusters) and larger clusters, except among the highly ranked clusters. Because ClusterOne was instructed to apply a ‘hair-trimming’ step after clustering to remove dangling nodes, and because the highly ranked clusters have more representative sequences for building the HMM profiles, we felt comfortable proceeding with well-trimmed, slightly smaller high-rank clusters. The number of protein sequences used for clustering in each database is listed in Supplementary Table S4.

**Fig. 4.**
efam-XC enables viral discovery in metagenomes. The paired dot plot (A) shows that the number of viral contigs recovered from every ETSP-OMZ virome increased upon integrating efam-XC into VirSorter. As a result, the median and average number of viral contigs recovered per sample (B) increased for the new implementation of VirSorter, with the average increasing from 2904 to 3558 viral contigs per sample (a 22.5% increase).
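The 22.5% figure follows directly from the two per-sample averages quoted in the caption. A short check, with the function name chosen here purely for illustration:

```python
def percent_increase(before, after):
    """Relative change from `before` to `after`, expressed in percent."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before * 100.0


# Average viral contigs per sample before and after integrating efam-XC.
increase = percent_increase(2904, 3558)
print(f"{increase:.1f}%")  # 22.5%
```

This is the increase of the averages; the ∼24% quoted in the abstract is described there as an average increase across viromes, which need not equal the increase of the averages.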

**Fig. 5.**
efam-XC enhances the recovery of short viral contigs and increases the confidence level of identified contigs. (A) Percent increase in the number of viral contigs recovered by VirSorter from the deeply sequenced LineP sample at different contig sizes upon integrating efam-XC into VirSorter. (B) Percent decrease in the number of low-confidence viral contigs (Cat 3 and Cat 6 of VirSorter) and the prophage category (Cat 5) upon integrating efam-XC into VirSorter. The viral contigs that were removed from these categories were added to the high-confidence categories (Cat 1 and Cat 2), except for 2 contigs from Cat 6, which were moved to Cat 3. The numbers next to each bin are from the VirSorter run before integrating efam-XC.

References

1. Amgarten D. et al. (2018) MARVEL, a tool for prediction of bacteriophage sequences in metagenomic bins. Front. Genet., 9, 304.
2. Bickhart D.M. et al. (2019) Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation. Genome Biol., 20, 153.
3. Bolduc B. et al. (2017) iVirus: facilitating new insights in viral ecology with software and community data sets imbedded in a cyberinfrastructure. ISME J., 11, 7–14.
4. Boratto P.V.M. et al. (2020) A mysterious 80 nm amoeba virus with a near-complete “ORFan genome” challenges the classification of DNA viruses. bioRxiv, doi: 10.1101/2020.01.28.923185.
5. Brum J.R. et al. (2016) Illuminating structural proteins in viral “dark matter” with metaproteomics. Proc. Natl. Acad. Sci. USA, 113, 2436–2441.
