Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters

Affiliations

¹ Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, Greece.
² Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, Berkeley, CA, United States.
³ The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus.
⁴ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, United Kingdom.
⁵ John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, United States.
⁶ Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece.
⁷ Center of New Biotechnologies and Precision Medicine, Department of Medicine, School of Health Sciences, National and Kapodistrian University of Athens, Athens, Greece.
⁸ Hellenic Army Academy, Vari, Greece.

PMID: 36959975
PMCID: PMC10029925
DOI: 10.3389/fbinf.2023.1157956

Review

Fotis A Baltoumas et al. Front Bioinform. 2023.

Front Bioinform. 2023 Mar 3:3:1157956.

doi: 10.3389/fbinf.2023.1157956. eCollection 2023.

Abstract

Metagenomics has enabled access to the genetic repertoire of natural microbial communities, and shotgun metagenome sequencing has become the method of choice for studying and classifying microorganisms from various environments. To this end, several methods have been developed to process and analyze sequence data, from raw reads to end-products such as predicted protein sequences or families. In this article, we provide a thorough review that simplifies these processes and discuss the alternative methodologies that can be followed to explore biodiversity at the protein family level. We describe the available analysis tools and comment on their scalability, advantages and disadvantages. Finally, we report the available data repositories and recommend approaches for protein family annotation related to phylogenetic distribution, structure prediction and metadata enrichment.

Keywords: biodiversity; cluster annotation; metagenomes; metatranscriptomes; microbial dark matter; protein clustering; protein families.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
Illustration of a typical metagenomic analysis. **(A)** Sample collection. **(B)** Marker gene detection and taxonomic assignment. **(C)** DNA reads are mapped to a reference genome. **(D)** DNA reads are assembled into contigs using *de novo* assembly.

**FIGURE 2**
Gene calling and annotation in IMG/M **(A)**, MGnify **(B)** and MG-RAST **(C)**. Simplified overviews of the three workflows are shown. Gene calling operations (RNA or protein) are colored salmon pink, while gene annotation operations are colored light green. The tools used in each workflow are given in the graph and described in the main text. The workflows are based on the methodology described in Clum et al. (2021), Mitchell et al. (2019) and Meyer et al. (2019).

**FIGURE 3**
Sequence-based clustering. **(A)** A k-mer example. **(B)** Possible clusters based on common k-mers. **(C)** Different types of sequence assignment to clusters based on the alignment length coverage.
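
The k-mer idea in panels **(A)** and **(B)** can be sketched in a few lines of Python. This is a minimal illustration only, not the method of any specific clustering tool: the function names, the toy sequences and the choice of `k = 3` are all assumptions made for this example.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(a: str, b: str, k: int) -> float:
    """Fraction of the smaller k-mer set that both sequences share."""
    kmers_a, kmers_b = kmers(a, k), kmers(b, k)
    if not kmers_a or not kmers_b:
        return 0.0  # a sequence shorter than k has no k-mers
    return len(kmers_a & kmers_b) / min(len(kmers_a), len(kmers_b))


# Hypothetical toy sequences, for illustration only.
seq_a = "MKTAYIAKQR"
seq_b = "MKTAYIAKQL"  # differs from seq_a in the last residue
seq_c = "GGSWPLNNDE"  # unrelated to the other two

print(shared_kmer_fraction(seq_a, seq_b, 3))  # high: candidates for one cluster
print(shared_kmer_fraction(seq_a, seq_c, 3))  # no shared 3-mers
```

Sequences that share many k-mers are cheap to identify without computing an alignment, which is why k-mer prefiltering is typically used to propose cluster candidates before any alignment coverage criterion is applied.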

**FIGURE 4**
Graph-based family generation. **(A)** Sample collection. **(B)** All-against-all comparison. **(C)** SSN creation after applying, for example, an edge threshold of 50% identity, 50% alignment length. **(D)** Graph-based clustering.
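
Steps **(C)** and **(D)** can be sketched as follows. The example assumes that the all-against-all comparison has already produced a list of pairwise hits with percent identity and percent alignment coverage; the hit tuples, the protein names and the function names are hypothetical. Connected components are used here only as the simplest possible stand-in for graph-based clustering, and are not necessarily the algorithm used in the workflows reviewed in the article.

```python
def build_ssn(hits, min_identity=50.0, min_coverage=50.0):
    """Build a sequence similarity network as an adjacency dictionary.

    hits: iterable of (query, subject, percent_identity, percent_coverage).
    An edge is kept only if both thresholds are met.
    """
    graph = {}
    for query, subject, identity, coverage in hits:
        # Register both proteins so that singletons are not lost.
        graph.setdefault(query, set())
        graph.setdefault(subject, set())
        if query != subject and identity >= min_identity and coverage >= min_coverage:
            graph[query].add(subject)
            graph[subject].add(query)
    return graph


def connected_components(graph):
    """Group nodes into clusters by depth-first traversal of the network."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(component)
    return clusters


# Hypothetical pairwise hits: (query, subject, % identity, % coverage).
hits = [
    ("p1", "p2", 82.0, 95.0),
    ("p2", "p3", 61.0, 70.0),
    ("p3", "p4", 35.0, 90.0),  # rejected: identity below 50%
    ("p4", "p5", 74.0, 40.0),  # rejected: coverage below 50%
    ("p5", "p6", 90.0, 88.0),
]

families = connected_components(build_ssn(hits))
print(sorted(sorted(family) for family in families))
```

With these toy hits, `p4` remains a singleton because neither of its edges passes both thresholds, while the remaining proteins fall into two families.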

**FIGURE 5**
Schematic representation of different 3D modeling approaches. **(A)** Homology modeling. **(B)** Threading. **(C)** Sequence coevolution. **(D)** Deep learning (the AlphaFold2 model is shown as an example).

**FIGURE 6**
Example structure search and functional annotation for a set of predicted 3D structures. In the first step, the models are filtered to keep only high-quality models, typically represented by a high predicted TM-score (pTM) value. The models are also clustered based on their structural similarity. The high quality, non-redundant set of models can then be searched against databases of structural domains (e.g., CATH-Gene3D, SCOP and SCOPe) with a fast, TM-score based method such as TM-align. Models with significant hits (TM-score ≥0.50) are functionally annotated based on their structural homologs. Models with no hits (TM-score <0.50) are further searched against databases of full-length structures (containing one or multiple domains), biological assemblies or protein-protein complexes (PDB, ModelArchive, AlphaFoldDB, *etc.*) with a multimeric complex-enabled search method such as MM-align. Again, models with significant hits are functionally annotated based on their homologs. Finally, models with no hits to any structural database can be considered as potential novel folds.
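
The decision logic of this workflow can be summarized in a short sketch. It assumes that the best TM-scores against the domain databases and the full-length structure databases have already been computed with external tools; the function name, the returned labels and the pTM cutoff of 0.70 are assumptions made for illustration. Only the TM-score threshold of 0.50 comes from the caption, and the structural clustering step that removes redundancy is omitted.

```python
TM_SCORE_HIT = 0.50  # significance threshold stated in the caption


def triage_model(ptm: float, best_domain_tm: float, best_full_tm: float,
                 min_ptm: float = 0.70) -> str:
    """Assign a predicted structure to one branch of the annotation workflow.

    ptm            predicted TM-score of the model (quality filter)
    best_domain_tm best TM-score against structural domain databases
    best_full_tm   best TM-score against full-length or complex databases
    min_ptm        hypothetical quality cutoff, not taken from the source
    """
    if ptm < min_ptm:
        return "discarded: low-quality model"
    if best_domain_tm >= TM_SCORE_HIT:
        return "annotated from structural domain hit"
    if best_full_tm >= TM_SCORE_HIT:
        return "annotated from full-length or complex hit"
    return "potential novel fold"


# Hypothetical scores for four models.
print(triage_model(0.45, 0.80, 0.80))
print(triage_model(0.85, 0.62, 0.30))
print(triage_model(0.85, 0.41, 0.57))
print(triage_model(0.85, 0.33, 0.29))
```

Ordering the checks this way mirrors the caption: the slower search against full-length structures and complexes is only consulted for models that found no significant domain-level hit.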

**FIGURE 7**
Example of a gene neighborhood analysis for a cluster of unannotated metagenome sequences, represented as “M-gene”. **(A)** Simplified visualization of a synteny analysis for seven metagenome scaffolds containing members of the M-gene cluster. Each gene is represented by an arrow and colored differently. The direction of the arrows represents the directionality of the ORFs in each scaffold. In the analyzed scaffolds, the M-gene ORF co-occurs with a number of other protein-coding genes, each corresponding to a Pfam domain. **(B)** Gene co-occurrence network, based on the results of the synteny analysis. Each node represents a protein-coding gene and is colored using the same scheme as in **(A)**. Edges (interactions) between nodes are derived from the co-occurrence of their genes in the same scaffold. As can be seen, the unannotated metagenomic cluster (M-gene) co-occurs with a tightly connected group of Pfam domains (Phosphoesterase, Glyco_tran_WecG, Polysacc_deac_1 and Glyco_hydro_39), which are all found in the same scaffolds alongside M-gene members. In addition, M-gene co-occurs with RmlD_sub_bind. Notably, four of the co-occurring protein domains are in the same functional category (Cell wall/membrane/envelope biogenesis), as indicated by their annotation in COG. This suggests that the unannotated M-gene cluster may participate in this function as well. The network was constructed using NORMA (Karatzas et al., 2022b).
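
The way edges in panel **(B)** follow from the scaffolds in panel **(A)** can be illustrated with a small sketch. The scaffold contents below are invented for this example and do not reproduce the seven scaffolds of the figure; the function names are likewise hypothetical. The sketch only counts co-occurrences and does not depend on NORMA, which the authors used for network construction and visualization.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(scaffolds):
    """Count, for every gene pair, the scaffolds in which both genes occur."""
    edges = Counter()
    for genes in scaffolds.values():
        # Sort so that each unordered pair always maps to the same key.
        for gene_a, gene_b in combinations(sorted(set(genes)), 2):
            edges[(gene_a, gene_b)] += 1
    return edges


def partners(edges, gene):
    """Return the co-occurrence count of every neighbor of one gene."""
    counts = Counter()
    for (gene_a, gene_b), weight in edges.items():
        if gene_a == gene:
            counts[gene_b] += weight
        elif gene_b == gene:
            counts[gene_a] += weight
    return counts


# Hypothetical scaffolds: each lists the annotations of its ORFs in order.
scaffolds = {
    "scaffold_1": ["Phosphoesterase", "M-gene", "Glyco_hydro_39"],
    "scaffold_2": ["M-gene", "Phosphoesterase", "Glyco_tran_WecG"],
    "scaffold_3": ["Glyco_hydro_39", "M-gene"],
}

edges = cooccurrence_edges(scaffolds)
for neighbor, count in partners(edges, "M-gene").most_common():
    print(neighbor, count)
```

Neighbors that recur across many scaffolds are the ones whose functional annotation is most informative for the unannotated cluster.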
