Review
Front Bioinform. 2023 Mar 3:3:1157956.
doi: 10.3389/fbinf.2023.1157956. eCollection 2023.

Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters

Fotis A Baltoumas et al. Front Bioinform.

Abstract

Metagenomics has enabled accessing the genetic repertoire of natural microbial communities. Metagenome shotgun sequencing has become the method of choice for studying and classifying microorganisms from various environments. To this end, several methods have been developed to process and analyze the sequence data from raw reads to end-products such as predicted protein sequences or families. In this article, we provide a thorough review to simplify such processes and discuss the alternative methodologies that can be followed in order to explore biodiversity at the protein family level. We provide details for analysis tools and we comment on their scalability as well as their advantages and disadvantages. Finally, we report the available data repositories and recommend various approaches for protein family annotation related to phylogenetic distribution, structure prediction and metadata enrichment.

Keywords: biodiversity; cluster annotation; metagenomes; metatranscriptomes; microbial dark matter; protein clustering; protein families.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Illustration of a typical metagenomic analysis. (A) Sample collection. (B) Marker gene detection and taxonomic assignment. (C) DNA reads are mapped to a reference genome. (D) DNA reads are assembled into contigs using de novo assembly.
FIGURE 2
Gene calling and annotation in IMG/M (A), MGnify (B) and MG-RAST (C). Simplified overviews of the three workflows are shown. Gene calling operations (RNA or protein) are colored salmon pink, while gene annotation operations are colored light green. The tools used in each workflow are given in the graph and described in the main text. The workflows are based on the methodology described in Clum et al. (2021), Mitchell et al. (2019) and Meyer et al. (2019).
FIGURE 3
Sequence-based clustering. (A) A k-mer example. (B) Possible clusters based on common k-mers. (C) Different types of sequence assignment to clusters based on the alignment length coverage.
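
As a rough illustration of panels (A) and (B) of Figure 3, the sketch below extracts the overlapping k-mers of two short, made-up protein sequences and reports the fraction of k-mers they share. The function names, the example sequences, and the choice to normalize by the smaller k-mer set are assumptions made for illustration; they do not reproduce the scoring of any particular clustering tool.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(a: str, b: str, k: int) -> float:
    """Fraction of k-mers shared by a and b, relative to the smaller k-mer set.

    This normalization is an illustrative assumption, not a published score.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0  # at least one sequence is shorter than k
    return len(ka & kb) / min(len(ka), len(kb))


# Hypothetical example sequences (not taken from the article).
seq_a = "MKVLA"
seq_b = "KVLAG"
print(sorted(kmers(seq_a, 3)))
print(shared_kmer_fraction(seq_a, seq_b, 3))
```

Sequences that share many k-mers are cheap to identify as candidate cluster members before any full alignment is computed, which is the idea the panels depict.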
FIGURE 4
Graph-based family generation. (A) Sample collection. (B) All-against-all comparison. (C) Creation of a sequence similarity network (SSN) after applying, for example, an edge threshold of 50% identity and 50% alignment length. (D) Graph-based clustering.
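
Steps (C) and (D) of Figure 4 can be sketched as follows, assuming the all-against-all comparison has already produced a list of pairwise hits with percent identity and percent alignment coverage. The tuple layout, the function names, and the example hits are hypothetical. Connected components are used here only as the simplest possible graph-based clustering; they stand in for whatever clustering algorithm a real pipeline would apply.

```python
from collections import defaultdict


def build_ssn(hits, min_identity=50.0, min_coverage=50.0):
    """Build an undirected SSN as an adjacency dict.

    hits: iterable of (query, target, percent_identity, percent_coverage).
    An edge is kept only if both thresholds are met (50%/50% as in the caption).
    """
    graph = defaultdict(set)
    for query, target, identity, coverage in hits:
        graph[query]   # register both nodes, even if no edge passes the filter
        graph[target]
        if query != target and identity >= min_identity and coverage >= min_coverage:
            graph[query].add(target)
            graph[target].add(query)
    return dict(graph)


def connected_components(graph):
    """Return the connected components of the graph as a list of sets."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(component)
    return clusters


# Hypothetical pairwise hits: (query, target, % identity, % coverage).
example_hits = [
    ("p1", "p2", 80.0, 90.0),
    ("p2", "p3", 55.0, 60.0),
    ("p3", "p4", 30.0, 95.0),  # fails the identity threshold
    ("p4", "p5", 70.0, 40.0),  # fails the coverage threshold
]
clusters = connected_components(build_ssn(example_hits))
print(sorted(sorted(c) for c in clusters))
```

Raising either threshold removes edges and splits families into smaller clusters, so the cutoffs directly control cluster granularity.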
FIGURE 5
Schematic representation of different 3D modeling approaches. (A) Homology modeling. (B) Threading. (C) Sequence coevolution. (D) Deep learning (the AlphaFold2 model is shown as an example).
FIGURE 6
Example structure search and functional annotation for a set of predicted 3D structures. In the first step, the models are filtered to keep only high-quality models, typically represented by a high predicted TM-score (pTM) value. The models are also clustered based on their structural similarity. The high-quality, non-redundant set of models can then be searched against databases of structural domains (e.g., CATH-Gene3D, SCOP and SCOPe) with a fast, TM-score-based method such as TM-align. Models with significant hits (TM-score ≥0.50) are functionally annotated based on their structural homologs. Models with no hits (TM-score <0.50) are further searched against databases of full-length structures (containing one or multiple domains), biological assemblies or protein-protein complexes (PDB, ModelArchive, AlphaFoldDB, etc.) with a multimeric complex-enabled search method such as MM-align. Again, models with significant hits are functionally annotated based on their homologs. Finally, models with no hits to any structural database can be considered as potential novel folds.
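
The decision logic described in the Figure 6 caption can be written as a small triage function. This is a minimal sketch that assumes the best TM-scores from the domain-level and full-length searches have already been computed elsewhere; it does not run TM-align or MM-align. The TM-score cutoff of 0.50 comes from the caption, while the pTM cutoff of 0.7, the function name, and the category labels are assumptions chosen for illustration, since the caption only asks for a "high" pTM.

```python
def triage_model(ptm, best_domain_tm, best_full_length_tm,
                 min_ptm=0.7, tm_cutoff=0.50):
    """Assign a predicted structure to one of four outcomes.

    ptm: predicted TM-score of the model.
    best_domain_tm: best TM-score against structural domain databases, or None.
    best_full_length_tm: best TM-score against full-length or complex
        databases, or None.
    min_ptm is a hypothetical quality cutoff; tm_cutoff follows the caption.
    """
    if ptm < min_ptm:
        return "discarded_low_quality"
    if (best_domain_tm or 0.0) >= tm_cutoff:
        return "annotated_by_domain_hit"
    if (best_full_length_tm or 0.0) >= tm_cutoff:
        return "annotated_by_full_length_hit"
    return "potential_novel_fold"


# Hypothetical models: (pTM, best domain TM-score, best full-length TM-score).
examples = {
    "model_a": (0.91, 0.62, None),
    "model_b": (0.85, 0.41, 0.57),
    "model_c": (0.88, 0.33, 0.29),
    "model_d": (0.45, 0.70, 0.70),
}
for name, scores in examples.items():
    print(name, triage_model(*scores))
```

Ordering the checks this way mirrors the caption: the cheaper domain-level search is consulted first, and the full-length search only decides the outcome for models the domain search could not annotate.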
FIGURE 7
Example of a gene neighborhood analysis for a cluster of unannotated metagenome sequences, represented as “M-gene”. (A) Simplified visualization of a synteny analysis for seven metagenome scaffolds containing members of the M-gene cluster. Each gene is represented by an arrow and colored differently. The direction of the arrows represents the directionality of the ORFs in each scaffold. In the analyzed scaffolds, the M-gene ORF co-occurs with a number of other protein-coding genes, each corresponding to a Pfam domain. (B) Gene co-occurrence network, based on the results of the synteny analysis. Each node represents a protein-coding gene and is colored using the same scheme as in (A). Edges (interactions) between nodes are derived from the co-occurrence of their genes in the same scaffold. As can be seen, the unannotated metagenomic cluster (M-gene) co-occurs with a tightly connected group of Pfam domains (Phosphoesterase, Glyco_tran_WecG, Polysacc_deac_1 and Glyco_hydro_39), which are all found in the same scaffolds alongside M-gene members. In addition, M-gene co-occurs with RmlD_sub_bind. Notably, four of the co-occurring protein domains are in the same functional category (Cell wall/membrane/envelope biogenesis), as indicated by their annotation in COG. This suggests that the unannotated M-gene cluster may participate in this function as well. The network was constructed using NORMA (Karatzas et al., 2022b).
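
The way edges in panel (B) of Figure 7 are derived from scaffold membership can be sketched in a few lines. The scaffold contents below are invented for illustration and do not reproduce the seven scaffolds in the figure; the function name and the use of co-occurrence counts as edge weights are likewise assumptions. The sketch only builds the edge list and does not involve NORMA.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(scaffolds):
    """Count, for every gene pair, the number of scaffolds containing both.

    scaffolds: dict mapping scaffold ID to a list of gene or domain labels.
    Returns a Counter keyed by alphabetically ordered (gene_a, gene_b) pairs.
    """
    edges = Counter()
    for genes in scaffolds.values():
        # Deduplicate within a scaffold so each scaffold counts once per pair.
        for pair in combinations(sorted(set(genes)), 2):
            edges[pair] += 1
    return edges


# Hypothetical scaffolds (labels borrowed from the caption, layout invented).
example_scaffolds = {
    "scaffold_1": ["M-gene", "Phosphoesterase", "Glyco_tran_WecG"],
    "scaffold_2": ["M-gene", "Phosphoesterase", "Glyco_hydro_39"],
    "scaffold_3": ["Glyco_hydro_39", "M-gene"],
}
for (gene_a, gene_b), weight in sorted(cooccurrence_edges(example_scaffolds).items()):
    print(gene_a, gene_b, weight)
```

Pairs that recur across many scaffolds receive higher weights, which is what makes a consistent neighborhood around an unannotated cluster stand out.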

Cited by

  • Unraveling the functional dark matter through global metagenomics.
    Pavlopoulos GA, Baltoumas FA, Liu S, Selvitopi O, Camargo AP, Nayfach S, Azad A, Roux S, Call L, Ivanova NN, Chen IM, Paez-Espino D, Karatzas E; Novel Metagenome Protein Families Consortium; Iliopoulos I, Konstantinidis K, Tiedje JM, Pett-Ridge J, Baker D, Visel A, Ouzounis CA, Ovchinnikov S, Buluç A, Kyrpides NC. Nature. 2023 Oct;622(7983):594-602. doi: 10.1038/s41586-023-06583-7. Epub 2023 Oct 11. PMID: 37821698. Free PMC article.
  • Visualizing metagenomic and metatranscriptomic data: A comprehensive review.
    Aplakidou E, Vergoulidis N, Chasapi M, Venetsianou NK, Kokoli M, Panagiotopoulou E, Iliopoulos I, Karatzas E, Pafilis E, Georgakopoulos-Soares I, Kyrpides NC, Pavlopoulos GA, Baltoumas FA. Comput Struct Biotechnol J. 2024 May 3;23:2011-2033. doi: 10.1016/j.csbj.2024.04.060. eCollection 2024 Dec. PMID: 38765606. Free PMC article. Review.
