Review. Imeta. 2022 Mar 6;1(1):e9. doi: 10.1002/imt2.9. eCollection 2022 Mar.

How much metagenome data is needed for protein structure prediction: The advantages of targeted approach from the ecological and evolutionary perspectives

Pengshuo Yang et al. Imeta. 2022.

Abstract

It has been shown that three-dimensional protein structures can be modeled by supplementing homologous sequences with metagenome sequences. Even though a large volume of metagenome data has been used for this purpose, the structures of a significant proportion of proteins remain unsolved. In this review, we focus on identifying ecological and evolutionary patterns in metagenome data, decoding the complicated relationships between these patterns and protein structures, and investigating how these patterns can be used to improve protein structure prediction. First, we propose the metagenome utilization efficiency and marginal effect model to quantify the divergent distribution of homologous sequences for a protein family. Second, we propose that the targeted approach identifies homologous sequences from specified biomes more effectively than the blind search of the untargeted approach. Finally, we determine a lower bound on the metagenome data required to predict all the protein structures in the Pfam database and show that the present metagenome data are insufficient for this purpose. In summary, we identify ecological and evolutionary patterns in metagenome data that may be used to predict protein structures effectively. The targeted approach is promising for extracting homologous sequences and predicting protein structures using these patterns.

Keywords: ecology; evolution; metagenome data; protein 3D structure modeling; targeted approach.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
The number of Pfam families across release versions, up to Pfam version 34.0. The curve shows the number of Pfam families by release version. The pie chart attached to each release version shows the proportion of Pfam families with known and unknown structures.
Figure 2
Examining the data-dependent ecological and evolutionary patterns behind metagenome data from multiple aspects. To examine the correlation between metagenomes and the proteins in Pfam, evolutionary patterns, including the number of homologous sequences and protein properties, are investigated. Ecological patterns, including the enrichment patterns of source species and metagenome niches, are also investigated.
Figure 3
Metagenome sequence utilization efficiency evaluation. (A) Homologous sequences supplemented by metagenome data sets from different biomes were aligned to all the Pfam families, exemplified here by metagenomes from four biomes. Each color indicates a source biome, and the shade represents the number of metagenome sequences aligned to the corresponding Pfam families (darker means more sequences aligned). (B) After the homologous sequences were aligned, the number of Pfam families predicted with reliable structures was calculated. On average, after using the metagenome sequences (in billions of sequences), the number of homologous sequences aligned and the number of reliable structures modeled were calculated. The metagenome sequence utilization efficiency was then evaluated as two ratios: the number of Pfam families relative to the number of metagenome sequences, and the number of supplemented homologous sequences relative to all the metagenome sequences.
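The two ratios named in the Figure 3 caption can be sketched in a few lines. This is only an illustrative reading of the caption, not the authors' implementation: the class `BiomeStats`, the function `utilization_efficiency`, and the field names are hypothetical, and the counts in the usage lines are made up for demonstration.

```python
from dataclasses import dataclass


@dataclass
class BiomeStats:
    """Hypothetical per-biome counts (names are assumptions, not from the paper)."""
    metagenome_sequences: int  # total metagenome sequences searched
    homologs_aligned: int      # sequences aligned to Pfam families as homologs
    families_modeled: int      # Pfam families predicted with reliable structures


def utilization_efficiency(stats: BiomeStats) -> tuple[float, float]:
    """Return (families per metagenome sequence, homolog fraction).

    Both ratios use the total number of metagenome sequences as the
    denominator, following the wording of the caption.
    """
    if stats.metagenome_sequences <= 0:
        raise ValueError("metagenome_sequences must be positive")
    family_ratio = stats.families_modeled / stats.metagenome_sequences
    homolog_ratio = stats.homologs_aligned / stats.metagenome_sequences
    return family_ratio, homolog_ratio


# Made-up numbers, for illustration only.
example = BiomeStats(metagenome_sequences=1_000_000,
                     homologs_aligned=50_000,
                     families_modeled=20)
family_ratio, homolog_ratio = utilization_efficiency(example)
print(f"families per sequence: {family_ratio:.2e}, homolog fraction: {homolog_ratio:.3f}")
```

Comparing these two numbers across biomes is one way to see which data source contributes more per sequence searched.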
Figure 4
Marginal effects evaluation. Based on the data in reference [70], the marginal effects of four biomes (Gut, Lake, Soil, Fermentor) on all the 8700 unknown Pfam families (version 32.0) were evaluated. The background is an ontology structure that contains the protein families and their relationships; each color indicates the biome with a high marginal effect value for that protein family. The marginal effect values are also annotated beside several proteins of interest. The data show that the contributions of different biomes to a specific Pfam family can differ drastically, as reflected by their marginal effect values.
Figure 5
The targeted approach is essentially an enrichment approach. (A) Untargeted approach for protein 3D structure prediction supplemented by metagenome data. (B) Targeted approach for protein 3D structure prediction supplemented by metagenome data. (C) Case studies of modeling Pfam PF07682 and PF05005 with MSAs from different biomes, as in the untargeted approach. For each biome, the number of metagenome sequences and the proportion of aligned homologous sequences among all the metagenome sequences were calculated. The correctness of each 3D structure model was determined by comparing it to the known structure, quantified using the TM-score method. MetaSource is a targeted approach that was developed in a prior study [70]. The model on a gray background is from the source biome predicted by MetaSource. The model with the highest TM-score is shown in blue type. 3D, three-dimensional; MSA, multiple sequence alignment.
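For readers unfamiliar with the TM-score mentioned in the Figure 5 caption, the following is a minimal sketch of the standard formula, TM = (1/L) Σ 1 / (1 + (d_i / d0)^2) with d0 = 1.24 (L − 15)^(1/3) − 1.8, where L is the length of the known (target) structure and d_i is the distance in Å between the i-th pair of aligned residues. It assumes the distances come from a superposition that has already been computed; the full method searches for the superposition that maximizes the score, which is not done here. The function names and the 0.5 Å floor on d0 for short chains are assumptions made for this sketch.

```python
def d0_for_length(target_length: int) -> float:
    """Length-dependent distance scale d0 (in angstroms).

    The 0.5 floor for short chains is an assumption of this sketch;
    it also avoids taking a cube root of a non-positive number.
    """
    if target_length <= 15:
        return 0.5
    return max(1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances: list[float], target_length: int) -> float:
    """Score one fixed superposition.

    distances: distance (angstroms) for each aligned residue pair.
    target_length: number of residues in the known structure, used
    for normalization, so unaligned residues lower the score.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    d0 = d0_for_length(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# Illustrative distances only: a 100-residue target with 80 aligned pairs.
print(tm_score([1.0] * 60 + [4.0] * 20, target_length=100))
```

Because the sum is normalized by the target length, the score lies between 0 and 1, and a model that matches the known structure exactly over its full length scores 1.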
Figure 6
The relationship between the growing number of proteins and the growing number of metagenome sequences. (A) The number of sequences in Pfam across release versions. (B) The correlation between the number of metagenome sequences and the number of sequences in Pfam. Each node represents a Pfam release version.

References

    1. Britton, Candace S., Sorrells Trevor R., and Johnson Alexander D. 2020. “Protein‐Coding Changes Preceded Cis‐Regulatory Gains in a Newly Evolved Transcription Circuit.” Science 367: 96–100. doi: 10.1126/science.aax5217
    1. Levin, Doron, Raab Neta, Pinto Yishay, Rothschild Daphna, Zanir Gal, Godneva Anastasia, Mellul Nadav, et al. 2021. “Diversity and Functional Landscapes in the Microbiota of Animals in the Wild.” Science 372(6539): eabb5352. doi: 10.1126/science.abb5352
    1. North, Justin A., Narrowe Adrienne B., Xiong Weili, Byerly Kathryn M., Zhao Guanqi, Young Sarah J., Murali Srividya, et al. 2020. “A Nitrogenase‐like Enzyme System Catalyzes Methionine, Ethylene, and Methane Biogenesis.” Science 369: 1094–98. doi: 10.1126/science.abb6310
    1. Zhang, Chengxin, Zheng Wei, Freddolino Peter L., and Zhang Yang. 2018. “MetaGO: Predicting Gene Ontology of Non‐Homologous Proteins Through Low‐Resolution Protein Structure Prediction and Protein‐Protein Network Mapping.” Journal of Molecular Biology 430: 2256–65. doi: 10.1016/j.jmb.2018.03.004
    1. Zheng, Wei, Zhang Chengxin, Li Yang, Pearce Robin, Bell Eric W., and Zhang Yang. 2021. “Folding Non‐Homologous Proteins by Coupling Deep‐Learning Contact Maps with I‐TASSER Assembly Simulations.” Cell Reports Methods 1(3): 100014. doi: 10.1016/j.crmeth.2021.100014
