Review

Imeta. 2022 Mar 6;1(1):e9.

doi: 10.1002/imt2.9. eCollection 2022 Mar.

How much metagenome data is needed for protein structure prediction: The advantages of targeted approach from the ecological and evolutionary perspectives

Pengshuo Yang¹, Kang Ning¹

Affiliation

¹ Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-Imaging, Department of Bioinformatics and Systems Biology, Center of AI Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China.

PMID: 38867727
PMCID: PMC10989767
DOI: 10.1002/imt2.9

Pengshuo Yang et al. Imeta. 2022.

. 2022 Mar 6;1(1):e9.

doi: 10.1002/imt2.9. eCollection 2022 Mar.

Authors

Pengshuo Yang¹, Kang Ning¹

Affiliation

¹ Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-Imaging, Department of Bioinformatics and Systems Biology Center of AI Biology, College of Life Science and Technology, Huazhong University of Science and Technology Wuhan Hubei China.

PMID: 38867727
PMCID: PMC10989767
DOI: 10.1002/imt2.9

Abstract

It has been shown that three-dimensional protein structures can be modeled by supplementing homologous sequences with metagenome sequences. Yet even though a large volume of metagenome data has been used for this purpose, the structures of a significant proportion of proteins remain unsolved. In this review, we focus on identifying ecological and evolutionary patterns in metagenome data, decoding the complicated relationships between these patterns and protein structures, and investigating how the patterns can be used to improve protein structure prediction. First, we propose the metagenome utilization efficiency and marginal effect model to quantify the divergent distribution of homologous sequences across protein families. Second, we propose that the targeted approach identifies homologous sequences from specified biomes more effectively than the blind search of the untargeted approach. Finally, we determine a lower bound on the metagenome data required to predict all the protein structures in the Pfam database and show that the present metagenome data is insufficient for this purpose. In summary, we identify ecological and evolutionary patterns in metagenome data that may be used to predict protein structures effectively. The targeted approach is promising for extracting homologous sequences and predicting protein structures using these patterns.

Keywords: ecology; evolution; metagenome data; protein 3D structure modeling; targeted approach.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

**Figure 1**
The number of Pfam families across release versions, up to Pfam version 34.0. The curve shows the number of Pfam families by release version. The pie chart attached to each release version shows the proportion of Pfam families with known and unknown structures.

**Figure 2**
Examining the data‐dependent ecological and evolutionary patterns behind the metagenome data from multiple aspects. To examine the correlation between metagenome data and proteins in Pfam, evolutionary patterns are investigated, including the number of homologous sequences and protein properties. Ecological patterns are also investigated, including the enrichment patterns of source species and metagenome niche.

**Figure 3**
Metagenome sequence utilization efficiency evaluation. (A) Homologous sequences from metagenome data sets of different biomes were aligned to all the Pfam families, exemplified here by metagenomes from four biomes. Each color denotes a source biome, and the shade of the color represents the number of metagenome sequences aligned to the corresponding Pfam family (the darker the shade, the more sequences aligned). (B) After the homologous sequences were aligned, the number of Pfam families predicted with reliable structures was calculated. On average, for the metagenome sequences used (in billions of sequences), the number of homologous sequences aligned and the number of reliable structures modeled were calculated. The metagenome sequence utilization efficiency was then evaluated as two proportions: the number of Pfam families relative to the number of metagenome sequences, and the number of supplemented homologous sequences relative to all the metagenome sequences.
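The two proportions named in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, assuming the efficiency is a plain ratio of counts; the function name, parameter names, and the choice to return both ratios as a pair are hypothetical and not taken from the paper.

```python
from typing import Tuple


def utilization_efficiency(
    n_metagenome: int,
    n_homologous: int,
    n_families_modeled: int,
) -> Tuple[float, float]:
    """Return (family_ratio, homolog_ratio) for one metagenome data set.

    All parameter names are illustrative assumptions:
      n_metagenome       -- total metagenome sequences searched
      n_homologous       -- metagenome sequences aligned as homologs to Pfam families
      n_families_modeled -- Pfam families for which a reliable structure was modeled
    """
    if n_metagenome <= 0:
        raise ValueError("n_metagenome must be positive")
    if n_homologous < 0 or n_families_modeled < 0:
        raise ValueError("counts must be non-negative")
    # Proportion of Pfam families relative to the metagenome sequences used.
    family_ratio = n_families_modeled / n_metagenome
    # Proportion of supplemented homologous sequences in all metagenome sequences.
    homolog_ratio = n_homologous / n_metagenome
    return family_ratio, homolog_ratio
```

Comparing these two ratios across biomes is one way to see which data sets contribute more homologs or more modeled families per sequence searched; any counts passed in would have to come from an actual alignment run.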

**Figure 4**
Marginal effects evaluation. Using the data and procedure described in reference [70], the marginal effects of four biomes (Gut, Lake, Soil, Fermentor) on all 8700 Pfam families with unknown structures (version 32.0) were evaluated. The background is an ontology structure that contains the protein families and their relationships, and the colors indicate which biome has a high marginal effect value for each protein family. The marginal effect values are also annotated beside several proteins of interest. The data show that the contributions of different biomes to a specific Pfam family can differ drastically, as reflected by their marginal effect values.

**Figure 5**
The targeted approach is essentially an enrichment approach. (A) Untargeted approach for protein 3D structure prediction supplemented by metagenome data. (B) Targeted approach for protein 3D structure prediction supplemented by metagenome data. (C) Case studies of modeling Pfam families PF07682 and PF05005 with MSAs from different biomes, as in the untargeted approach. For each biome, the number of metagenome sequences and the proportion of aligned homologous sequences among all the metagenome sequences were calculated. The correctness of each 3D structure model was determined by comparing it to the known structure, quantified using the TM‐score. MetaSource is a targeted approach developed in a prior study [70]. The model with a gray background corresponds to the source biome predicted by MetaSource. The model with the highest TM‐score is shown in blue type. 3D, three‐dimensional; MSA, multiple sequence alignment.
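For readers unfamiliar with the TM‐score used in Figure 5, the standard definition sums 1 / (1 + (d_i / d0)^2) over aligned residue pairs and divides by the target length L, with d0 = 1.24 (L - 15)^(1/3) - 1.8. The sketch below evaluates that sum for one fixed superposition only. The full TM‐score takes the maximum over superpositions, which this sketch does not attempt, and the function names and the rejection of very short chains are simplifying assumptions rather than the behavior of any particular TM‐score program.

```python
from typing import Sequence


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (in angstroms) from the TM-score definition."""
    if target_length <= 15:
        raise ValueError("this sketch does not handle chains of 15 residues or fewer")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        # Assumption: very short chains need special handling that is out of scope here.
        raise ValueError("target_length too short for a positive d0 in this sketch")
    return d0


def tm_score_fixed_superposition(distances: Sequence[float], target_length: int) -> float:
    """TM-score term for ONE given superposition.

    distances: distances (angstroms) between aligned residue pairs of the
               model and the known structure, after superposition.
    target_length: number of residues in the known (target) structure.
    """
    if len(distances) > target_length:
        raise ValueError("more aligned pairs than target residues")
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length
```

A perfect model (all distances zero, every residue aligned) scores 1.0 under this function, and unaligned residues contribute nothing to the sum, which is why the score is normalized by the target length rather than by the number of aligned pairs.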

**Figure 6**
The relationship between the growing number of proteins and the growing number of metagenome sequences. (A) The number of sequences in Pfam across release versions. (B) The correlation between the number of metagenome sequences and the number of sequences in Pfam. Each node represents a Pfam release version.

Cited by

iMeta: Integrated meta-omics for biology and environments.
Liu YX, Chen T, Li D, Fu J, Liu SJ. Imeta. 2022 Mar 28;1(1):e15. doi: 10.1002/imt2.15. eCollection 2022 Mar. PMID: 38867730. Free PMC article.
Leveraging computer-aided design and artificial intelligence to develop a next-generation multi-epitope tuberculosis vaccine candidate.
Zhuang L, Ali A, Yang L, Ye Z, Li L, Ni R, An Y, Ali SL, Gong W. Infect Med (Beijing). 2024 Nov 9;3(4):100148. doi: 10.1016/j.imj.2024.100148. eCollection 2024 Dec. PMID: 39687693. Free PMC article.
MicroEXPERT: Microbiome profiling platform with cross-study metagenome-wide association analysis functionality.
Yang P, Yang J, Long H, Huang K, Ji L, Lin H, Jiang X, Wang AK, Tian G, Ning K. Imeta. 2023 Aug 17;2(4):e131. doi: 10.1002/imt2.131. eCollection 2023 Nov. PMID: 38868224. Free PMC article.

References

1. Britton, Candace S., Sorrells Trevor R., and Johnson Alexander D. 2020. “Protein‐Coding Changes Preceded Cis‐Regulatory Gains in a Newly Evolved Transcription Circuit.” Science 367: 96–100. doi: 10.1126/science.aax5217
2. Levin, Doron, Raab Neta, Pinto Yishay, Rothschild Daphna, Zanir Gal, Godneva Anastasia, Mellul Nadav, et al. 2021. “Diversity and Functional Landscapes in the Microbiota of Animals in the Wild.” Science 372(6539): eabb5352. doi: 10.1126/science.abb5352
3. North, Justin A., Narrowe Adrienne B., Xiong Weili, Byerly Kathryn M., Zhao Guanqi, Young Sarah J., Murali Srividya, et al. 2020. “A Nitrogenase‐like Enzyme System Catalyzes Methionine, Ethylene, and Methane Biogenesis.” Science 369: 1094–98. doi: 10.1126/science.abb6310
4. Zhang, Chengxin, Zheng Wei, Freddolino Peter L., and Zhang Yang. 2018. “MetaGO: Predicting Gene Ontology of Non‐Homologous Proteins Through Low‐Resolution Protein Structure Prediction and Protein‐Protein Network Mapping.” Journal of Molecular Biology 430: 2256–65. doi: 10.1016/j.jmb.2018.03.004
5. Zheng, Wei, Zhang Chengxin, Li Yang, Pearce Robin, Bell Eric W., and Zhang Yang. 2021. “Folding Non‐Homologous Proteins by Coupling Deep‐Learning Contact Maps with I‐TASSER Assembly Simulations.” Cell Reports Methods 1(3): 100014. doi: 10.1016/j.crmeth.2021.100014
