Proc Natl Acad Sci U S A. 2021 Dec 7;118(49):e2110828118.

doi: 10.1073/pnas.2110828118.

Decoding the link of microbiome niches with homologous sequences enables accurately targeted protein structure prediction

Pengshuo Yang¹, Wei Zheng², Kang Ning³, Yang Zhang⁴,⁵

Affiliations

¹ Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.
² Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109.
³ Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-imaging, Center of Artificial Intelligence Biology, Department of Bioinformatics and Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China; ningkang@hust.edu.cn zhng@umich.edu.
⁴ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109; ningkang@hust.edu.cn zhng@umich.edu.
⁵ Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109.

PMID: 34873061
PMCID: PMC8670487
DOI: 10.1073/pnas.2110828118

Pengshuo Yang et al. Proc Natl Acad Sci U S A. 2021.

Abstract

Information derived from metagenome sequences through deep-learning techniques has significantly improved the accuracy of template-free protein structure modeling. However, most deep learning-based modeling studies rely on blind sequence database searches and suffer from low efficiency in computational resource utilization and model construction, especially when the sequence library becomes prohibitively large. We proposed a MetaSource model built on 4.25 billion microbiome sequences from four major biomes (Gut, Lake, Soil, and Fermentor) to decode the inherent linkage of microbial niches with protein homologous families. Large-scale protein family folding experiments on 8,700 unknown Pfam families showed that a microbiome-targeted approach, with multiple sequence alignments constructed from individual MetaSource biomes, requires more than threefold less computer memory and CPU (central processing unit) time but generates contact-map and three-dimensional structure models with significantly higher accuracy than an approach using combined metagenome datasets. These results demonstrate an avenue to bridge the gap between the rapidly increasing metagenome databases and the limited computing resources for efficient genome-wide database mining, which provides a useful bluebook to guide future microbiome sequence database and modeling development for high-accuracy protein structure and function prediction.

Keywords: deep learning; microbiome; multiple sequence alignments; protein homologous families; protein structure prediction.

Conflict of interest statement

The authors declare no competing interest.

Figures

**Fig. 1.**
Taxonomic and functional profiling for different microbiome samples. (A) The basic statistics of microbiome samples collected from the four biomes. (B) Species distribution at the phylum level for samples in the four biomes. The species distribution is categorized by biome and labeled with different colors. For all the samples, the top 10 phyla ranked by the average counts among all samples are illustrated. “Unassigned” means the species cannot be assigned to a known phylum. “Other” represents the combination of the rest of the phyla. (C) Top five genera ranked by relative abundance for the four biomes. (D) PCoA result based on the taxonomic profile at the genus level for samples from the four biomes. Samples from the same biome are labeled with the same color. The CIs of samples in the same biome are marked with circles. (E) The shared and specific functional distribution for the four biomes. The numbers labeled in the figure are the counts (in billions) of specific or shared sequences annotated by the GO database. (F) PCoA result based on the functional distribution for samples from the four biomes, using GO annotation. Samples from the same biome are labeled with the same color. The CIs of samples in the same biome are marked with circles.

**Fig. 2.**
Structural modeling results for unknown Pfam Hard families. (A) Number of Pfam families at each stage of the analysis, where each set is a subset of the previous set. (B) The C-score distribution of the Pfam Hard families with *Neff* >16. (C) Structural models on 13 newly solved Pfam families with C-score >−2.5. In each case, the C-I-TASSER model is shown in rainbow color, and the solved experimental structure of a member from the same Pfam family is shown in gray.

**Fig. 3.**
The taxonomic and functional properties of the Pfam families foldable by C-I-TASSER. (A) C-score distribution for Pfam families after replenishment with metagenome sequences. The vertical axis represents the C-score. For each panel, the horizontal axis represents the Pfam families (31). (B) The relative abundance of species for Pfam families that were foldable by C-I-TASSER. The species distribution is divided into four biomes and labeled with different colors. The top 10 phyla, ranked by the average count among all samples, are illustrated. “Other” represents the combination of the rest of the phyla. (C) Proteins in PF09828 are involved in the reduction of chromate accumulation and are essential for chromate resistance. Bacteria hosted in plants produce the proteins identified as PF09828 to reduce the accumulation of chromate, resulting in fast growth of the plant and preventing the transmission of cadmium to humans through the food chain, which leads to cadmium poisoning. For all the Pfam families that were foldable by C-I-TASSER, after aligning the Pfam species to the InterPro database, their protein functions were annotated by GO annotations and classified into the three top-level categories: Biological Process (D), Molecular Function (E), and Cellular Component (F).

**Fig. 4.**
Evaluation of the marginal effect for Pfam families. The distributions of homologous sequences collected from the four biomes are illustrated for Pfam families (A) PF04213, (B) PF10785, (C) PF13864, and (D) PF12357, where the source biome of each family was estimated by MetaSource. (E) The sequence distribution of metagenome data from the four biomes for all 8,700 Pfam families with unsolved structures. After the sequences from each of the four biomes were aligned to the 8,700 Pfam families with unsolved structures, the marginal effect was estimated by comparing the number of homologous sequences of each Pfam family before and after the use of the metagenome sequences. (F) Marginal effect categorized by protein structure estimate scores.

**Fig. 5.**
The source biomes predicted by MetaSource for Pfam families. (A) The receiver operating characteristic (ROC) analysis of the binary-classification MetaSource model. This model was constructed to determine whether the source biome of the query Pfam family is one of the four biomes. (B) The ROC analysis of the multiple-classification MetaSource model. This model was constructed to predict the source biome for Pfam families. To evaluate the overall prediction accuracy, the microaverage (obtained by aggregating the contributions of all classes to compute the average metric) and the macroaverage (obtained by computing the metric independently for each class and then taking the average) were applied. (C) The classification result for all the Pfam families based on the predictions of the MetaSource model. (D) Average TM-score, accuracy of top-L contacts, and average MSA search time for the combined and MetaSource-predicted biome datasets. (E) Case studies of modeling Pfam families (PF08941 and PF00737) with MSAs from different biomes. The model with the highest TM-score is shown in blue font. The model marked with a red frame is from the source biome predicted by MetaSource.
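The distinction between microaveraging and macroaveraging described in the Fig. 5 caption can be illustrated with a minimal Python sketch. It is not the MetaSource code: for brevity it averages precision rather than full ROC curves, and the function names and the ten toy labels are hypothetical, chosen only to show how the two averages diverge when one class dominates.

```python
BIOMES = ["Gut", "Lake", "Soil", "Fermentor"]

def per_class_counts(y_true, y_pred, classes):
    """Return {class: (tp, fp, fn)}, treating each class one-vs-rest."""
    counts = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts

def macro_precision(counts):
    """Compute precision independently per class, then take the plain mean."""
    # A class with no predictions is scored 0.0 here (an assumption).
    vals = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _ in counts.values()]
    return sum(vals) / len(vals)

def micro_precision(counts):
    """Pool the counts of all classes first, then compute one precision."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    return tp / (tp + fp) if tp + fp else 0.0

# Hypothetical toy labels (not data from the paper): Gut is over-represented.
y_true = ["Gut"] * 6 + ["Lake"] * 2 + ["Soil"] + ["Fermentor"]
y_pred = ["Gut"] * 6 + ["Gut", "Lake"] + ["Soil"] + ["Gut"]

counts = per_class_counts(y_true, y_pred, BIOMES)
print("macro:", macro_precision(counts))  # 0.6875
print("micro:", micro_precision(counts))  # 0.8
```

In this toy case the microaverage is higher because the large, well-predicted Gut class dominates the pooled counts, while the macroaverage gives the missed Fermentor class the same weight as every other class.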

Cited by

Petascale Homology Search for Structure Prediction.
Lee S, Kim G, Karin EL, Mirdita M, Park S, Chikhi R, Babaian A, Kryshtafovych A, Steinegger M. bioRxiv [Preprint]. 2023 Jul 11:2023.07.10.548308. doi: 10.1101/2023.07.10.548308. Update in: Cold Spring Harb Perspect Biol. 2024 May 2;16(5):a041465. doi: 10.1101/cshperspect.a041465. PMID: 37503235.
Characterization of Treponema denticola Major Surface Protein (Msp) by Deletion Analysis and Advanced Molecular Modeling.
Goetting-Minesky MP, Godovikova V, Zheng W, Fenno JC. J Bacteriol. 2022 Sep 20;204(9):e0022822. doi: 10.1128/jb.00228-22. Epub 2022 Aug 1. PMID: 35913147.
Improving AlphaFold2- and AlphaFold3-Based Protein Complex Structure Prediction With MULTICOM4 in CASP16.
Liu J, Neupane P, Cheng J. Proteins. 2025 Jun 2:10.1002/prot.26850. doi: 10.1002/prot.26850. Online ahead of print. PMID: 40452318.
Designing of thiazolidinones against chicken pox, monkey pox, and hepatitis viruses: A computational approach.
Raza MA, Farwa U, Ishaque F, Al-Sehemi AG. Comput Biol Chem. 2023 Apr;103:107827. doi: 10.1016/j.compbiolchem.2023.107827. Epub 2023 Feb 12. PMID: 36805155.
Integrating deep learning, threading alignments, and a multi-MSA strategy for high-quality protein monomer and complex structure prediction in CASP15.
Zheng W, Wuyun Q, Freddolino L, Zhang Y. Proteins. 2023 Dec;91(12):1684-1703. doi: 10.1002/prot.26585. Epub 2023 Aug 31. PMID: 37650367.
