Proc Natl Acad Sci U S A. 2021 Dec 7;118(49):e2110828118.
doi: 10.1073/pnas.2110828118.

Decoding the link of microbiome niches with homologous sequences enables accurately targeted protein structure prediction

Pengshuo Yang et al. Proc Natl Acad Sci U S A.

Abstract

Information derived from metagenome sequences through deep-learning techniques has significantly improved the accuracy of template-free protein structure modeling. However, most deep learning-based modeling studies rely on blind sequence database searches and suffer from low efficiency in computational resource utilization and model construction, especially when the sequence library becomes prohibitively large. We proposed a MetaSource model built on 4.25 billion microbiome sequences from four major biomes (Gut, Lake, Soil, and Fermentor) to decode the inherent linkage of microbial niches with protein homologous families. Large-scale protein family folding experiments on 8,700 unknown Pfam families showed that a microbiome-targeted approach, with multiple sequence alignments constructed from individual MetaSource biomes, requires more than threefold less computer memory and CPU (central processing unit) time but generates contact-map and three-dimensional structure models with significantly higher accuracy than an approach using combined metagenome datasets. These results demonstrate an avenue to bridge the gap between the rapidly increasing metagenome databases and the limited computing resources for efficient genome-wide database mining, which provides a useful bluebook to guide future microbiome sequence database and modeling development for high-accuracy protein structure and function prediction.

Keywords: deep learning; microbiome; multiple sequence alignments; protein homologous families; protein structure prediction.

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
Taxonomic and functional profiling for different microbiome samples. (A) The basic statistics of microbiome samples collected from the four biomes. (B) Species distribution at the phylum level for samples in the four biomes. The species distribution is categorized by biome and labeled with different colors. For all the samples, the top 10 phyla ranked by the average counts among all samples are illustrated. “Unassigned” means the species cannot be assigned to a known phylum. “Other” represents the combination of the rest of the phyla. (C) Top-five genera ranked by relative abundance for the four biomes. (D) PCoA result based on the taxonomic profile at the genus level for samples from the four biomes. Samples from the same biome are labeled with the same color. The CIs of samples in the same biome are marked with circles. (E) The shared and specific functional distribution for the four biomes. The numbers labeled in the figure are the numbers (in billions) of specific or shared sequences annotated by the GO database. (F) PCoA result based on the functional distribution for samples from the four biomes, based on GO annotation. Samples from the same biome are labeled with the same color. The CIs of samples in the same biome are marked with circles.
Fig. 2.
Structural modeling results for unknown Pfam Hard families. (A) Number of Pfam families at each stage of the analysis, where each set is a subset of the previous set. (B) The C-score distribution of the Pfam Hard families with Neff >16. (C) Structural models on 13 newly solved Pfam families with C-score >−2.5. In each case, the C-I-TASSER model is shown in rainbow color, and the solved experimental structure of a member from the same Pfam family is shown in gray.
Fig. 3.
The taxonomic and functional properties of the Pfam families foldable by C-I-TASSER. (A) C-score distribution for Pfam families after replenishing with metagenome sequences. The vertical axis represents the C-score. For each panel, the horizontal axis represents the Pfam families (31). (B) The relative abundance of species for Pfam families that were foldable by C-I-TASSER. The species distribution is divided into four biomes and labeled with different colors. The top 10 phyla, ranked by the average count among all samples, are illustrated. “Other” represents the combination of the rest of the phyla. (C) Proteins in PF09828 are involved in the reduction of chromate accumulation and are essential for chromate resistance. Bacteria hosted in plants produce the proteins identified as PF09828 to reduce the accumulation of chromate, which results in faster plant growth and prevents the transmission of cadmium to humans through the food chain, where it would lead to cadmium poisoning. For all the Pfam families that were foldable by C-I-TASSER, after the Pfam species were aligned to the InterPro database, their protein functions were annotated with GO terms and classified under the three top-level GO categories: Biological Process (D), Molecular Function (E), and Cellular Component (F).
Fig. 4.
Evaluation of the marginal effect for Pfam families. The distributions of homologous sequences collected from the four biomes are illustrated for Pfam families (A) PF04213, (B) PF10785, (C) PF13864, and (D) PF12357, where the source biome of each family was estimated by MetaSource. (E) The sequence distribution of metagenome data from the four biomes for all 8,700 Pfam families with unsolved structures. After the sequences from each of the four biomes were aligned to the 8,700 Pfam families with unsolved structures, the marginal effect was estimated by comparing the number of homologous sequences in each Pfam family before and after the use of the metagenome sequences. (F) Marginal effect categorized by protein structure estimate scores.
Fig. 5.
The source biomes predicted by MetaSource for Pfam families. (A) The receiver operating characteristic (ROC) analysis of the binary-classification MetaSource model. This model was constructed to determine whether the source biome of the query Pfam family is one of the four biomes. (B) The ROC analysis of the multiclass MetaSource model. This model was constructed to predict the source biome for Pfam families. To evaluate the overall prediction accuracy, the microaverage (obtained by aggregating the contributions of all classes to compute the average metric) and the macroaverage (obtained by computing the metric independently for each class and then taking the average) were applied. (C) The classification of all the Pfam families based on the prediction result of the MetaSource model. (D) Average TM-score, accuracy of top-L contacts, and average MSA search time for the combined and MetaSource-predicted biome datasets. (E) Case studies of modeling Pfam families (PF08941 and PF00737) with MSAs from different biomes. The model with the highest TM-score is shown in blue font. The model marked with a red frame corresponds to the source biome predicted by MetaSource.
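
The difference between the microaverage and macroaverage mentioned in the Fig. 5 caption can be illustrated with a minimal sketch. It is not the authors' evaluation code: for simplicity it averages recall (true-positive rate) rather than full ROC curves, and the labels, function names, and class imbalance below are invented purely for illustration.

```python
from collections import Counter

# The four biome names come from the abstract; everything else is hypothetical.
BIOMES = ["Gut", "Lake", "Soil", "Fermentor"]


def per_class_counts(y_true, y_pred, classes):
    """Return {class: (true_positives, false_negatives)} in a one-vs-rest view."""
    tp, fn = Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1
    return {c: (tp[c], fn[c]) for c in classes}


def macro_recall(counts):
    """Compute recall separately for each class, then take the plain mean."""
    recalls = [tp / (tp + fn) for tp, fn in counts.values() if tp + fn > 0]
    return sum(recalls) / len(recalls)


def micro_recall(counts):
    """Pool the counts of all classes first, then compute a single recall."""
    total_tp = sum(tp for tp, _ in counts.values())
    total = sum(tp + fn for tp, fn in counts.values())
    return total_tp / total


# Invented, deliberately imbalanced toy labels (not data from the paper).
y_true = ["Gut"] * 6 + ["Lake"] * 2 + ["Soil"] + ["Fermentor"]
y_pred = ["Gut"] * 6 + ["Lake", "Gut"] + ["Gut"] + ["Fermentor"]

counts = per_class_counts(y_true, y_pred, BIOMES)
micro = micro_recall(counts)
macro = macro_recall(counts)
print(f"micro-averaged recall: {micro:.3f}")
print(f"macro-averaged recall: {macro:.3f}")
```

With imbalanced classes the two averages diverge: the microaverage is dominated by the largest class, while the macroaverage weights every biome equally, so a poorly predicted small class lowers it more.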

References

    1. Baker D., Sali A., Protein structure prediction and structural genomics. Science 294, 93–96 (2001).
    2. Zhang Y., Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 18, 342–348 (2008).
    3. Simons K. T., Kooperberg C., Huang E., Baker D., Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209–225 (1997).
    4. Xu D., Zhang Y., Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins 80, 1715–1735 (2012).
    5. Yang J., et al., The I-TASSER Suite: Protein structure and function prediction. Nat. Methods 12, 7–8 (2015).
