Nat Commun. 2023 Apr 26;14(1):2351.
doi: 10.1038/s41467-023-37896-w.

Sequence-structure-function relationships in the microbial protein universe

Julia Koehler Leman et al. Nat Commun. 2023.

Abstract

For the past half-century, structural biologists have relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that don't rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We predict ~200,000 structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life and annotate them functionally on a per-residue basis. Structure prediction is accomplished using the World Community Grid, a large-scale citizen science initiative. The resulting database of structural models is complementary to the AlphaFold database with regard to domains of life as well as sequence diversity and sequence length. We identify 148 novel folds and describe examples where we map specific functions to structural motifs. We also show that the structural space is continuous and largely saturated, highlighting the need for a shift in focus across all branches of biology, from obtaining structures to putting them into context and from sequence-based to sequence-structure-function based meta-omics analyses.

Conflict of interest statement

R.B., V.G., and D.B. are currently working at Genentech and no explicit conflicts of interest result from this change in affiliation. All other authors declare no competing interests.

Figures

Fig. 1. The fold space covered by the microbial protein structure universe is continuous.
a Flowchart of our process to arrive at ~200,000 de novo protein models covering a diverse sequence space. b The sequence length distribution shows that our sequences are shorter than many of the proteins in the PDB, CATH or AlphaFold databases, as expected. We predicted structures between 40 and 200 residues long, a range that covers most of the length distribution of microbial proteins, which are often shorter than eukaryotic sequences. c The protein structure universe in UMAP space is color-coded according to features, such as similarity to CATH classes, sequence length, number of helical transmembrane spans, and relative contact order. d Novel folds (blue dots) are spread throughout the fold space with fewer representatives in the purely α-helical and purely β-sheet folds. [Icons in panel (a) were created by Ronald Vermeijs and Maxim Kulikov for the Noun Project, licensed under the Creative Commons license CC BY 3.0]. Source data for this figure are provided in the source data file.
Fig. 2. Sequence-structure-function relationships in both PDB and the MIP dataset.
Pairwise comparisons of protein sequences (using sequence identity), structures (TM-score), and functions (cosine similarity between DeepFRI output vectors) for two datasets: a baseline from the PDB and the MIP_random5000_curated dataset, containing 3,052 Rosetta-generated models (see Tables S2 and S3). The PDB baseline dataset contains 1,000 chains covering pairwise sequence similarities between 0 and 100%, while the MIP dataset is a non-redundant set with mostly dissimilar sequences (a sequence identity <30% threshold was imposed before sequential domain splitting). Analyses of these two datasets in this way lead us to the following conclusions: sequence identity and structural similarity follow a known trend (Supplementary Fig. 72) (a), yet high structural similarity can be achieved by low sequence identity (b). High sequence identity (sequence identity > 50%) leads to high functional similarity (cosine similarity > 0.5) (c), yet high functional similarity can be achieved by proteins with low sequence identity (d). Structural similarity often correlates with functional similarity ((e) and quadrants II and III in (f)). However, there are plenty of examples where low structural similarity can be seen in proteins with high functional similarity (quadrant I in (f)), and highly similar structures can exhibit different functions (quadrant IV in (f)). Source data for this figure are provided in the source data file.
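The functional comparison in this caption can be illustrated with a minimal sketch, assuming each protein is represented by a vector of per-term function scores (as DeepFRI outputs are described here). The function names, the example vectors, and the TM-score cutoff of 0.5 are assumptions for illustration only; the caption itself states only the cosine similarity > 0.5 criterion for high functional similarity.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length function score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def classify_pair(tm_score, func_sim, tm_cutoff=0.5, func_cutoff=0.5):
    """Sort a protein pair into one of four structure/function regimes.

    tm_cutoff is an assumed threshold (not given in the caption);
    func_cutoff follows the caption's cosine similarity > 0.5 criterion.
    """
    similar_structure = tm_score >= tm_cutoff
    similar_function = func_sim > func_cutoff
    if similar_structure and similar_function:
        return "similar structure, similar function"
    if similar_structure:
        return "similar structure, different function"
    if similar_function:
        return "different structure, similar function"
    return "different structure, different function"


# Hypothetical function vectors for two proteins (three function terms each).
protein_a = [0.9, 0.1, 0.0]
protein_b = [0.8, 0.2, 0.0]
similarity = cosine_similarity(protein_a, protein_b)
label = classify_pair(tm_score=0.3, func_sim=similarity)
```

With these made-up inputs the pair has a low TM-score but a high functional similarity, the regime the caption describes for quadrant I in (f).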
Fig. 3. Functional diversity of proteins with the same structure.
We show examples from several structural clusters (Rosetta models) that exhibit novel folds. The heatmaps show functional similarity (cosine similarity of the function vectors) of protein pairs within the cluster. Proteins that have predicted functions with scores <0.1 are shown in gray in the heatmaps. Asterisks highlight the examples shown below. a, b Cases where the same structural motif in two different proteins produces different, unrelated functions. c–e Cases where the same function is generated by different structural motifs in different proteins, even though the proteins have the same fold. Source data for this figure are provided in the source data file.
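A heatmap of this kind could be assembled roughly as follows. This is a sketch under stated assumptions, not the authors' code: it reads the gray rule as "a protein is grayed out when none of its predicted function scores reaches 0.1", and the helper names and example vectors are hypothetical. Grayed cells are represented as None.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms > 0.0 else 0.0


def functional_heatmap(vectors, min_score=0.1):
    """Pairwise functional similarity matrix for one structural cluster.

    Assumes every vector is non-empty. A protein whose highest function
    score is below min_score is treated as unannotated (assumed reading
    of the caption), and every cell involving it is None ("gray").
    """
    confident = [max(v) >= min_score for v in vectors]
    n = len(vectors)
    matrix = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if confident[i] and confident[j]:
                matrix[i][j] = cosine(vectors[i], vectors[j])
    return matrix


# Hypothetical cluster of three proteins scored on two function terms.
cluster = [
    [0.9, 0.0],    # confidently predicted to have the first function
    [0.0, 0.8],    # confidently predicted to have the second function
    [0.05, 0.02],  # no score reaches 0.1, so this protein is grayed out
]
heatmap = functional_heatmap(cluster)
```

In this toy cluster the first two proteins share a fold by construction but have a functional similarity of zero, which mirrors the situation the caption describes for panels a and b.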
Fig. 4. Structural diversity of proteins with the same function.
We examine proteins that have the same function and plot the TM-score as a measure of structural similarity as a heatmap, with larger numbers (more yellow) representing more similar structures. We also map the residue-specific function predictions onto the structures on the right, where residues in red are responsible for the functions. a Gene ontology molecular function carbohydrate binding with GO number GO:0030246. Except for the protein shown in (F), which has high helical propensity, the proteins in this functional cluster have high β-sheet content. The largest cluster in the heatmap in yellow is also the largest novel-fold cluster. The salient residues responsible for this function overlay nicely across the proteins in this cluster. b Gene ontology biological process function ‘maintenance of CRISPR repeat elements’ with GO number GO:0043571. The largest cluster highlighted in yellow superimposes with Cas2 and the salient residues in red interact with DNA. c Enzyme commission number EC 4.99.1. with the function ‘Sole sub-class for lyases that do not belong in the other subclasses’. All structures in this functional cluster have the same fold and the salient residues responsible for this function overlay onto the same structural motif in the protein. More details in the text. Source data for this figure are provided in the source data file.
