Nat Commun. 2023 Apr 26;14(1):2351.

doi: 10.1038/s41467-023-37896-w.

Sequence-structure-function relationships in the microbial protein universe

Julia Koehler Leman^#^{1,2}, Pawel Szczerbiak^#³, P Douglas Renfrew^#^{4,5}, Vladimir Gligorijevic^{4,6}, Daniel Berenberg^{4,6,7,8}, Tommi Vatanen^{9,10,11}, Bryn C Taylor^{12,13}, Chris Chandler⁴, Stefan Janssen^{14,15}, Andras Pataki¹⁶, Nick Carriero¹⁶, Ian Fisk¹⁶, Ramnik J Xavier^{9,17}, Rob Knight^{12,14,18,19}, Richard Bonneau^{4,5,7,8,6}, Tomasz Kosciolek^#²⁰

Affiliations

¹ Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA. julia.koehler.leman@gmail.com.
² Department of Biology, New York University, New York, NY, USA. julia.koehler.leman@gmail.com.
³ Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland.
⁴ Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA.
⁵ Department of Biology, New York University, New York, NY, USA.
⁶ Prescient Design, a Genentech accelerator, New York, NY, 10010, USA.
⁷ Center for Data Science, New York University, New York, NY, 10011, USA.
⁸ Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, USA.
⁹ Broad Institute, Cambridge, MA, USA.
¹⁰ Liggins Institute, University of Auckland, Auckland, New Zealand.
¹¹ Research Program for Clinical and Molecular Metabolism, Faculty of Medicine, 00014 University of Helsinki, Helsinki, Finland.
¹² Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.
¹³ In Silico Discovery and External Innovation, Janssen Research and Development, San Diego, CA, 92122, USA.
¹⁴ Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, 92093, USA.
¹⁵ Algorithmic Bioinformatics, Justus Liebig University Giessen, Giessen, Germany.
¹⁶ Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA.
¹⁷ Center for Microbiome Informatics and Therapeutics, MIT, Cambridge, MA, 02139, USA.
¹⁸ Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
¹⁹ Department of Bioengineering, University of California, San Diego, USA.
²⁰ Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland. tomasz.kosciolek@uj.edu.pl.

^# Contributed equally.

PMID: 37100781
PMCID: PMC10133388
DOI: 10.1038/s41467-023-37896-w

Julia Koehler Leman et al. Nat Commun. 2023.

Abstract

For the past half-century, structural biologists have relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that don't rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We predict ~200,000 structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life and annotate them functionally on a per-residue basis. Structure prediction is accomplished using the World Community Grid, a large-scale citizen science initiative. The resulting database of structural models is complementary to the AlphaFold database, with regard to domains of life as well as sequence diversity and sequence length. We identify 148 novel folds and describe examples where we map specific functions to structural motifs. We also show that the structural space is continuous and largely saturated, highlighting the need for a shift in focus across all branches of biology, from obtaining structures to putting them into context and from sequence-based to sequence-structure-function-based meta-omics analyses.

Conflict of interest statement

R.B., V.G., and D.B. are currently working at Genentech and no explicit conflicts of interest result from this change in affiliation. All other authors declare no competing interests.

Figures

**Fig. 1. The fold space covered by the microbial protein structure universe is continuous.**
a Flowchart of our process to arrive at ~200,000 de novo protein models covering a diverse sequence space. b The sequence length distribution shows that our sequences are shorter than many of the proteins in the PDB, CATH or AlphaFold databases, as expected. We predicted structures between 40 and 200 residues long, which covers the majority of length distributions in microbial proteins, which are often shorter than eukaryotic sequences. c The protein structure universe in UMAP space is color-coded according to features, such as similarity to CATH classes, sequence length, number of helical transmembrane spans, and relative contact order. d Novel folds (blue dots) are spread throughout the fold space with fewer representatives in the purely α-helical and purely β-sheet folds. [Icons in panel (a) were created by Ronald Vermeijs and Maxim Kulikov for the Noun Project, licensed under the Creative Commons license CC BY 3.0]. Source data for this figure are provided in the source data file.

**Fig. 2. Sequence-structure-function relationships in both PDB and the MIP dataset.**
Pairwise comparisons of protein sequences (using sequence identity), structures (TM-score), and functions (cosine similarity between DeepFRI output vectors) for two datasets: a baseline from the PDB and the *MIP_random5000_curated* dataset, containing 3,052 Rosetta generated models (see Tables S2 and S3). The PDB baseline dataset contains 1000 chains covering pairwise sequence similarities between 0 and 100% while the MIP dataset is a non-redundant set with mostly dissimilar sequences (sequence identity <30% threshold was imposed before sequential domain splitting). Analyses of these two datasets in this way lead us to the following conclusions: sequence identity and structural similarity follow a known trend (Supplementary Fig. 72) (a), yet high structural similarity can be achieved by low sequence identity (b). High sequence identity (sequence identity > 50%) leads to high functional similarity (cosine similarity > 0.5) (c), yet high functional similarity can be achieved by proteins with low sequence identity (d). Structural similarity often correlates with functional similarity ((e) and quadrants II and III in (f)). However, there are plenty of examples where low structural similarity can be seen in proteins with high functional similarity (quadrant I in (f)), and highly similar structures can exhibit different functions (quadrant IV in (f)). Source data for this figure are provided in the source data file.
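The pairwise comparison described in this caption can be illustrated with a minimal Python sketch. It computes cosine similarity between two function-score vectors and assigns a protein pair to one of the quadrants named for panel (f). The functional cutoff of 0.5 is the one quoted in the caption; the TM-score cutoff of 0.5 is an assumption made here (a commonly used same-fold convention), as is the choice that quadrant II holds pairs similar in both respects and quadrant III pairs dissimilar in both. The function names and the toy vectors are hypothetical and are not taken from the paper or from DeepFRI.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length score vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def quadrant(tm_score, func_sim, tm_cut=0.5, func_cut=0.5):
    """Label a protein pair with a quadrant as described for panel (f).

    tm_cut is an assumed threshold; func_cut follows the caption.
    """
    similar_structure = tm_score > tm_cut
    similar_function = func_sim > func_cut
    if similar_function and not similar_structure:
        return "I"    # different structures, similar function
    if similar_function and similar_structure:
        return "II"   # assumed: similar in both respects
    if not similar_function and not similar_structure:
        return "III"  # assumed: dissimilar in both respects
    return "IV"       # similar structures, different functions


# Toy function-score vectors (hypothetical values)
protein_a = [0.9, 0.1, 0.0]
protein_b = [0.8, 0.2, 0.0]
similarity = cosine_similarity(protein_a, protein_b)
print(quadrant(tm_score=0.3, func_sim=similarity))
```

With these toy inputs the two vectors point in nearly the same direction, so a pair with a low TM-score falls into the quadrant of structurally different but functionally similar proteins.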

**Fig. 3. Functional diversity of proteins with the same structure.**
We show examples from several structural clusters (Rosetta models) that exhibit novel folds. The heatmaps show functional similarity (cosine similarity of the function vectors) of protein pairs within the cluster. Proteins that have predicted functions with scores <0.1 are shown in gray in the heatmaps. Asterisks highlight the examples shown below. a, b Cases where the same structural motif in two different proteins produces different, unrelated functions. c–e Cases where the same function is generated by different structural motifs in different proteins, even though the proteins have the same fold. Source data for this figure are provided in the source data file.

**Fig. 4. Structural diversity of proteins with the same function.**
We examine proteins that have the same function and plot the TM-score as a measure of structural similarity as a heatmap, with larger numbers (more yellow) representing more similar structures. We also map the residue-specific function predictions onto the structures on the right, where residues in red are responsible for the functions. a Gene ontology molecular function carbohydrate binding with GO number GO:0030246. Except for the protein shown in (F) which has high helical propensity, the proteins in this functional cluster have high β-sheet content. The largest cluster in the heatmap in yellow is also the largest novel-fold cluster. The salient residues responsible for this function overlay nicely across the proteins in this cluster. b Gene ontology biological process function ‘maintenance of CRISPR repeat elements’ with GO number GO:0043571. The largest cluster highlighted in yellow superimposes with Cas2 and the salient residues in red interact with DNA. c Enzyme commission number EC 4.99.1. with the function ‘Sole sub-class for lyases that do not belong in the other subclasses’. All structures in this functional cluster have the same fold and the salient residues responsible for this function overlay onto the same structural motif in the protein. More details in the text. Source data for this figure are provided in the source data file.

Grants and funding

P30 DK043351/DK/NIDDK NIH HHS/United States
