Nature. 2023 Oct;622(7983):637-645.

doi: 10.1038/s41586-023-06510-w. Epub 2023 Sep 13.

Clustering predicted structures at the scale of the known protein universe

Inigo Barrio-Hernandez^#¹, Jingi Yeo^#², Jürgen Jänes³, Milot Mirdita², Cameron L M Gilchrist², Tanita Wein⁴, Mihaly Varadi¹, Sameer Velankar¹, Pedro Beltrao^{5,6}, Martin Steinegger^{7,8,9}

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK.
² School of Biological Sciences, Seoul National University, Seoul, South Korea.
³ Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland.
⁴ Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel.
⁵ Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland. beltrao@imsb.biol.ethz.ch.
⁶ Swiss Institute of Bioinformatics, Lausanne, Switzerland. beltrao@imsb.biol.ethz.ch.
⁷ School of Biological Sciences, Seoul National University, Seoul, South Korea. martin.steinegger@snu.ac.kr.
⁸ Artificial Intelligence Institute, Seoul National University, Seoul, South Korea. martin.steinegger@snu.ac.kr.
⁹ Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea. martin.steinegger@snu.ac.kr.

^# Contributed equally.

PMID: 37704730
PMCID: PMC10584675
DOI: 10.1038/s41586-023-06510-w

Inigo Barrio-Hernandez et al. Nature. 2023 Oct.

Abstract

Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy¹, and over 214 million predicted structures are available in the AlphaFold database². However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we have clustered all of the structures in the AlphaFold database, identifying 2.30 million non-singleton structural clusters, of which 31% lack annotations, representing probable previously undescribed structures. Clusters without annotation tend to have few representatives, covering only 4% of all proteins in the AlphaFold database. Evolutionary analysis suggests that most clusters are ancient in origin but 4% seem to be species specific, representing lower-quality predictions or examples of de novo gene birth. We also show how structural comparisons can be used to predict domain families and their relationships, identifying examples of remote structural similarity. On the basis of these analyses, we identify several examples of human immune-related proteins with putative remote homology in prokaryotic species, illustrating the value of this resource for studying protein function and evolution across the tree of life.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. The AFDB, structural clustering workflow and summary of the clusters.**
a, The AFDB started as a collaborative effort between EMBL-EBI and DeepMind in 2021. The database grew in multiple stages, with the latest version of 2022 containing over 214 million predicted protein structures and their confidence metrics. b, A two-step approach was used to cluster proteins in the database. First, MMseqs2 was used to cluster 214 million UniProtKB protein sequences (AFDB) on the basis of 50% sequence identity and 90% sequence overlap, resulting in a reduction of the database size to 52 million clusters (AFDB50). For each cluster, the protein with the highest pLDDT score was selected as the representative. Next, using Foldseek, the representative structures were clustered into 18.8 million clusters (Foldseek clusters) without a sequence identity threshold, but still enforcing a 90% sequence overlap and an E-value of less than 0.01 for each structural alignment. As the last step, we removed all sequences labelled as fragments from the clustering, ending up with 2.30 million clusters with at least two structures (AFDB clusters). c, AFDB cluster structural and Pfam consistency. Our clusters have a median LDDT of 0.77 and a median TM score of 0.71 across all clusters and 66.5% of clusters with Pfam annotations are 100% consistent. d, Summary of sequences and clusters with and without annotation (left) and the relationship of cluster sizes to annotation (right). From left to right, each bin occupies AFDB clusters at rates of 12.24%, 10.59%, 9.20%, 10.07%, 10.46%, 10.05%, 9.04%, 9.20%, 9.19% and 9.96%.
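
The selection rules described in panel b can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the MMseqs2 or Foldseek implementation: the `Member` and `Alignment` records, their field names, and the reading of "90% sequence overlap" as aligned length divided by the length of both query and target are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    accession: str  # hypothetical identifier, for illustration only
    plddt: float    # mean pLDDT of the predicted structure


@dataclass(frozen=True)
class Alignment:
    aligned_len: int
    query_len: int
    target_len: int
    evalue: float


def pick_representative(members):
    """Return the cluster member with the highest pLDDT."""
    return max(members, key=lambda m: m.plddt)


def passes_structural_filter(aln, min_cov=0.9, max_evalue=0.01):
    """Apply the caption's thresholds: 90% overlap and E-value below 0.01.

    Assumption: overlap is required on both the query and the target.
    """
    cov_query = aln.aligned_len / aln.query_len
    cov_target = aln.aligned_len / aln.target_len
    return cov_query >= min_cov and cov_target >= min_cov and aln.evalue < max_evalue


cluster = [Member("A", 71.2), Member("B", 93.5), Member("C", 88.0)]
representative = pick_representative(cluster)
full_hit = passes_structural_filter(Alignment(95, 100, 102, 1e-5))
partial_hit = passes_structural_filter(Alignment(60, 100, 102, 1e-5))
```

Here the full-length hit is accepted and the partial hit is rejected, which is how a coverage threshold keeps single-domain matches from merging multi-domain proteins into one cluster.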

**Fig. 2. Putative novel enzymes and small-molecule-binding proteins in structures lacking annotation.**
a, Counts of GO molecular function terms that are most often predicted by DeepFRI on the set of selected 1,707 structures with predicted pockets. b–d, Examples of structures (A0A849TG76 and A0A2D8BRH7 (b), A0A849ZK06 (c) and S0EUL8 (d)) with predicted pockets and functional annotations. Each example shows the UniProt ID (top), the highest-scoring DeepFRI function prediction (bottom) and the top-scoring pocket (pink surface). The structures are coloured by residue-level contributions to the DeepFRI function predictions, ranging from blue (no contribution) to yellow (strong contribution).

**Fig. 3. Evolutionary distribution of clusters and human-centric cluster analysis.**
a, Visualization of the LCA of all non-singleton clusters as a Sankey plot produced by Pavian. Only the largest 13 taxonomical nodes per rank are shown. b, The distribution of selected GO terms across the human lineage of the LCA based on the analysis of human protein-containing clusters (abundance is normalized per GO category). c, Three example structures from the human clusters that are conserved across humans and bacteria, among the eukaryote GO-annotated clusters. A histone protein with a nucleus GO annotation, which was found to be conserved at the cellular organism level and supports the previously reported evolutionary connection between eukaryotic and bacterial histones (left). The human innate immunity genes *BPI* (middle) and *AIM2* (right) encode structurally similar proteins in bacterial species, highlighting the potential for cross-kingdom sharing of immunity-related proteins. *Acido. bacterium*, *Acidobacteria bacterium*; *Actino. bacterium*, *Actinomycetia bacterium*; ‘*Ca.* Bathyarchaeota archaeon’, ‘*Candidatus* Bathyarchaeota archaeon’; *D. bacterium*, *Deltaproteobacteria bacterium*; *H. pylori*, *Helicobacter pylori*; memb., membrane; *P. bacterium*, *Planctomycetes bacterium*; *R. irregularis*, *Rhizophagus irregularis*; *S. enterica*, *Salmonella enterica*; *T. cinerariifolium*, *Tanacetum cinerariifolium*.
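
The lowest common ancestor (LCA) of a cluster can be illustrated as the deepest taxon shared by the lineages of all its members. The sketch below assumes each lineage is given as a root-first list of taxon names; the lineages shown are simplified placeholders, not the taxonomy used for the figure.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-first lineages, or None."""
    if not lineages:
        return None
    lca = None
    # zip stops at the shortest lineage, so uneven depths are handled.
    for taxa_at_depth in zip(*lineages):
        if len(set(taxa_at_depth)) != 1:
            break
        lca = taxa_at_depth[0]
    return lca


# Simplified, illustrative lineages (assumption: not the full taxonomy).
human = ["cellular organisms", "Eukaryota", "Metazoa", "Homo sapiens"]
mouse = ["cellular organisms", "Eukaryota", "Metazoa", "Mus musculus"]
ecoli = ["cellular organisms", "Bacteria", "Proteobacteria", "Escherichia coli"]

lca_cross_kingdom = lowest_common_ancestor([human, ecoli])
lca_animals = lowest_common_ancestor([human, mouse])
```

A cluster containing both human and bacterial members resolves to the cellular-organism level in this toy example, matching the kind of assignment described for the histone cluster in panel c.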

**Fig. 4. Prediction of domain families by local structural similarity hits.**
a, Diagram of the structure-based domain family prediction method. Clustering of the start and end positions for Foldseek hits of one protein against all others was used to define potential domain boundary positions. Each predicted domain region was linked to the others sharing structural similarities and graph-based clustering was used to define domain families and interdomain similarity. b, The frequency distribution of the most common (n = 9,631) and the second most common (n = 1,628) Pfam annotations found among members of all predicted domain families. anno., annotation. c, The counts of the number of clusters with a given Pfam as the most frequent. d, The number of domain family clusters annotated to a Pfam, DUF or no domain annotation. e, The distribution of protein region length in the predicted domain families, stratified by their annotations: Pfam domain (n = 1,048,276), DUF (n = 72,798) and not annotated (n = 1,904,498). f, Non-redundant count of Pfam and DUF domain families found in the structure-based predicted families. g, The distribution of the number of structures found for each predicted domain family annotated with a known Pfam (n = 3,875) or DUF domain (n = 1,513). The top 6 Pfam annotations are highlighted using their abbreviations: Pkinase, protein kinase domain, PF00069; zf-C2H2, zinc finger, C2H2 type, PF00096; Ank_2, ankyrin repeats, PF12796; RVT_1, reverse transcriptase, PF00078; WD40, WD domain, G-beta repeat, PF00400; ABC_tran, ABC transporter, PF00005. The box plots in b, e and g show the median (centre line), the quartiles 1 and 3 (box limits) and 1.5 × the interquartile range (whiskers).
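
The boundary step in panel a can be illustrated with a small one-dimensional clustering of hit coordinates. The caption does not state which clustering method was used, so the gap-based grouping, the `max_gap` and `min_support` parameters, and the use of the median as the boundary position are all assumptions chosen for this sketch.

```python
from statistics import median


def boundary_candidates(hits, max_gap=5, min_support=2):
    """Group hit start/end positions that lie close together.

    hits: list of (start, end) residue positions of local hits on one protein.
    Returns the median position of every group with enough supporting hits.
    """
    positions = sorted(p for start, end in hits for p in (start, end))
    groups, current = [], []
    for p in positions:
        # Start a new group when the gap to the previous position is too large.
        if current and p - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(p)
    if current:
        groups.append(current)
    return [int(median(g)) for g in groups if len(g) >= min_support]


# Toy hits: three cover residues ~1-120, two cover ~125-260, one is an outlier.
toy_hits = [(1, 120), (3, 118), (2, 121), (125, 260), (127, 258), (40, 90)]
boundaries = boundary_candidates(toy_hits)
```

In this toy input the outlier hit (40, 90) contributes no boundary because its endpoints are not supported by other hits, while the recurring endpoints yield candidate boundaries near residues 2, 121 and 259.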

**Fig. 5. Examples of non-annotated domain families with structural similarity to annotated domain families.**
a, Frag1-like domains. Three clusters were found enriched for the Frag1 Pfam annotation that had structural similarity to one cluster enriched for a domain of unknown function (DUF998) and one cluster without annotations. b, Anthrax_toxA-like domains. A cluster enriched for the anthrax_toxA Pfam annotation was found with structural similarity to a cluster with no annotations. c, Two clusters without annotations were found with structural similarity to a cluster enriched for the gasdermin Pfam annotations. Cluster 2 gasdermin N-terminal domain structures reveal homology to human gasdermin E. The corresponding structural characteristics are highlighted. Some gasdermin domains were found fused to protease domains (UniProt: A0A2C5ZLK3). The bacterial gasdermin structure (PDB: 7N51) is similar to novel gasdermin domains from non-annotated cluster 2. The third cluster revealed homology to both animal and bacterial gasdermins.

**Extended Data Fig. 1. The five-step clustering pipeline for efficiently clustering millions of protein structures using Foldseek’s 3Di alphabet.**
(1) Protein structures are converted to 3Di sequences and processed through the Linclust workflow. (2) For each sequence, 300 min-hashing k-mers are extracted and sorted. (3) The longest structure is assigned to be the centre of each k-mer cluster. (4) Structural alignment is performed in two stages: first, an ungapped alignment based on shared diagonal information is performed and hits are pre-clustered; second, the remaining sequences are aligned using Foldseek’s structural Smith-Waterman. (5) The remaining structures meeting alignment criteria are clustered using MMseqs2’s clustering module. After the Linclust step, the centroids are successively clustered by three cascaded steps of prefiltering, structural Smith-Waterman alignment and clustering using Foldseek’s search.
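
Steps (2) and (3) can be sketched as follows. This is a toy illustration of min-hash k-mer selection and centre assignment, not the Linclust code: the CRC32 hash, the k-mer length of 3, the function names and the toy 3Di-like strings are assumptions. Only the figure of 300 selected k-mers per sequence and the rule that the longest sequence becomes the centre come from the caption.

```python
import zlib
from collections import defaultdict


def min_hash_kmers(seq, k=3, n=300):
    """Return up to n distinct k-mers of seq with the smallest hash values."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # CRC32 stands in for the real hash function (assumption).
    return sorted(kmers, key=lambda km: (zlib.crc32(km.encode()), km))[:n]


def assign_centres(seqs, k=3, n=300):
    """Group sequences by shared selected k-mers; the longest member is the centre."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        for km in min_hash_kmers(seq, k, n):
            groups[km].append(name)
    centres = {}
    for names in groups.values():
        centre = max(names, key=lambda nm: (len(seqs[nm]), nm))
        for nm in names:
            if nm != centre:
                # Keep the first centre found; only these pairs would be aligned.
                centres.setdefault(nm, centre)
    return centres


# Toy 3Di-like strings, for illustration only.
toy = {"long": "DVDVLQACDPLVVV", "short": "DVDVLQACD", "other": "PPSSNNRRKK"}
centres = assign_centres(toy)
```

Because each sequence is compared only with the centre of the k-mer groups it falls into, the number of alignments grows roughly linearly with the number of sequences instead of quadratically.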

**Extended Data Fig. 2. Relationship of mean pairwise Pfam consistency to cluster features. These graphs are plotted with 1,004,422 clusters with at least two Pfam annotated sequences.**
(a) We analysed Pfam consistency of clusters binned by their member count. These bins represent Pfam annotated non-singleton clusters at rates of 19.2%, 13.5%, 9.5%, 12.6%, 11.0%, 12.4%, 11.8% and 10.0% from left to right, respectively. (b) We analysed Pfam consistency of clusters binned by the LDDT of each cluster. These bins represent Pfam annotated non-singleton clusters equally.
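
A mean pairwise consistency score can be illustrated as below. The caption does not define the score, so this sketch makes an explicit assumption: a pair of annotated members counts as consistent when their Pfam sets share at least one family, and the cluster score is the fraction of consistent pairs. The published definition may differ.

```python
from itertools import combinations


def mean_pairwise_consistency(annotations):
    """Fraction of member pairs sharing at least one Pfam family.

    annotations: list of sets of Pfam accessions, one set per annotated member.
    Returns None when fewer than two annotated members are given.
    """
    pairs = list(combinations(annotations, 2))
    if not pairs:
        return None
    # Assumption: a pair is consistent if the two sets overlap at all.
    consistent = sum(1 for a, b in pairs if a & b)
    return consistent / len(pairs)


mixed = [{"PF00069"}, {"PF00069", "PF00096"}, {"PF00400"}]
uniform = [{"PF00005"}, {"PF00005"}, {"PF00005"}]
mixed_score = mean_pairwise_consistency(mixed)
uniform_score = mean_pairwise_consistency(uniform)
```

Under this definition, a cluster is 100% consistent only when every pair of annotated members shares a family, as in the `uniform` example.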

**Extended Data Fig. 3. Relationship of mean pairwise EC number consistency to LDDT of cluster.**
These graphs are plotted with 113,287 clusters with at least two Enzyme Commission number annotated sequences. Each panel describes EC consistency compared at 1 to 4 classes. Each bin in a panel represents EC annotated non-singleton clusters equally.

**Extended Data Fig. 4. Examples of non-compact AlphaFold2 predicted structures.**
Examples of representative structures of clusters without annotations having pLDDT>90 and a predicted pocket covering over 80% of the residues of the structure.

**Extended Data Fig. 5. Top predicted molecular functions in all 712k dark clusters with DeepFRI scores greater than 0.5.**
The graph displays the most frequent molecular functions predicted by DeepFRI with prediction scores above 0.5 across all 712k dark clusters, highlighting the prevalence of the keyword “transmembrane”. Only 98,882 (13.9%) of the 712k clusters have a prediction score greater than 0.5.

**Extended Data Fig. 6. LCA plot of the clusters that contain *Homo sapiens* proteins.**
Lowest common ancestor Sankey plot generated by Pavian for all clusters containing human proteins.

**Extended Data Fig. 7. Additional examples of human-related proteins in structural clusters with representatives or partial matches in bacterial species.**
(a) We found bacterial structures related to the human CD4-like protein B4E1T0. The human protein (B4E1T0) has three Pfam annotations (PF05790, PF09191 and PF12104), all of which are specific to eukaryotes. In contrast, the bacterial protein (A0A1F4ZDN5) has no Pfam annotation. (b) The human protein B4DKH6 is a bactericidal permeability-increasing protein. The *E. coli* protein (P0AB26) has a similar structure, contains a Pfam domain of unknown function (DUF), and has an experimentally determined structure (PDB: 3l6i B).

**Extended Data Fig. 8. Comparison of predicted structures of homologous proteins: *Lachnospiraceae* bacterium to *Clostridium*.**
(a) pLDDT and multiple-sequence-alignment coverage output produced by ColabFold for the prediction of the protein sequence of *Lachnospiraceae*. (b) The predicted structure of RJW57900.1. (c) Superposition of the *Clostridium* protein structure with that of *Lachnospiraceae*, showing that the DNA-binding domain superposes well.

Comment in

Large-scale clustering of AlphaFold2 3D models shines light on the structure and function of proteins.
Bordin N, Lau AM, Orengo C. Mol Cell. 2023 Nov 16;83(22):3950-3952. doi: 10.1016/j.molcel.2023.10.039. Epub 2023 Nov 16. PMID: 37977115

References

1. Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
2. Varadi M, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439–D444. doi: 10.1093/nar/gkab1061.
3. Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754.
4. Chowdhury R, et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 2022;40:1617–1623. doi: 10.1038/s41587-022-01432-w.
5. Terwilliger TC, et al. AlphaFold predictions: great hypotheses but no match for experiment. Preprint at bioRxiv. 2022. doi: 10.1101/2022.11.21.517405.
