Nature. 2023 Oct;622(7983):637-645.
doi: 10.1038/s41586-023-06510-w. Epub 2023 Sep 13.

Clustering predicted structures at the scale of the known protein universe

Inigo Barrio-Hernandez et al. Nature. 2023 Oct.

Abstract

Proteins are key to all cellular processes, and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy [1], and over 214 million predicted structures are available in the AlphaFold database [2]. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we have clustered all of the structures in the AlphaFold database, identifying 2.30 million non-singleton structural clusters, of which 31% lack annotations, representing probable previously undescribed structures. Clusters without annotation tend to have few representatives, covering only 4% of all proteins in the AlphaFold database. Evolutionary analysis suggests that most clusters are ancient in origin but 4% seem to be species specific, representing lower-quality predictions or examples of de novo gene birth. We also show how structural comparisons can be used to predict domain families and their relationships, identifying examples of remote structural similarity. On the basis of these analyses, we identify several examples of human immune-related proteins with putative remote homology in prokaryotic species, illustrating the value of this resource for studying protein function and evolution across the tree of life.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. The AFDB, structural clustering workflow and summary of the clusters.
a, The AFDB started as a collaborative effort between EMBL-EBI and DeepMind in 2021. The database grew in multiple stages, with the latest version of 2022 containing over 214 million predicted protein structures and their confidence metrics. b, A two-step approach was used to cluster proteins in the database. First, MMseqs2 was used to cluster 214 million UniProtKB protein sequences (AFDB) on the basis of 50% sequence identity and 90% sequence overlap, resulting in a reduction of the database size to 52 million clusters (AFDB50). For each cluster, the protein with the highest pLDDT score was selected as the representative. Next, using Foldseek, the representative structures were clustered into 18.8 million clusters (Foldseek clusters) without a sequence identity threshold, but still enforcing a 90% sequence overlap and an E-value of less than 0.01 for each structural alignment. As the last step, we removed all sequences labelled as fragments from the clustering, ending up with 2.30 million clusters with at least two structures (AFDB clusters). c, AFDB cluster structural and Pfam consistency. Our clusters have a median LDDT of 0.77 and a median TM score of 0.71 across all clusters and 66.5% of clusters with Pfam annotations are 100% consistent. d, Summary of sequences and clusters with and without annotation (left) and the relationship of cluster sizes to annotation (right). From left to right, each bin occupies AFDB clusters at rates of 12.24%, 10.59%, 9.20%, 10.07%, 10.46%, 10.05%, 9.04%, 9.20%, 9.19% and 9.96%.
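The two rules stated in panel b (pick the highest-pLDDT member as representative; accept a structural alignment only with 90% overlap and an E-value below 0.01) can be sketched as follows. This is a minimal illustration, not the MMseqs2 or Foldseek implementation. The `Member` record, the function names, and the reading of "90% sequence overlap" as coverage of both aligned chains are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical record for one predicted structure in a sequence cluster."""
    accession: str
    plddt: float  # mean pLDDT of the predicted structure
    length: int   # number of residues


def pick_representative(members):
    """Return the cluster member with the highest pLDDT."""
    return max(members, key=lambda m: m.plddt)


def accept_alignment(aligned_len, query_len, target_len, evalue,
                     min_overlap=0.9, max_evalue=0.01):
    """Accept a hit only if it covers both chains and is significant.

    Assumption: overlap is measured as aligned length divided by the
    length of each chain, and both ratios must reach min_overlap.
    """
    covers_query = aligned_len / query_len >= min_overlap
    covers_target = aligned_len / target_len >= min_overlap
    return covers_query and covers_target and evalue < max_evalue


cluster = [Member("X1", 71.2, 300), Member("X2", 88.5, 310), Member("X3", 64.0, 295)]
representative = pick_representative(cluster)
```

With the made-up cluster above, `representative` is the `X2` record, since it has the highest pLDDT.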
Fig. 2. Putative novel enzymes and small-molecule-binding proteins in structures lacking annotation.
a, Counts of the GO molecular function terms most often predicted by DeepFRI on the selected set of 1,707 structures with predicted pockets. b–d, Examples of structures (A0A849TG76 and A0A2D8BRH7 (b), A0A849ZK06 (c) and S0EUL8 (d)) with predicted pockets and functional annotations. Each example shows the UniProt ID (top), the highest-scoring DeepFRI function prediction (bottom) and the top-scoring pocket (pink surface). The structures are coloured by residue-level contributions to the DeepFRI function predictions, ranging from blue (no contribution) to yellow (strong contribution).
Fig. 3. Evolutionary distribution of clusters and human-centric cluster analysis.
a, Visualization of the LCA of all non-singleton clusters as a Sankey plot produced by Pavian. Only the largest 13 taxonomical nodes per rank are shown. b, The distribution of selected GO terms across the human lineage of the LCA based on the analysis of human protein-containing clusters (abundance is normalized per GO category). c, Three example structures from the human clusters that are conserved across humans and bacteria, among the eukaryote GO-annotated clusters. A histone protein with a nucleus GO annotation, which was found to be conserved at the cellular organism level and supports the previously reported evolutionary connection between eukaryotic and bacterial histones (left). The human innate immunity genes BPI (middle) and AIM2 (right) encode structurally similar proteins in bacterial species, highlighting the potential for cross-kingdom sharing of immunity-related proteins. Acido. bacterium, Acidobacteria bacterium; Actino. bacterium, Actinomycetia bacterium; ‘Ca. Bathyarchaeota archaeon’, ‘Candidatus Bathyarchaeota archaeon’; D. bacterium, Deltaproteobacteria bacterium; H. pylori, Helicobacter pylori; memb., membrane; P. bacterium, Planctomycetes bacterium; R. irregularis, Rhizophagus irregularis; S. enterica, Salmonella enterica; T. cinerariifolium, Tanacetum cinerariifolium.
Fig. 4. Prediction of domain families by local structural similarity hits.
a, Diagram of the structure-based domain family prediction method. Clustering of the start and end positions for Foldseek hits of one protein against all others was used to define potential domain boundary positions. Each predicted domain region was linked to the others sharing structural similarities and graph-based clustering was used to define domain families and interdomain similarity. b, The frequency distribution of the most common (n = 9,631) and the second most common (n = 1,628) Pfam annotations found among members of all predicted domain families. anno., annotation. c, The counts of the number of clusters with a given Pfam as the most frequent. d, The number of domain family clusters annotated to a Pfam, DUF or no domain annotation. e, The distribution of protein region length in the predicted domain families, stratified by their annotations: Pfam domain (n = 1,048,276), DUF (n = 72,798) and not annotated (n = 1,904,498). f, Non-redundant count of Pfam and DUF domain families found in the structure-based predicted families. g, The distribution of the number of structures found for each predicted domain family annotated with a known Pfam (n = 3,875) or DUF domain (n = 1,513). The top 6 Pfam annotations are highlighted using their abbreviations: Pkinase, protein kinase domain, PF00069; zf-C2H2, zinc finger, C2H2 type, PF00096; Ank_2, ankyrin repeats, PF12796; RVT_1, reverse transcriptase, PF00078; WD40, WD domain, G-beta repeat, PF00400; ABC_tran, ABC transporter, PF00005. The box plots in b, e and g show the median (centre line), the quartiles 1 and 3 (box limits) and 1.5 × the interquartile range (whiskers).
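The boundary-finding idea in panel a, grouping the start and end positions of many hits along one protein, can be illustrated with a simple one-dimensional grouping. The caption does not specify the clustering method, so the gap-based grouping below, the function name, and the `max_gap` and `min_support` parameters are all assumptions chosen for the sketch.

```python
from statistics import median


def boundary_candidates(positions, max_gap=5, min_support=3):
    """Group hit start/end positions and return one candidate per group.

    Assumption: positions closer than max_gap residues to their sorted
    neighbour belong to the same group, and a group needs at least
    min_support positions to count as a potential domain boundary.
    """
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    # Report the median position of each sufficiently supported group.
    return [int(median(g)) for g in groups if len(g) >= min_support]


# Made-up residue positions where hits against other proteins start or end.
hit_edges = [1, 2, 3, 100, 101, 103, 250]
candidates = boundary_candidates(hit_edges)
```

In this made-up input, the isolated position 250 is dropped because only one hit supports it.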
Fig. 5. Examples of non-annotated domain families with structural similarity to annotated domain families.
a, Frag1-like domains. Three clusters were found enriched for the Frag1 Pfam annotation that had structural similarity to one cluster enriched for a domain of unknown function (DUF998) and one cluster without annotations. b, Anthrax_toxA-like domains. A cluster enriched for the anthrax_toxA Pfam annotation was found with structural similarity to a cluster with no annotations. c, Two clusters without annotations were found with structural similarity to a cluster enriched for the gasdermin Pfam annotations. Cluster 2 gasdermin N-terminal domain structures reveal homology to human gasdermin E. The corresponding structural characteristics are highlighted. Some gasdermin domains were found fused to protease domains (UniProt: A0A2C5ZLK3). The bacterial gasdermin structure (PDB: 7N51) is similar to novel gasdermin domains from non-annotated cluster 2. The third cluster revealed homology to both animal and bacterial gasdermins.
Extended Data Fig. 1. The five-step clustering pipeline for efficiently clustering millions of protein structures using Foldseek’s 3Di alphabet.
(1) Protein structures are converted to 3Di sequences and processed through the Linclust workflow. (2) For each sequence, 300 min-hashing k-mers are extracted and sorted. (3) The longest structure is assigned to be the centre of each k-mer cluster. (4) Structural alignment is performed in two stages: first, an ungapped alignment based on shared diagonal information is performed and hits are pre-clustered; second, the remaining sequences are aligned using Foldseek’s structural Smith-Waterman. (5) The remaining structures meeting alignment criteria are clustered using MMseqs2’s clustering module. After the Linclust step, the centroids are successively clustered by three cascaded steps of prefiltering, structural Smith-Waterman alignment and clustering using Foldseek’s search.
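Steps (2) and (3) can be sketched in a few lines: keep the k-mers with the lowest hash values for each sequence, group sequences that share a selected k-mer, and take the longest member of each group as its centre. This is a toy illustration, not Linclust or Foldseek code. The k-mer length, the use of MD5 as a stand-in hash, the tie-breaking rule, and the function names are assumptions, and the input strings are placeholders rather than real 3Di sequences.

```python
import hashlib
from collections import defaultdict


def min_hash_kmers(seq, k=10, n=300):
    """Return up to n distinct k-mers of seq with the smallest hash values."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    # MD5 is only a stand-in for a fast hash function (assumption).
    return sorted(kmers, key=lambda km: hashlib.md5(km.encode()).hexdigest())[:n]


def kmer_group_centres(seqs, k=10, n=300):
    """Map each shared k-mer to the longest sequence that contains it.

    seqs: dict of name -> sequence. Ties in length are broken by name
    (assumption, to keep the result deterministic).
    """
    groups = defaultdict(list)
    for name, seq in seqs.items():
        for km in min_hash_kmers(seq, k, n):
            groups[km].append(name)
    return {
        km: max(names, key=lambda nm: (len(seqs[nm]), nm))
        for km, names in groups.items()
        if len(names) > 1  # only k-mers shared by at least two sequences
    }


toy = {"a": "ACDEFGHIKL", "b": "ACDEFGHIKLMN", "c": "WWWWWWWWWW"}
centres = kmer_group_centres(toy, k=5)
```

Here `a` and `b` share k-mers and `b` is longer, so `b` becomes the centre of every shared k-mer group, while `c` shares nothing and forms no group.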
Extended Data Fig. 2. Relationship of mean pairwise Pfam consistency to cluster features. These graphs are plotted with 1,004,422 clusters with at least two Pfam annotated sequences.
(a) We analysed Pfam consistency of clusters binned by their member count. These bins represent Pfam annotated non-singleton clusters at rates of 19.2%, 13.5%, 9.5%, 12.6%, 11.0%, 12.4%, 11.8% and 10.0% from left to right, respectively. (b) We analysed Pfam consistency of clusters binned by the LDDT of each cluster. These bins represent Pfam annotated non-singleton clusters equally.
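The caption does not define mean pairwise Pfam consistency. One plausible reading, used here purely as an assumption, is the fraction of annotated member pairs in a cluster that share at least one Pfam accession. The function name and input layout are hypothetical.

```python
from itertools import combinations


def mean_pairwise_consistency(annotations):
    """Fraction of member pairs sharing at least one Pfam accession.

    annotations: list with one set of Pfam accessions per annotated
    cluster member. The "share at least one accession" criterion is an
    assumption, not the definition used in the paper.
    """
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotated members")
    agreeing = sum(1 for a, b in pairs if a & b)
    return agreeing / len(pairs)


# Made-up cluster: two kinase-annotated members and one zinc-finger member.
score = mean_pairwise_consistency([{"PF00069"}, {"PF00069"}, {"PF00096"}])
```

Under this reading, a cluster is 100% consistent when every pair of annotated members agrees, which requires at least two annotated sequences, as in the caption.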
Extended Data Fig. 3. Relationship of mean pairwise EC number consistency to LDDT of cluster.
These graphs are plotted with 113,287 clusters with at least two Enzyme Commission number annotated sequences. Each panel describes EC consistency compared at 1 to 4 classes. Each bin in a panel represents EC annotated non-singleton clusters equally.
Extended Data Fig. 4. Examples of non-compact AlphaFold2 predicted structures.
Examples of representative structures of clusters without annotations having pLDDT>90 and a predicted pocket covering over 80% of the residues of the structure.
Extended Data Fig. 5. Top predicted molecular functions in all 712k dark clusters with DeepFRI scores greater than 0.5.
The graph displays the most frequent molecular functions predicted by DeepFRI with prediction scores above 0.5 across all 712k dark clusters, highlighting the prevalence of the keyword “transmembrane”. Only 98,882 (13.9%) out of the 712k have a prediction score greater than 0.5.
Extended Data Fig. 6. LCA plot of the clusters that contain Homo sapiens proteins.
Lowest common ancestor Sankey plot generated by Pavian for all clusters containing human proteins.
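For readers unfamiliar with the term, the lowest common ancestor of a cluster is the deepest taxonomic node shared by all of its members. A minimal sketch, assuming each member's lineage is given as a root-to-leaf list of names (the function name and the abbreviated lineages are illustrative, not taken from the paper's pipeline):

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages."""
    if not lineages:
        raise ValueError("need at least one lineage")
    lca = None
    # zip stops at the shortest lineage; walk down until the lineages diverge.
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            lca = ranks[0]
        else:
            break
    return lca


# Abbreviated, illustrative lineages for a human and a bacterial member.
human = ["cellular organisms", "Eukaryota", "Metazoa", "Homo sapiens"]
ecoli = ["cellular organisms", "Bacteria", "Escherichia coli"]
shared = lowest_common_ancestor([human, ecoli])
```

A cluster containing both lineages resolves to "cellular organisms", the level at which clusters conserved between humans and bacteria are placed.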
Extended Data Fig. 7. Additional examples of human related proteins in structural clusters with representatives or partial matches in bacterial species.
(a) We found bacterial structures related to the human CD4-like protein B4E1T0. The human protein (B4E1T0) has three Pfam annotations (PF05790, PF09191 and PF12104), all of which are specific to eukaryotes. In contrast, the bacterial protein (A0A1F4ZDN5) has no Pfam annotation. (b) The human protein (B4DKH6) is a bactericidal permeability-increasing protein. The E. coli protein (P0AB26) has a similar structure to the human protein, contains a Pfam domain of unknown function (DUF) and has an experimentally determined structure (PDB: 3l6i B).
Extended Data Fig. 8. Comparison of predicted structures of homologous proteins: Lachnospiraceae bacterium to Clostridium.
(a) pLDDT and multiple-sequence-alignment coverage output produced by ColabFold for the prediction of the Lachnospiraceae protein sequence. (b) The predicted structure of RJW57900.1. (c) Superposition of the Clostridium protein structure with the Lachnospiraceae structure; the DNA-binding domains superpose well.

References

    1. Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
    2. Varadi M, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439–D444. doi: 10.1093/nar/gkab1061.
    3. Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–876. doi: 10.1126/science.abj8754.
    4. Chowdhury R, et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 2022;40:1617–1623. doi: 10.1038/s41587-022-01432-w.
    5. Terwilliger TC, et al. AlphaFold predictions: great hypotheses but no match for experiment. Preprint at bioRxiv. 2022. doi: 10.1101/2022.11.21.517405.