Science. 2024 Nov;386(6721):eadq4946.
doi: 10.1126/science.adq4946. Epub 2024 Nov 1.

Exploring structural diversity across the protein universe with The Encyclopedia of Domains

Andy M Lau et al. Science. 2024 Nov.

Abstract

The AlphaFold Protein Structure Database (AFDB) contains more than 214 million predicted protein structures composed of domains, which are independently folding units found in multiple structural and functional contexts. Identifying domains can enable many functional and evolutionary analyses but has remained challenging because of the sheer scale of the data. Using deep learning methods, we have detected and classified every domain in the AFDB, producing The Encyclopedia of Domains. We detected nearly 365 million domains, over 100 million more than can be found by sequence methods, covering more than 1 million taxa. Reassuringly, 77% of the nonredundant domains are similar to known superfamilies, greatly expanding representation of their domain space. We uncovered more than 10,000 new structural interactions between superfamilies and thousands of new folds across the fold space continuum.

Conflict of interest statement

Competing interests

Authors declare that they have no competing interests.

Figures

Fig. 1. Overall workflow.
(a) i. 214 million AFDB target sequences are filtered at 100% sequence identity to avoid bias. This identifies 188 million non-redundant targets (TED-100) and a set of sequence-redundant targets (TED-redundant). ii. Both TED-100 and TED-redundant undergo automated domain parsing, with assignments derived from consensus among the three methods. iii. TED-100 domains are processed by MMseqs2, creating over 121 million clusters at 50% identity. Concurrently, domains are matched to CATH domains via Foldseek and Merizo-search and categorised as superfamily (C.A.T.H) matches, topology (C.A.T) matches, or no-matches. Domains found by Merizo-search nearest neighbour matches are considered topology-level matches. Clusters are annotated with CATH labels, creating partially labelled and unlabelled clusters. Low-quality domains in unlabelled clusters are filtered out. iv. The resulting domains undergo a new workflow for novel domain identification, involving clustering and database searches for matches to known structures. Poor-quality (non-protein-like) domains are identified using an in-house deep learning method (Methods). Novel domains are additionally scored on internal symmetry using the SymD program (Methods). (b) Full-length targets are subjected to automated domain parsing by Merizo, Chainsaw and UniDoc. Each assignment receives a consensus level according to how many methods agree: three (high), two (medium), or no consensus (low). Only high and medium consensus domains are analysed further. (c) Comparison of domains identified by sequence-based methods (Pfam and Gene3D) versus structure-based methods (TED). The "TED" count combines TED-100 and TED-redundant. (d) i. Domain length distribution and proportion of identified continuous (blue) and discontinuous (orange) domains. Inset shows the proportion of single-domain targets, multi-domain targets and targets with no identified domains (n=188,914,411). ii. Average plDDT distribution for TED-100 domains (n=324,389,697) across confidence bins: dark blue/very high (plDDT >= 90), blue/high (90 > plDDT >= 70), yellow/low (70 > plDDT >= 50), and orange/very low (plDDT < 50).
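The consensus rule in panel (b) can be illustrated with a short sketch. The caption only states how many methods must agree for each level; how agreement between two predicted domains is measured is not given here, so the Jaccard overlap on residue indices, the 0.8 threshold, and the function names below are all assumptions made for illustration, not the authors' procedure.

```python
from typing import Dict, List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Fraction of shared residues between two domains (assumed agreement measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def consensus_level(
    candidate: Set[int],
    other_calls: Dict[str, List[Set[int]]],
    min_overlap: float = 0.8,  # hypothetical threshold, not taken from the paper
) -> str:
    """Label a domain proposed by one method as high, medium or low consensus.

    `candidate` holds the residue indices of the proposed domain.
    `other_calls` maps each of the other methods to the domains it predicted.
    """
    supporters = 1  # the method that proposed the candidate
    for domains in other_calls.values():
        if any(jaccard(candidate, d) >= min_overlap for d in domains):
            supporters += 1
    if supporters >= 3:
        return "high"
    if supporters == 2:
        return "medium"
    return "low"


# Toy example: one method closely matches the candidate, the other does not.
candidate = set(range(1, 101))
others = {
    "method_b": [set(range(3, 101))],
    "method_c": [set(range(1, 60))],
}
print(consensus_level(candidate, others))
```

Under this sketch, only domains labelled "high" or "medium" would be carried forward, matching the filtering described in the caption.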
Fig. 2. Classification of TED domains using the CATH hierarchy.
(a) i. The top 100 superfamilies in TED-100 for each CATH class where more matches to CATH superfamilies have been identified via structural hits in TED, compared to sequence hits in Gene3D. ii. Proportion of domains matched to CATH classes (n=238,569,631). (b) Enrichment of superfamily representation in TED-100 compared to PDB and Gene3D. The top 5 superfamilies of each CATH class are shown, where enrichment in TED-100 compared to PDB is the greatest. Colour scale represents fold-change in superfamily representation in PDB and Gene3D compared to Gene3D and TED. A full list of fold names corresponding to the CATH superfamily codes can be found in Table S2. (c) Expansion of CATH superfamilies to new superkingdoms in TED. Plot shows the number of unique superfamilies found in each superkingdom (across the 653,460 taxa of TED-100) according to Gene3D and TED assignments. Each column along the horizontal axis depicts the number of superfamilies that are exclusive to a single superkingdom when only considering Gene3D assignments, but are expanded into one, two or three additional superkingdoms in TED. Only superfamilies where Gene3D domains are exclusive to a single superkingdom are shown (n=1061). (d) Exclusivity of CATH topologies across superkingdoms. PCA of normalised CATH topology counts across five superkingdoms: eukarya (Eu), bacteria (Ba), archaea (Ar), unclassified and other sequences (Un). The ‘mixed’ category comprises topologies found in roughly equal proportions in Eu/Ba domains. Examples of superkingdom-exclusive topologies are shown for each category.
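The counting behind panel (c) can be sketched as a set comparison. This is a minimal illustration that assumes each superfamily is mapped to the set of superkingdoms in which it is observed; the input layout, the function name and the placeholder identifiers are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Dict, Set


def expansion_counts(
    gene3d: Dict[str, Set[str]],
    ted: Dict[str, Set[str]],
) -> Counter:
    """Count superfamilies by how many extra superkingdoms they gain in TED.

    Only superfamilies that Gene3D places in exactly one superkingdom are
    considered. A key of 0 means no expansion was observed.
    """
    counts: Counter = Counter()
    for superfamily, gene3d_kingdoms in gene3d.items():
        if len(gene3d_kingdoms) != 1:
            continue  # not exclusive to a single superkingdom in Gene3D
        gained = ted.get(superfamily, set()) - gene3d_kingdoms
        counts[len(gained)] += 1
    return counts


# Toy data with placeholder superfamily identifiers.
gene3d = {"sf_A": {"Ba"}, "sf_B": {"Eu"}, "sf_C": {"Ba", "Eu"}}
ted = {"sf_A": {"Ba", "Eu", "Ar"}, "sf_B": {"Eu", "Ba"}}
print(expansion_counts(gene3d, ted))
```

Each key of the returned counter would correspond to one column of the plot: the number of additional superkingdoms a superfamily reaches once TED assignments are included.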
Fig. 3. Examples of high-symmetry domains and extruded repeats.
These domains were found by the novel domain identification pipeline and flagged as having high internal symmetry by scoring with the SymD program (Methods). Extruded repeats are domains with a high number of ordered cyclical repeats projecting along one axis. Colouration follows plDDT confidence bins as per the AFDB (dark blue/very high: plDDT >= 90, blue/high: 90 > plDDT >= 70, yellow/low: 70 > plDDT >= 50 and orange/very low: plDDT < 50).
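The confidence bins quoted in the caption translate directly into a small lookup. The sketch below uses the thresholds and colours stated above; the function names, and the choice to summarise a domain by the plain mean of its per-residue values, are assumptions for illustration.

```python
from statistics import mean
from typing import Sequence

# Colours as listed in the caption for each confidence bin.
BIN_COLOURS = {
    "very high": "dark blue",
    "high": "blue",
    "low": "yellow",
    "very low": "orange",
}


def plddt_bin(plddt: float) -> str:
    """Map a plDDT value (0 to 100) to its confidence bin."""
    if not 0 <= plddt <= 100:
        raise ValueError("plDDT must lie between 0 and 100")
    if plddt >= 90:
        return "very high"
    if plddt >= 70:
        return "high"
    if plddt >= 50:
        return "low"
    return "very low"


def domain_confidence(residue_plddts: Sequence[float]) -> str:
    """Bin a domain by the mean of its per-residue plDDT values (assumed summary)."""
    return plddt_bin(mean(residue_plddts))


label = domain_confidence([92.5, 88.0, 75.5])
print(label, BIN_COLOURS[label])
```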
Fig. 4. Examples of novel domain clusters identified in TED.
(a) Comparison of domain novelty score versus sequence cluster size (n=7427). Novelty scores are predicted by the Foldclass algorithm where novel domains are ranked with a score close to 100. (b) Taxonomic distribution of novel domain clusters (for all sequence cluster members; n=483,732). Largest common phyla are shown across superkingdoms along with the number of domains in sequence clusters assigned to each level of the hierarchy. (c) Subpanels i-v correspond to labels shown in panel (a). In panel i, the bottom sub-panel shows the arrangement of strands that form the coiled hairpin loop from the N-terminus (blue) to C-terminus (red). The quoted cluster size represents the number of identified homologues at the sequence cluster level. Labels denoting superkingdoms correspond to panel (b) and represent the superkingdom that all cluster members belong to. The cluster is distributed across multiple superkingdoms when multiple labels are shown. (d) Examples of high-novelty structures. In panel ii, the bottom sub-panel shows the arrangement of helices that form the coiled hairpin loop from the N-terminus (blue) to C-terminus (red). Asterisks denote where organism names have been shortened: iii. Marinomonas mediterranea (strain ATCC 700492 / JCM 21426 / NBRC 103028 / MMB-1), v. Globisporangium ultimum (strain ATCC 200006 / CBS 805.95 / DAOM BR144) (Pythium ultimum). (e) Novel folds with predicted functions. i. Example of a domain predicted to have nucleic acid and zinc binding properties. Potential zinc binding site residues are highlighted as sticks. The left-hand site is composed of 2 Cys and 2 His residues, whereas the right-hand site has 3 Cys and 1 His in a tetrahedral arrangement. ii. Example of a heme binding domain. The residues of the heme c binding motif are highlighted.
Fig. 5. Interacting superfamily pairs (ISPs).
(a) Enrichment of the number of instances of ISPs common to the CATH and TED datasets, expressed as log2(fold change) (n=3070). (b) i. Alignment procedure used to compute CIO values for an ISP. One domain in each instance of each ISP is used as a reference and aligned to a designated ‘master’ reference domain structure. The rotation and translation from each alignment are applied to the second, ‘tag-along’ domain to bring all domain pairs into a common frame. Vectors are then computed between the centres of mass of each pair of domains, and used to compute the CIO measure (see Methods). ii. Comparison of CIO values for ISPs common to CATH and TED. Most ISPs show a high degree of conservation in interaction patterns. (c) i. Hierarchical edge bundling plots illustrating differences in domain superfamily interaction patterns between CATH (left) and TED (right). Curves in the plots connect interacting superfamilies. Hubs are marked by medium (4-7 connections) and large circles (>7 connections) on the outer rim. ii. Comparison of hub domains in CATH and TED. The heatmap compares CATH superfamilies in CATH and TED as hubs, categorised as small (<4 connections), medium (5-7), and large (≥8). Hub thresholds used are from Ekman et al. (33). (d) Two examples of new hub superfamilies in TED, with groups of domains for interacting superfamilies placed in a common frame and represented as volumes, alongside chains involved in each group and a graph representation of each hub. The sets of interactions for superfamily 3.40.50.10850 (NtrC-like protein domain) are disjoint between CATH and TED, whereas the set of TED interactions for superfamily 2.40.30.200 (Distal tail protein domain) includes that seen in CATH (orange and green). A decomposed view of d.i. appears in Fig. S15.
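The common-frame step in panel (b) i can be sketched as follows. The sketch assumes the rotation matrix and translation vector have already been obtained by aligning the reference domain onto the master structure with some structural aligner, and it approximates the centre of mass by the unweighted mean of the coordinates. All names are hypothetical, and the CIO measure itself is defined in the paper's Methods and is not reproduced here.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]
Matrix = Sequence[Sequence[float]]  # 3x3 rotation matrix, row-major


def centroid(coords: Sequence[Vec]) -> Vec:
    """Unweighted mean position (stand-in for the centre of mass)."""
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))


def apply_transform(coords: Sequence[Vec], rotation: Matrix, translation: Vec) -> List[Vec]:
    """Apply x' = R x + t to every coordinate."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in coords
    ]


def tag_along_vector(
    reference: Sequence[Vec],
    tag_along: Sequence[Vec],
    rotation: Matrix,
    translation: Vec,
) -> Vec:
    """Vector from the reference centroid to the tag-along centroid in the common frame.

    The same rigid transform (from aligning the reference to the master)
    is applied to both domains of the pair.
    """
    ref_centre = centroid(apply_transform(reference, rotation, translation))
    tag_centre = centroid(apply_transform(tag_along, rotation, translation))
    return tuple(t - r for r, t in zip(ref_centre, tag_centre))


# Toy pair: a 90 degree rotation about z plus a shift along x.
rotation = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
translation = (1.0, 0.0, 0.0)
reference = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
tag_along = [(4.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
print(tag_along_vector(reference, tag_along, rotation, translation))
```

Because every pair is expressed relative to the same master structure, the resulting vectors from different instances of an ISP can be compared directly, which is what the conservation measure is computed from.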

References

    1. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, Žídek A, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439–D444. doi: 10.1093/nar/gkab1061.
    2. Varadi M, Bertoni D, Magana P, Paramval U, Pidruchna I, Radhakrishnan M, Tsenkov M, Nair S, Mirdita M, Yeo J, Kovalevskiy O, et al. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 2023 doi: 10.1093/nar/gkad1011.
    3. Borkakoti N, Thornton JM. AlphaFold2 protein structure prediction: Implications for drug discovery. Curr Opin Struct Biol. 2023;78:102526. doi: 10.1016/j.sbi.2022.102526.
    4. Durairaj J, Waterhouse AM, Mets T, Brodiazhenko T, Abdullah M, Studer G, Tauriello G, Akdel M, Andreeva A, Bateman A, Tenson T, et al. Uncovering new families and folds in the natural protein universe. Nature. 2023;622:646–653. doi: 10.1038/s41586-023-06622-3.
    5. Barrio-Hernandez I, Yeo J, Jänes J, Mirdita M, Gilchrist CLM, Wein T, Varadi M, Velankar S, Beltrao P, Steinegger M. Clustering predicted structures at the scale of the known protein universe. Nature. 2023;622:637–645. doi: 10.1038/s41586-023-06510-w.
