Nature. 2023 Oct;622(7983):646-653.
doi: 10.1038/s41586-023-06622-3. Epub 2023 Sep 13.

Uncovering new families and folds in the natural protein universe


Janani Durairaj et al. Nature. 2023 Oct.

Abstract

We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database [1]. These models cover nearly all known proteins, including those that are challenging to annotate for function or putative biological role using standard homology-based approaches. In this study, we examine the extent to which the AlphaFold database has structurally illuminated this 'dark matter' of the natural protein universe at high predicted accuracy. We further describe the protein diversity that these models cover as an annotated interactive sequence similarity network, accessible at https://uniprot3d.org/atlas/AFDB90v4. By searching for novelties from sequence, structure and semantic perspectives, we uncovered the β-flower fold, added several protein families to the Pfam database [2] and experimentally demonstrated that one of these belongs to a new superfamily of translation-targeting toxin-antitoxin systems, TumE-TumA. This work underscores the value of large-scale efforts in identifying, annotating and prioritizing new protein families. By leveraging the recent deep learning revolution in protein bioinformatics, we can now shed light on uncharted areas of the protein universe at an unprecedented scale, paving the way to innovations in life sciences and biotechnology.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. General workflow for the collection, classification and mapping of functionally dark proteins in UniProt and AFDB.
a, Starting from the clusters in UniRef50, we collected the functional annotations for all included UniProtKB and UniParc entries, including domain (D), coiled-coil (CC) and intrinsically disordered (IDP) predictions, and excluding all those with ‘putative’, ‘hypothetical’, ‘uncharacterized’ or ‘DUF’ in their names. Cx corresponds to the coverage of an annotation and Ci to the functional brightness across the entire sequence. We selected the protein with the highest full-length annotation coverage (that is, brightness, Ci) as the functional representative of each cluster. b, From the collected UniRef50 clusters, we selected those with a structural representative with pLDDT greater than 90 in AFDB v.4, and constructed a large-scale sequence similarity network by all-against-all MMseqs2 searches, representing the sequence landscape of more than 6 million UniRef50 clusters.
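The brightness measure described in panel a can be illustrated with a minimal sketch. It assumes that each annotation is a half-open residue interval, that overlapping annotations are merged before counting, and that brightness is the percentage of residues covered by at least one annotation. The function names and data layout are hypothetical and are not taken from the authors' pipeline.

```python
from typing import Dict, List, Tuple

# Half-open, 0-based residue range [start, end). This layout is an assumption.
Interval = Tuple[int, int]


def brightness(length: int, annotations: List[Interval]) -> float:
    """Percentage of residues covered by at least one annotation."""
    if length <= 0:
        return 0.0
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(annotations):
        # Clip to the sequence and skip any part that was already counted.
        start = max(start, current_end, 0)
        end = min(end, length)
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / length


def pick_representative(cluster: Dict[str, Tuple[int, List[Interval]]]) -> str:
    """Return the accession with the highest brightness in a cluster.

    `cluster` maps a (hypothetical) accession to (sequence length, annotations).
    """
    return max(cluster, key=lambda acc: brightness(*cluster[acc]))
```

For example, a 100-residue protein annotated over residues 0 to 50 and 25 to 75 would score 75% under these assumptions, because the overlap is counted once.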
Fig. 2. Large-scale sequence similarity network for more than 6 million UniRef50 cluster representatives with high predicted accuracy models in AFDB (AFDB90).
a, Layout of the resulting network, as computed with Cosmograph (https://cosmograph.app/). The network contained 4,270,404 nodes connected by 10,339,158 edges, reduced for simplicity to a set of 688,852 communities connected by a total of 1,488,764 edges (see Methods section ‘Large-scale sequence similarity network’ for details). The 1,865,917 UniRef50 clusters that did not connect to any other in the MMseqs2 searches were excluded. Only the 473,612 communities that have at least one inbound or outbound edge (degree of at least 1) are shown in the figure. Nodes are coloured by the average functional brightness of the UniRef50 clusters included in the corresponding community. An interactive version is available at https://uniprot3d.org/atlas/AFDB90v4. b, Histograms of functional brightness content for connected components with more than 50,000 nodes and with only two to five nodes (UniRef50 clusters), highlighting their different darkness content. c, Scatter plot of the component size (that is, number of UniRef50 clusters) cut-off and the percentage of functionally dark UniRef50 clusters. d, Histogram of the average (avg.) brightness per connected component. e,f, Size distribution for fully dark connected components (e, average brightness less than 5%) and fully bright connected components (f, average brightness more than 95%).
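As a rough illustration of panels d to f, the sketch below groups nodes into connected components with a union-find structure and labels each component by its average brightness, using the 5% and 95% cut-offs stated in the caption. The function names, the "mixed" label and the input layout are assumptions made for this example; the actual network was built from MMseqs2 searches and is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def connected_components(nodes: Iterable[str],
                         edges: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group nodes into connected components (edges must use known nodes)."""
    parent = {n: n for n in nodes}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = defaultdict(list)
    for n in parent:
        groups[find(n)].append(n)
    return [sorted(g) for g in groups.values()]


def classify(component: List[str], brightness: Dict[str, float],
             dark_below: float = 5.0, bright_above: float = 95.0) -> str:
    """Label a component by its average brightness (in percent)."""
    avg = sum(brightness[n] for n in component) / len(component)
    if avg < dark_below:
        return "dark"
    if avg > bright_above:
        return "bright"
    return "mixed"  # hypothetical label for everything in between
```

Singleton components can be dropped afterwards with a simple length filter, mirroring the exclusion of unconnected clusters described in panel a.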
Fig. 3. Connected component 27 is a new family in a well-studied superfamily of transmembrane glycosyltransferases.
a, High-resolution sequence similarity network for 7,004 homologues of the sequences in component 27, computed with CLANS at an E value threshold of 1 × 10^−20. Points represent individual proteins and grey lines BLASTp matches at an E value < 1 × 10^−20. Individual clusters are coloured and labelled according to their representative members. Only YfhO-like and STT3/PglB sequences are highlighted, with grey dots depicting other homologous groups. AglB corresponds to the PglB/STT3-like sequences from archaea. Black dots depict those sequences that make up component 27 in our network, and white dots mark those that are bright. b, Predicted structural models as in AFDB v.4 for the representative of component 27 (C27, UniProt ID A0A7X7MB17) and YfhO (UniProt ID YFHO_BACSU), and experimental structures of the PglB (PDB ID 6GXC, chain A) and STT3 (PDB ID 7OCI, chain F) cluster representatives. Models are coloured according to their corresponding cluster in a. The membrane regions, as predicted with the PPM v.3.0 server, are marked by dashed lines.
Fig. 4. Connected component 159 is a new toxin in the hitherto undescribed TA superfamily TumE–TumA.
a, High-resolution sequence similarity network for 2,453 homologues of the sequences in component 159, computed with CLANS (E value cut-off of 1 × 10^−10). Points represent proteins and grey lines BLASTp matches (E value < 1 × 10^−4). Individual subclusters are labelled 1–7 and subclusters a–c. The consensus genomic contexts, as identified by GCsnap, are shown with different flanking families coloured from blue to red. b, A 3D model of the complex between the putative toxin and antitoxin from A. tepidum strain NZ, modelled with AlphaFold-Multimer, highlighting the regions where DNA is predicted to interact with the antitoxin. c, Structural model of A. tepidum TumE/DUF6516 toxin (EntrezID WP_213381069.1) coloured according to the two most frequent molecular functions predicted for 100 homologues with DeepFRI. Residues responsible for the predictions are highlighted in red. The percentage reflects the frequency of the highlighted prediction. d, Validation of tumEtumA. Plasmids for expression of putative toxins (pBAD33 derivatives) were cotransformed into E. coli BW25113 cells with antitoxin expression plasmids or the empty pMG25 vector. Bacteria were grown for 5 h in liquid LB medium supplemented with appropriate antibiotics and 0.2% glucose. The cultures were normalized to OD600 = 1.0, serially diluted and spotted on LB plates containing appropriate antibiotics, 0.2% arabinose for toxin induction and 500 µM IPTG for antitoxin induction. The plates were scored after an overnight incubation at 37 °C. e, Metabolic labelling assays with E. coli BW25113 expressing A. tepidum TumE/DUF6516 toxin. Error bars indicate the standard error of the arithmetic mean. All experiments shown in d and e were performed as n = 3 biologically independent replicates (individual independent cultures). All repetitions of the experiments shown in d yielded similar results.
Fig. 5. Structural outliers can represent fragments, repetitive proteins, proteins requiring folding conditions out of the scope of AlphaFold2 or new folds.
a,b, Distribution of brightness, shape-mer diversity and length of the structural outliers (a) and the same number of structural inliers (b) with the most positive outlier scores. Shape-mer diversity is defined as the number of unique shape-mers divided by the length of the protein. c, An AFDB model of TonB-dependent receptor-like protein that is a fragment of the β-barrel domain. More than 16,500 proteins across 1,258 components have this annotation, of which 86% are fully bright. From these, 82% have fewer than the required number of β-sheet shape-mers, despite 55% not being explicitly annotated as fragments in UniProtKB. d, Two long repetitive outliers, one belonging to the PE-PGRS superfamily (G0TGH8), thought to be new folds and found widely in mycobacteria, and one to the Tetratricopeptide-like helical domain superfamily (A0A015IZK3) in which the median PDB structure length of structures with resolution less than 3 Å is only 370. e, AFDB model annotated as containing ‘putative type VI secretion system, Rhs element associated Vgr domain’ (A0A377W562), a trimeric PDB structure (PDB ID 6SK0) also containing this domain and an AlphaFold-Multimer model of the A0A377W562 trimer that has 1.1 Å r.m.s.d. to the PDB structure. The AFDB model does not resemble the PDB structure because these proteins form obligate complexes and adopt a trimeric β-solenoid fold. f, AlphaFold models of different variations of the β-flower, with positively charged residues in red and phenylalanine in green for A0A494VZL1, and PDB structure of the human Tubby C-terminal domain (PDB ID 2FIM). Black arrows indicate the circularly permuted loop in A0A0S7BXY3 and PDB ID 1ZXU. g, AlphaFold model of A0A0S7BXY3 and PDB structure of Arabidopsis thaliana putative phospholipid scramblase (PDB ID 1ZXU). Black arrows indicate the circularly permuted loop.
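The shape-mer diversity defined in panels a and b is a simple ratio, sketched below. The example assumes that a protein's shape-mers are available as a list of integer identifiers and that the protein length is given separately in residues; both the function name and this representation are hypothetical.

```python
from typing import Sequence


def shape_mer_diversity(shape_mers: Sequence[int], protein_length: int) -> float:
    """Number of unique shape-mers divided by the protein length."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    return len(set(shape_mers)) / protein_length
```

Under this definition a long repetitive protein that reuses the same few shape-mers gets a low score, which fits the caption's description of repetitive proteins among the outliers.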
Extended Data Fig. 1. Distribution of functional darkness in UniProt and AFDB (version 4).
Functional brightness distribution in (a) UniRef50, (b) UniRef50 clusters with models in AFDB (which excludes long proteins, and those UniRef50 clusters composed solely of UniParc entries and viral proteins), (c) UniRef50 clusters whose best structural representative has an average pLDDT > 70, and (d) UniRef50 clusters whose best structural representative has an average pLDDT > 90. For each set, the percentage of fully dark UniRef50 clusters, and corresponding brightness bin, are highlighted in purple. The bar associated with functionally bright UniRef50 clusters (functional brightness >95%) is marked in white. (e) Percentage of fully dark UniRef50 clusters with proteins annotated as a domain of unknown function (DUF) in each set a–d.
Extended Data Fig. 2. Structural conservation and structure-based function prediction of TumE.
Structural superposition of five randomly selected members of component 159 (UniProt IDs A0A0E3S9F7, A0A3R7AQ40, A0A520JWH3, A0A1W9UY89, A0A7J4P9B0) with secondary structure elements labelled.
Extended Data Fig. 3. Testing the toxicity of putative TumA antitoxins.
Antitoxin expression plasmids were cotransformed with empty toxin expression vectors (pBAD33) into E. coli BW25113 cells. The bacterial cultures were started from a single colony and grown for five hours in liquid LB media supplemented with appropriate antibiotics. The cultures were normalised to OD600 = 1.0, serially diluted and spotted on LB agar plates containing appropriate antibiotics, 500 µM IPTG for antitoxin induction and 0.2% arabinose to mimic the conditions of the toxin neutralization assay. The experiment was performed in n = 3 biologically independent replicates.
Extended Data Fig. 4. Diversity of the (a) names predicted by ProtNLM and (b) their word composition, as well as the (c) fraction of structural outliers, for all fully dark and fully bright connected components.
Name diversity is calculated as the number of unique protein names within a component divided by the total number of component proteins. Word diversity is calculated as the number of unique words across all protein names within a component divided by the total number of words, ignoring the words “protein”, “domain”, “family”, “containing”, and “superfamily”. Outlier content is calculated as the percentage of UniRef50 clusters with negative structural outlier scores within that component. Fully bright and fully dark distributions were compared using a two-sided Kolmogorov–Smirnov test, resulting in a test statistic of 0.2915 and P-value = 8.8829 × 10^−16 for (b) and test statistic 0.05859 and P-value = 5.245 × 10^−81 for (c).
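The three per-component measures defined in this caption can be sketched as follows. The sketch assumes that names are split on whitespace and compared case-insensitively, which the caption does not specify, and the function names are hypothetical. The Kolmogorov–Smirnov comparison is not reproduced here.

```python
from typing import List, Sequence

# Words excluded from word diversity, as listed in the caption.
IGNORED_WORDS = {"protein", "domain", "family", "containing", "superfamily"}


def name_diversity(names: Sequence[str]) -> float:
    """Unique protein names divided by the number of proteins."""
    return len(set(names)) / len(names)


def word_diversity(names: Sequence[str]) -> float:
    """Unique words divided by total words, skipping the ignored words.

    Tokenisation (lower-casing, whitespace split) is an assumption.
    """
    words: List[str] = [
        w for name in names for w in name.lower().split()
        if w not in IGNORED_WORDS
    ]
    return len(set(words)) / len(words) if words else 0.0


def outlier_content(outlier_scores: Sequence[float]) -> float:
    """Percentage of clusters with a negative structural outlier score."""
    negatives = sum(1 for s in outlier_scores if s < 0)
    return 100.0 * negatives / len(outlier_scores)
```

A component in which every protein carries the same name has a name diversity close to zero, while a component with a different name for each protein has a name diversity of 1.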
Extended Data Fig. 5. The highly semantically diverse prophage-associated connected components 3,314 and 6,732.
(a) Sequence similarity network of homologs of members of connected component 3,314 and the tubulin-binding domain of TRAF3-interacting protein 1, as computed with CLANS at an E value threshold of 1 × 10^−5. Points represent individual proteins and grey lines BLASTp matches at an E-value better than 1 × 10^−4. Individual subclusters are labelled 1-2 and structural representatives are shown. For subcluster 1, 5 randomly selected structural representatives of component 3,314 are superposed (UniProt IDs A0A0F9A5W1, A0A0P9GTS8, A0A418VYX3, A0A2S5M855, A0A2K2VML8). For subcluster 2, the tubulin-binding domain of Chlamydomonas reinhardtii TRAF3-interacting protein 1 (PDB ID 5FMT, chain B) is shown. (b) Genomic context conservation of 30 sequences from subcluster 1 with a maximum sequence identity of 30%, as computed with GCsnap. (c) Structure superposition of component 6,732 representative (A0A098EYB0, purple) and mismatch restriction endonuclease EndoMS (PDB ID 5GKH, chain A, grey). The grey box indicates the active site pocket with conserved residues labelled. Note that the residue D165 corresponding to D86 is mutated to alanine in the PDB structure. Structural homologs were searched both with Foldseek, which resulted in a hit to Cas4 endonuclease PDB ID 8D3P with TM-score 0.34, and with Dali, which resulted in multiple hits to restriction endonucleases, the top-ranking with a Z-score of 8.2.
Extended Data Fig. 6. An example of substructure decomposition.
(a) An example AlphaFold protein model with its 6 most common shape-mers highlighted in different colours. Spheres mark the shape-mer central residue and backbone atoms within 4 Å are coloured. (b-g) Four random representatives of each selected shape-mer, obtained from CATH proteins with <20% sequence identity. Spheres depict positions within 8 residues in sequence and 10 Å spatially from the central residue.
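The neighbourhood shown by the spheres in panels b to g can be illustrated with a small sketch that selects residues within 8 positions in sequence and 10 Å in space of a central residue. It assumes one coordinate per residue (for example the Cα atom) and that both conditions must hold together, as the caption's wording suggests; the function name and defaults are hypothetical and this is not the authors' decomposition code.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # one coordinate per residue (assumed Cα), in Å


def neighbourhood(coords: Sequence[Point], centre: int,
                  max_seq_sep: int = 8, max_dist: float = 10.0) -> List[int]:
    """Indices of residues close to `centre` in both sequence and space."""
    origin = coords[centre]
    return [
        i for i, point in enumerate(coords)
        if abs(i - centre) <= max_seq_sep          # sequence window
        and math.dist(point, origin) <= max_dist   # spatial cut-off
    ]
```

For a fully extended chain with 3.8 Å between consecutive residues, only two residues on each side of the centre pass the 10 Å cut-off, so the spatial condition is the limiting one in that case.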
Extended Data Fig. 7. Shape-mer representations combined with FastText can discriminate between protein families.
(a) Cumulative distributions of sensitivity for homology detection on the SCOPe40 database of single-domain structures. True positives (TPs) are matches within the same SCOPe family, false positives (FPs) are matches between different folds. Sensitivity is the area under the ROC curve up to the first FP. Results based on shape-mer FastText Smith-Waterman alignment are shown in black. (b) Protein-level embedding distance measured as the cosine distance of FastText sentence vectors for proteins within the same SCOPe family (top) and from different SCOPe folds (bottom).
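The two quantities in this caption can be sketched as below. The sensitivity function assumes that the area under the ROC curve up to the first FP reduces to the fraction of a query's TPs ranked above its first FP, which is a common convention in SCOPe benchmarks but is an assumption here. Hits that are neither same-family nor different-fold are ignored, following the TP and FP definitions in the caption. All names are hypothetical.

```python
import math
from typing import Sequence


def sensitivity_up_to_first_fp(ranked_labels: Sequence[str], total_tps: int) -> float:
    """Fraction of true positives ranked before the first false positive.

    `ranked_labels` lists hits best-first as "TP", "FP" or any other
    string for hits that count as neither.
    """
    found = 0
    for label in ranked_labels:
        if label == "FP":
            break
        if label == "TP":
            found += 1
    return found / total_tps if total_tps else 0.0


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 minus the cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)
```

With these definitions, proteins from the same family would be expected to have small cosine distances between their sentence vectors, and proteins from different folds larger ones, as panel b compares.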

References

    1. Varadi M, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439–D444. doi: 10.1093/nar/gkab1061.
    2. Mistry J, et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2020;49:D412–D419. doi: 10.1093/nar/gkaa913.
    3. UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023;51:D523–D531. doi: 10.1093/nar/gkac1052.
    4. Richardson L, et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 2023;51:D753–D759. doi: 10.1093/nar/gkac1080.
    5. Boutet E, et al. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. Methods Mol. Biol. 2016;1374:23–54. doi: 10.1007/978-1-4939-3167-5_2.