Commun Biol. 2023 Feb 8;6(1):160.
doi: 10.1038/s42003-023-04488-9.

AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms

Nicola Bordin et al. Commun Biol. 2023.

Abstract

Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign), which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remainder cluster into 2367 putative novel superfamilies. Detailed manual analysis of 618 of these, each having at least one human relative, reveals extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increase the number of unique 'global' folds by 36%, and will provide valuable insights into structure-function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of the CATH-Assign protocol used to process the predicted AF2 domains.
CATH-HMM domains (labelled as CATH) are structurally compared against the non-redundant representative of the Superfamily that they match. Pfam and NewFam domains are classified into CATH Superfamilies using the CATHe predictor where possible. A cascade method is used to validate each prediction: structure scans are run first against non-redundant domains in the CATHe-predicted Superfamily, then against the predicted Topology, then the Architecture, and finally against domains from all superfamilies if necessary.
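
The cascade described in this caption can be sketched as a search that widens one CATH level at a time. The sketch below is illustrative only and is not the CATH-Assign code: the function names, the `representatives` mapping and the `is_structural_match` callback are hypothetical, the structure comparison itself is left to the caller, and returning the first accepted hit (rather than the best-scoring one) is a simplifying assumption. It relies only on the fact that a CATH superfamily ID has the form Class.Architecture.Topology.Homologous superfamily, as in 3.40.50.620.

```python
from typing import Callable, Dict, List, Optional, Tuple


def cath_levels(superfamily_id: str) -> List[str]:
    """Return the superfamily, topology and architecture prefixes of a CATH ID."""
    parts = superfamily_id.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected a C.A.T.H identifier, got {superfamily_id!r}")
    # 4 fields = superfamily, 3 = topology, 2 = architecture
    return [".".join(parts[:n]) for n in (4, 3, 2)]


def cascade_validate(
    query: str,
    predicted_superfamily: str,
    representatives: Dict[str, List[str]],
    is_structural_match: Callable[[str, str], bool],
) -> Optional[Tuple[str, str]]:
    """Scan representatives level by level, widening the search on each miss.

    `representatives` maps superfamily IDs to non-redundant domain IDs
    (hypothetical data layout). `is_structural_match` stands in for a real
    structure comparison and its acceptance threshold.
    Returns (matched superfamily, level at which it matched) or None.
    """
    searched = set()
    # "" is the final fallback: every superfamily not scanned so far.
    for prefix in cath_levels(predicted_superfamily) + [""]:
        for sf_id, domains in representatives.items():
            if sf_id in searched:
                continue  # already scanned at a narrower level
            if prefix and sf_id != prefix and not sf_id.startswith(prefix + "."):
                continue  # outside the current level of the hierarchy
            searched.add(sf_id)
            if any(is_structural_match(query, d) for d in domains):
                return sf_id, (prefix or "all")
    return None
```

Scanning narrow levels first means the expensive all-against-all comparison only runs for domains whose predicted superfamily, topology and architecture all fail to produce a match.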
Fig. 2. Average model quality.
The plots show the distribution of average pLDDT scores for domains divided by source. The pLDDT threshold for confident model quality is highlighted (≥70).
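
As a rough illustration of the threshold in this caption, the average pLDDT of a domain can be computed from per-residue scores and compared with 70. This is a minimal sketch under stated assumptions: the function names are hypothetical, the scores are assumed to be already extracted into a list ordered by residue, and domain boundaries are assumed to be 1-based and inclusive.

```python
from statistics import mean
from typing import Sequence

# Confidence threshold taken from the caption (average pLDDT >= 70).
CONFIDENT_PLDDT = 70.0


def average_plddt(per_residue_plddt: Sequence[float], start: int, stop: int) -> float:
    """Mean pLDDT over a domain given 1-based, inclusive residue boundaries."""
    segment = per_residue_plddt[start - 1:stop]
    if not segment:
        raise ValueError("domain boundaries select no residues")
    return mean(segment)


def is_confident(per_residue_plddt: Sequence[float], start: int, stop: int) -> bool:
    """True when the domain's average pLDDT meets the confidence threshold."""
    return average_plddt(per_residue_plddt, start, stop) >= CONFIDENT_PLDDT
```

Averaging over the domain rather than the whole chain lets a well-predicted domain pass even when the rest of the model is disordered.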
Fig. 3. CATH coverage of the AlphaFold2 dataset.
a Overview of domain quality and ontology for the total AlphaFold2 dataset and b subdivided by each proteome.
Fig. 4. Structural coverage expansion.
a Expansion in structural coverage by total number of structural domains and b fold-wise by validated CATH-HMM, Pfam and NewFam domain models for the 21 organisms in the AlphaFold2 dataset.
Fig. 5. Structural coverage expansion of CATH FunFams by AlphaFold domains.
Initial coverage by CATH/PDB (blue), additional coverage by AlphaFold2 models (orange) and unannotated (green).
Fig. 6. Functional diversity between protein families revealed using AlphaFold2: The HUP superfamily as an example.
Members of the HUP superfamily (CATH ID: 3.40.50.620) possess a common structural core comprising a Rossmann αβα-sandwich fold. Several functional families within the HUP superfamily that lack a representative PDB structure now have a representative domain from AlphaFold2, enabling characterisation of putative functional sites in these families. For example, the Phosphoadenosine phosphosulfate reductase-like protein family (PPR) (CATH FunFam ID: FunFam-348; EC: 1.8.4.8) has a high-quality AF2 domain available for the poorly studied PPR protein from Leishmania infantum (af_A4I3B1_2_215; pLDDT: 94.87). This protein has no close homologue in the PDB (>30% sequence identity). We compared the AF2 model with the representative PDB structure (1zunA) from its closest Functional Family in CATH, i.e., the Sulphate adenylyltransferase family (SAT) [FunFam-02, representative PDB: 1zunA; EC: 2.7.7.4]. a Superposition of structure representatives from FunFam-348-PPR and FunFam-02-SAT. Residues conserved in both families are coloured green; FunFam-specific residues are coloured blue (af_A4I3B1_2_215, for FunFam-348-PPR) and red (1zunA, for FunFam-02-SAT). b FunFam-02-SAT is involved in ATP hydrolysis, an essential process for its function. Analysis of conserved residues using Scorecons indicated that most active site residues (shown in red) are conserved differently between the two functional families. Moreover, there is a change in a catalytic residue site (indicated as blue *) in PPR, i.e., F(/L)209Y. The height of each residue indicates its degree of conservation. c Close-up view of differentially conserved positions between the families in the active site of FunFam-02-SAT (red) and FunFam-348-PPR (blue). The substrate molecule (AGS) in FunFam-02-SAT (1zunA) is shown in magenta. Chemically different residues are highlighted in the substrate-binding site (H61V) and catalytic site (F174Y) of FunFam-348. d Catalytic mechanisms for the two enzyme families.
Fig. 7. Issues encountered when processing domains not assigned to CATH.
Each structure figure was generated using UCSF Chimera, with identifiers in the format UniProt_ID/start-stop. Examples of poor models: a High proportion of unordered residues. b Presence of long unordered regions. c Residue packing problems. d Fewer than three secondary structures and packing problems.
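
Identifiers in the UniProt_ID/start-stop format mentioned in this caption are straightforward to split into an accession and domain boundaries. The parser below is a hypothetical helper written for illustration, not part of any published tool; it assumes the accession is purely alphanumeric and that boundaries are positive integers with start not exceeding stop.

```python
import re
from typing import NamedTuple


class DomainId(NamedTuple):
    uniprot_id: str
    start: int
    stop: int


# Assumed shape: alphanumeric accession, a slash, then "start-stop".
_DOMAIN_ID = re.compile(r"^([A-Za-z0-9]+)/(\d+)-(\d+)$")


def parse_domain_id(identifier: str) -> DomainId:
    """Split 'UniProt_ID/start-stop' into its accession and residue range."""
    match = _DOMAIN_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not in UniProt_ID/start-stop format: {identifier!r}")
    start, stop = int(match.group(2)), int(match.group(3))
    if start > stop:
        raise ValueError(f"start exceeds stop in {identifier!r}")
    return DomainId(match.group(1), start, stop)
```

For instance, `parse_domain_id("A4I3B1/2-215")` yields the accession `A4I3B1` with boundaries 2 and 215; that string is constructed here for illustration from the domain named in the Fig. 6 caption.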
Fig. 8. New structural superfamilies.
Each structure figure was generated using UCSF Chimera, with identifiers in the format UniProt_ID/start-stop. a Meiotic recombination protein REC102. b Transmembrane protein 82. c T-cell activation inhibitor, mitochondrial.
Fig. 9. Expansion in structural diversity in CATH by predicted AlphaFold structural models.
a Distribution of structural cluster sizes coloured by CATH class in CATH v4.3 and b expanded by AlphaFold structural models.
