Commun Biol. 2023 Feb 8;6(1):160.

doi: 10.1038/s42003-023-04488-9.

AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms

Nicola Bordin¹, Ian Sillitoe¹, Vamsi Nallapareddy¹, Clemens Rauer¹, Su Datt Lam¹,², Vaishali P Waman¹, Neeladri Sen¹, Michael Heinzinger³, Maria Littmann³, Stephanie Kim⁴,⁵, Sameer Velankar⁶, Martin Steinegger⁴,⁵, Burkhard Rost³,⁷,⁸, Christine Orengo⁹

Affiliations

¹ Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
² Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia.
³ TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, i12, Boltzmannstr 3, 85748, Garching/Munich, Germany.
⁴ School of Biological Sciences, Seoul National University, Seoul, South Korea.
⁵ Artificial Intelligence Institute, Seoul National University, Seoul, South Korea.
⁶ European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK.
⁷ Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany.
⁸ TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany.
⁹ Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK. c.orengo@ucl.ac.uk.

PMID: 36755055
PMCID: PMC9908985
DOI: 10.1038/s42003-023-04488-9

Nicola Bordin et al. Commun Biol. 2023.

Abstract

Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining models cluster into 2367 putative novel superfamilies. Detailed manual analysis of 618 of these, each having at least one human relative, reveals extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increase the number of unique 'global' folds by 36%, and will provide valuable insights into structure-function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Overview of the CATH-Assign protocol used to process the predicted AF2 domains.**
CATH-HMM domains (labelled as CATH) are structurally compared against the non-redundant representative of the superfamily that they match. Pfam and NewFam domains are classified into CATH superfamilies using the CATHe predictor where possible. A cascade method is used to validate these predictions, starting with structure scans against non-redundant domains in the CATHe-predicted superfamily, then the predicted topology, then the architecture, and finally domains from all superfamilies if necessary.
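
The cascade in the Fig. 1 caption can be read as a narrowest-to-broadest search that stops at the first level giving a hit. The following is only an illustrative sketch of that control flow, not the CATH-Assign implementation: the level names, the `scan` callback, the domain identifiers and the toy hit table are all hypothetical, and the real structure-scan step and its acceptance criteria are not specified in this record.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical labels for the search scopes named in the caption,
# ordered from the narrowest scope to the broadest.
CASCADE_LEVELS = ("superfamily", "topology", "architecture", "all")


def cascade_validate(
    domain: str,
    scan: Callable[[str, str], Optional[str]],
    levels: Sequence[str] = CASCADE_LEVELS,
) -> Optional[Tuple[str, str]]:
    """Return (level, matched_superfamily) for the first level with a hit.

    `scan(domain, level)` stands in for a structure scan restricted to the
    given scope; it returns a superfamily ID on success or None otherwise.
    """
    for level in levels:
        hit = scan(domain, level)
        if hit is not None:
            return level, hit
    return None  # no structural match at any level


# Toy stand-in for a structure scan (assumption: results are precomputed).
_TOY_HITS = {
    ("dom_A", "superfamily"): "3.40.50.620",
    ("dom_B", "architecture"): "3.40.50.620",
}


def toy_scan(domain: str, level: str) -> Optional[str]:
    return _TOY_HITS.get((domain, level))


print(cascade_validate("dom_A", toy_scan))
print(cascade_validate("dom_B", toy_scan))
print(cascade_validate("dom_C", toy_scan))
```

In this toy run, `dom_A` is matched at the first (superfamily) level, `dom_B` only once the search widens to the architecture level, and `dom_C` is left unassigned.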

**Fig. 2. Average model quality.**
The plots show the distribution of average pLDDT scores for domains divided by source. The pLDDT threshold for confident model quality is highlighted (≥70).
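
As a rough illustration of the quantity plotted in Fig. 2, the sketch below averages per-residue pLDDT over a domain's residue range and applies the ≥70 threshold from the caption. The function names and the 1-based inclusive range convention are assumptions made for this example; the record does not describe how the authors computed the averages.

```python
from statistics import mean
from typing import Sequence

PLDDT_CONFIDENT = 70.0  # threshold for confident model quality (Fig. 2 caption)


def domain_mean_plddt(per_residue_plddt: Sequence[float], start: int, stop: int) -> float:
    """Mean pLDDT over residues start..stop (1-based, inclusive; assumed convention)."""
    if not 1 <= start <= stop <= len(per_residue_plddt):
        raise ValueError("domain boundaries fall outside the model")
    return mean(per_residue_plddt[start - 1:stop])


def is_confident(per_residue_plddt: Sequence[float], start: int, stop: int) -> bool:
    """True if the domain's average pLDDT meets the confidence threshold."""
    return domain_mean_plddt(per_residue_plddt, start, stop) >= PLDDT_CONFIDENT


# Toy per-residue scores for a four-residue model (made-up values).
scores = [90.0, 80.0, 60.0, 40.0]
print(domain_mean_plddt(scores, 1, 2), is_confident(scores, 1, 2))
print(domain_mean_plddt(scores, 3, 4), is_confident(scores, 3, 4))
```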

**Fig. 3. CATH coverage of the AlphaFold2 dataset.**
a Overview of domain quality and ontology for the total AlphaFold2 dataset and b subdivided by each proteome.

**Fig. 4. Structural coverage expansion.**
a Expansion in structural coverage by total number of structural domains and b fold-wise by validated CATH-HMM, Pfam and NewFam domain models for the 21 organisms in the AlphaFold2 dataset.

**Fig. 5. Structural coverage expansion of CATH FunFams by AlphaFold domains.**
Initial coverage by CATH/PDB (blue), additional coverage by AlphaFold2 models (orange) and unannotated (green).

**Fig. 6. Functional diversity between protein families revealed using AlphaFold2: the HUP superfamily as an example.**
Members of the HUP superfamily (CATH ID: 3.40.50.620) possess a common structural core comprising a Rossmann αβα-sandwich fold. Several functional families within the HUP superfamily that lack a representative PDB structure now possess a representative domain from AlphaFold2, enabling characterisation of putative functional sites in their associated functional families. For example, the Phosphoadenosine phosphosulfate reductase-like protein family (PPR) (CATH FunFam ID: FunFam-348; EC: 1.8.4.8) has a high-quality AF2 domain available for the poorly studied PPR protein from Leishmania infantum (af_A4I3B1_2_215; pLDDT: 94.87). This protein has no close homologue in the PDB (>30% sequence identity). We compared the AF2 model with the representative PDB structure (1zunA) from its closest functional family in CATH, i.e. the Sulphate adenylyltransferase family (SAT) [FunFam-02, representative PDB: 1zunA; EC: 2.7.7.4]. a Superposition of structure representatives from FunFam-348-PPR and FunFam-02-SAT. Residues conserved in both families are coloured green; FunFam-specific residues are coloured blue (af_A4I3B1_2_215, for FunFam-348-PPR) and red (1zunA, for FunFam-02-SAT). b FunFam-02-SAT is involved in ATP hydrolysis, an essential process for its function. Analysis of conserved residues using Scorecons indicated that most active site residues (shown in red) are conserved differently between the two functional families. Moreover, there is a change in a catalytic site residue (indicated as blue *) in PPR, i.e. F(/L)209Y. The height of each residue indicates its degree of conservation. c Close-up view of differentially conserved positions between the families in the active site of FunFam-02-SAT (red) and FunFam-348-PPR (blue). The substrate molecule (AGS) in FunFam-02-SAT (1zunA) is shown in magenta. Chemically different residues are highlighted in the substrate-binding site (H61V) and catalytic site (F174Y) of FunFam-348. d Catalytic mechanisms for the two enzyme families.

**Fig. 7. Issues encountered when processing domains not assigned to CATH.**
Each structure figure was generated using UCSF Chimera, with identifiers in the format UniProt_ID/start-stop. Examples of poor models: a High proportion of unordered residues. b Presence of long unordered regions. c Residue packing problems. d Fewer than three secondary structures and packing problems.

**Fig. 8. New structural superfamilies.**
Each structure figure was generated using UCSF Chimera, with identifiers in the format UniProt_ID/start-stop. a Meiotic recombination protein REC102. b Transmembrane protein 82. c T-cell activation inhibitor, mitochondrial.

**Fig. 9. Expansion in structural diversity in CATH by predicted AlphaFold structural models.**
a Distribution of structural cluster sizes coloured by CATH class in CATH v4.3 and b expanded by AlphaFold structural models.

Comment in

Predicted protein structures expand the CATH database.
Singh A. Nat Methods. 2023 Apr;20(4):483. doi: 10.1038/s41592-023-01857-4. PMID: 37046018.
