Review
Trends Biochem Sci. 2023 Apr;48(4):345-359.
doi: 10.1016/j.tibs.2022.11.001. Epub 2022 Dec 9.

Novel machine learning approaches revolutionize protein knowledge


Nicola Bordin et al. Trends Biochem Sci. 2023 Apr.

Abstract

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.

Keywords: AI; AlphaFold2; embeddings; machine learning; pLM; protein structure prediction; structure alignment.

Conflict of interest statement

Declaration of interests: No interests are declared.

Figures

Figure 1
Overview of embedding applications in protein structure and function characterization. Images were retrieved from Wikipedia (alpha helix and beta strand, binding sites), Creative Proteomics (cell structure), and bioRxiv (Transmembrane Regions - CASP13 Target T1008, Structure Prediction - PDB 1Q9F) with permission from the authors.
Figure 2
Comparison of search sensitivity and speed for protein language models, sequence and profile-profile aligners, and structure aligners. Average sensitivity up to the fifth false positive (x-axis) for family, superfamily, and fold, measured on SCOP40e (version 2.01) [9], is plotted against the average time to search a single query against 100 million proteins (y-axis). For each SCOP40e domain, we computed the fraction of true positives detected at the family, superfamily, and fold levels up to the fifth false positive (FP; a hit from a different fold), then plotted the sensitivity averaged over all domains.
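The sensitivity measure described in this caption can be illustrated with a minimal sketch. It assumes each query's hits are already ranked best-first and labelled as true or false positives; the function names and the toy data are hypothetical, not taken from the review. It also ignores any hits that a real benchmark might leave unscored.

```python
from typing import Sequence, Tuple


def sensitivity_up_to_nth_fp(
    ranked_is_tp: Sequence[bool], total_tp: int, n_fp: int = 5
) -> float:
    """Fraction of all true positives ranked before the n-th false positive.

    ranked_is_tp: hits for one query, best first; True = true positive,
                  False = false positive (here: a hit from a different fold).
    total_tp:     number of true positives that exist for this query.
    """
    if total_tp <= 0:
        raise ValueError("total_tp must be positive")
    found_tp = 0
    seen_fp = 0
    for is_tp in ranked_is_tp:
        if is_tp:
            found_tp += 1
        else:
            seen_fp += 1
            if seen_fp == n_fp:
                break  # stop counting at the n-th false positive
    return found_tp / total_tp


def mean_sensitivity(queries: Sequence[Tuple[Sequence[bool], int]]) -> float:
    """Average the per-query sensitivity over all queries (domains)."""
    values = [sensitivity_up_to_nth_fp(hits, total) for hits, total in queries]
    return sum(values) / len(values)


# Toy data: two hypothetical queries.
query_a = ([True, True, False, True, False, False, False, True, False, True], 6)
query_b = ([True, True, True], 3)
average = mean_sensitivity([query_a, query_b])
```

For `query_a`, four of the six true positives appear before the fifth false positive, so its sensitivity is 4/6; the last true positive in the list is ranked after the cutoff and does not count.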
Figure 3
Visual analysis of the structure space spanned by CATH domains expanded by AlphaFold 2 (AF2) models. We showcase how distance in either structure space (left) or embedding space (middle and right) can be used to gain insight into large sets of proteins. Simply put, we used pairwise distances between proteins to summarize ~850 000 protein domains in a single 2D plot and colored them according to their CATH class and architecture. This exemplifies a general-purpose tool for breaking down the complexity of large sets of proteins; it allows, for example, the detection of outliers or of large-scale relationships that would otherwise be hard to find. More specifically, ~850 000 domains were structurally aligned using Foldseek [34] (left) in an all-versus-all fashion, resulting in a distance matrix that uses the average pairwise bitscore within a superfamily as the superfamily distance. The domain sequences were converted to embeddings using the ProtT5 (middle) and ProtTucker (right) protein language models (pLMs). As in the structural approach, the distance matrix between superfamilies was calculated using the average Euclidean distance between embeddings belonging to different superfamilies. Using different modalities (i.e., structure and sequence embeddings) to compute distances on the same set of proteins provides different, potentially orthogonal angles on the same problem, which can be helpful during hypothesis generation. The resulting distance matrices were used as precomputed inputs for uniform manifold approximation and projection (UMAP) [121] and plotted with seaborn [122].
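As a rough illustration of the embedding-based distance described in this caption, the sketch below averages Euclidean distances between embeddings from different superfamilies. It is a toy version under stated assumptions: the function name, the labels, and the 2D vectors are hypothetical, the diagonal is set to zero by choice, and the dense all-versus-all step would not scale to ~850 000 domains without batching.

```python
import numpy as np


def superfamily_distance_matrix(embeddings, labels):
    """Average Euclidean distance between embeddings of different superfamilies.

    embeddings: array of shape (n_domains, dim), one vector per domain.
    labels:     superfamily label for each domain, same order as embeddings.
    Returns (names, dist) where dist[i, j] is the mean distance between all
    domain pairs drawn from superfamily names[i] and superfamily names[j].
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))

    # Dense all-versus-all Euclidean distances between domains (toy scale only).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    pairwise = np.sqrt((diff ** 2).sum(axis=-1))

    dist = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i != j:  # diagonal left at zero (an assumption of this sketch)
                block = pairwise[np.ix_(labels == a, labels == b)]
                dist[i, j] = block.mean()
    return names, dist


# Toy data: four hypothetical domains in two hypothetical superfamilies.
toy_embeddings = [[0.0, 0.0], [0.0, 2.0], [3.0, 0.0], [3.0, 2.0]]
toy_labels = ["sf_A", "sf_A", "sf_B", "sf_B"]
names, dist = superfamily_distance_matrix(toy_embeddings, toy_labels)
```

A matrix like `dist` is the kind of precomputed input the caption refers to. With the umap-learn package installed and enough superfamilies, it could be passed to `umap.UMAP(metric="precomputed").fit_transform(dist)` to obtain 2D coordinates for plotting.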
Figure 4
New folds in CATH-AlphaFold 2 (AF2). Examples of novel folds not previously encountered in CATH or the Protein Data Bank (PDB). Structures are identified as novel folds if Foldseek, used as the comparison method, finds no significant structural similarity to domains or structures in the PDB. Each structure identifier is in the format UniProt_ID/start–stop and is shown with its current name in UniProt.
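The identifier format mentioned in this caption can be split into its parts with a short parser. This is a sketch only: the `DomainId` type, the function name, and the example accession are hypothetical, and accepting both a plain hyphen and an en dash as the range separator is an assumption.

```python
import re
from typing import NamedTuple


class DomainId(NamedTuple):
    uniprot_id: str
    start: int
    stop: int


# Accession, a slash, then start and stop separated by "-" or an en dash.
_DOMAIN_ID = re.compile(r"^(?P<acc>[^/\s]+)/(?P<start>\d+)[-\u2013](?P<stop>\d+)$")


def parse_domain_id(identifier: str) -> DomainId:
    """Split 'UniProt_ID/start-stop' into accession and residue boundaries."""
    match = _DOMAIN_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not in UniProt_ID/start-stop format: {identifier!r}")
    start, stop = int(match["start"]), int(match["stop"])
    if start > stop:
        raise ValueError(f"start is after stop in {identifier!r}")
    return DomainId(match["acc"], start, stop)


# Hypothetical identifier, using an en dash as in the caption.
example = parse_domain_id("P12345/10\u2013120")
```

Returning a named tuple keeps the boundaries as integers, so the domain length can be computed as `example.stop - example.start + 1`.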

References

    1. wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 2019;47:D520–D528.
    2. Liu J., Rost B. CHOP proteins into structural domain-like fragments. Proteins. 2004;55:678–688.
    3. Orengo C.A., Thornton J.M. Protein families and their evolution—a structural perspective. Annu. Rev. Biochem. 2005;74:867–900.
    4. Chothia C. Proteins. One thousand families for the molecular biologist. Nature. 1992;357:543–544.
    5. Orengo C.A., et al. Protein superfamilies and domain superfolds. Nature. 1994;372:631–634.