Bioinformatics. 2023 Apr 3;39(4):btad189.
doi: 10.1093/bioinformatics/btad189.

Structure-aware protein self-supervised learning

Can Sam Chen et al. Bioinformatics. 2023.

Abstract

Motivation: Protein representation learning methods have shown great potential for many downstream tasks in biological applications. A few recent studies have demonstrated that self-supervised learning is a promising solution to the shortage of labeled proteins, which is a major obstacle to effective protein representation learning. However, existing protein representation models are usually pretrained on protein sequences alone, without considering the important protein structural information.

Results: In this work, we propose a novel structure-aware protein self-supervised learning method to effectively capture the structural information of proteins. In particular, a graph neural network (GNN) model is pretrained to preserve protein structural information with self-supervised tasks from a pairwise residue distance perspective and a dihedral angle perspective. Furthermore, we propose to leverage an available protein language model, pretrained on protein sequences, to enhance the self-supervised learning. Specifically, we identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme. We conduct experiments on three downstream tasks: binary classification into membrane/non-membrane proteins, location classification into 10 cellular compartments, and enzyme-catalyzed reaction classification into 384 EC numbers. These experiments verify the effectiveness of the proposed method.
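As a rough illustration of the two pretraining objectives described above, the following sketch (not the authors' released code; the toy GNN, layer sizes, random graph, and plain regression losses are all simplifying assumptions, and the paper's actual tasks may, for example, discretize distances and angles into bins) runs one pretraining step of a residue-level GNN toward recovering pairwise Cα distances and per-residue (ϕ, ψ) dihedrals:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyResidueGNN(nn.Module):
    """One round of mean-aggregation message passing over a residue graph
    (hypothetical stand-in for the paper's specially designed GNN)."""
    def __init__(self, num_residue_types=20, dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_residue_types, dim)
        self.msg = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, residue_types, adj):
        h = self.embed(residue_types)                      # (n, dim)
        m = adj @ self.msg(h)                              # sum of neighbor messages
        m = m / adj.sum(dim=1, keepdim=True).clamp(min=1)  # mean aggregation
        return torch.relu(self.update(torch.cat([h, m], dim=-1)))

n, dim = 50, 64
gnn = ToyResidueGNN(dim=dim)
dist_head = nn.Linear(2 * dim, 1)   # pairwise residue distance head
angle_head = nn.Linear(dim, 4)      # per-residue (sin phi, cos phi, sin psi, cos psi)

# Fake pretraining example: residue types, a random adjacency standing in for
# a distance-cutoff residue graph, and the structural targets of the two tasks.
residue_types = torch.randint(0, 20, (n,))
adj = (torch.rand(n, n) < 0.1).float()
true_dist = torch.rand(n, n) * 20.0                    # Ca-Ca distances (Å), made up
true_angles = (torch.rand(n, 2) - 0.5) * 2 * torch.pi  # (phi, psi) in radians, made up
true_angle_feats = torch.cat([true_angles.sin(), true_angles.cos()], dim=-1)

h = gnn(residue_types, adj)                            # (n, dim) residue embeddings
pair = torch.cat([h.unsqueeze(1).expand(n, n, dim),
                  h.unsqueeze(0).expand(n, n, dim)], dim=-1)
loss = F.mse_loss(dist_head(pair).squeeze(-1), true_dist) \
     + F.mse_loss(angle_head(h), true_angle_feats)
loss.backward()  # an optimizer step on the GNN parameters would follow
print(float(loss))

Predicting (sin, cos) pairs rather than raw angles sidesteps the 2π wraparound of a direct angle regression; this is a common design choice, not necessarily the paper's.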

Availability and implementation: The AlphaFold2 database is available at https://alphafold.ebi.ac.uk/. The PDB files are available at https://www.rcsb.org/. The downstream task datasets are available at https://github.com/phermosilla/IEConv_proteins/tree/master/Datasets. The code of the proposed method is available at https://github.com/GGchen1997/STEPS_Bioinformatics.
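For readers assembling the structural pretraining inputs, a minimal sketch of fetching one predicted structure from the AlphaFold database is given below; the per-entry file URL pattern (AF-<accession>-F1-model_v4.pdb) and the example accession are assumptions that may change as the database is versioned:

import urllib.request

accession = "P69905"  # example UniProt accession (hemoglobin subunit alpha)
url = f"https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_v4.pdb"
urllib.request.urlretrieve(url, f"AF-{accession}.pdb")
print(f"saved AF-{accession}.pdb")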


Conflict of interest statement

None declared.

Financial Support: None declared.

Figures

Figure 1. Protein structure.
Figure 2. The dihedral angles ϕi and ψi.
Figure 3. Framework. The GNN model captures protein structural information with two self-supervised tasks: the pairwise distance prediction task and the dihedral angle prediction task. Furthermore, a pseudo bi-level optimization scheme identifies the relation between the protein LM and the GNN model by maximizing the mutual information, which enhances the self-supervised learning.
Figure 4. Ablation studies.
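The backbone dihedrals in Figure 2 are ordinary four-point torsion angles, so they can be recovered directly from atomic coordinates. The sketch below (with made-up coordinates rather than a parsed PDB file) computes ϕi over (Ci-1, Ni, CAi, Ci) and ψi over (Ni, CAi, Ci, Ni+1) using a standard torsion-angle formula:

import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle (radians) of the four 3D points p0-p1-p2-p3."""
    b0, b1, b2 = p0 - p1, p2 - p1, p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - (b0 @ b1) * b1  # component of b0 orthogonal to b1
    w = b2 - (b2 @ b1) * b1  # component of b2 orthogonal to b1
    return np.arctan2(np.cross(b1, v) @ w, v @ w)

rng = np.random.default_rng(0)
backbone = rng.normal(size=(3, 3, 3))  # (residue, atom N/CA/C, xyz), toy values
N, CA, C = backbone[:, 0], backbone[:, 1], backbone[:, 2]
i = 1  # the middle residue has both neighbors, so phi_i and psi_i are defined
phi = dihedral(C[i - 1], N[i], CA[i], C[i])
psi = dihedral(N[i], CA[i], C[i], N[i + 1])
print(np.degrees(phi), np.degrees(psi))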

References

    1. Almagro Armenteros JJ, Sónderby CK, Sónderby SK. et al. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics 2017;33:4049. - PubMed
    1. Anfinsen CB. Principles that govern the folding of protein chains. Science 1973;181:223–30. - PubMed
    1. Bepler T, Berger B. Learning protein sequence embeddings using information from structure. In: ICLR. 2019.
    1. Bepler T, Berger B.. Learning the protein language: evolution, structure, and function. Cell Syst 2021;12:654–69.e3. - PMC - PubMed
    1. Callaway E. Revolutionary cryo-EM is taking over structural biology. Nature 2020;578:201. - PubMed

Substances