Biomolecules. 2022 Nov 18;12(11):1709. doi: 10.3390/biom12111709.

GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction

Anowarul Kabir et al. Biomolecules. 2022.

Abstract

Protein language models (PLMs) have been shown to learn sequence representations useful for various prediction tasks, including subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways: a sequence transformer encodes protein sequences in a task-agnostic feature space, and a graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO-term representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown to be superior to recent representative GO prediction methods. The second major contribution of this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset with which to rigorously evaluate and compare methods and advance the state of the art.
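To make the two-branch design in the abstract concrete, the following is a minimal sketch (not the authors' code) of combining a pretrained sequence encoder's embeddings with transformer-refined GO-term embeddings for multi-label classification, assuming PyTorch. The dimensions, module names, dot-product scoring head, and the omission of hierarchy-aware attention masking are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a sequence-branch + GO-term-branch classifier (assumed structure).
import torch
import torch.nn as nn

class MultiModalGOClassifier(nn.Module):
    def __init__(self, num_go_terms: int, seq_dim: int = 1024, go_dim: int = 256):
        super().__init__()
        # (a) Protein sequences are assumed to arrive as fixed-size embeddings
        #     from a pretrained, task-agnostic sequence encoder (a "black-box" PLM).
        self.seq_proj = nn.Linear(seq_dim, go_dim)
        # (b) GO terms get learnable embeddings refined by a Transformer encoder,
        #     so each term's representation can attend to related terms.
        #     (The paper's graph transformer additionally respects the GO hierarchy.)
        self.go_embed = nn.Embedding(num_go_terms, go_dim)
        layer = nn.TransformerEncoderLayer(d_model=go_dim, nhead=4, batch_first=True)
        self.go_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, seq_emb: torch.Tensor) -> torch.Tensor:
        # seq_emb: (batch, seq_dim) pooled protein embeddings
        p = self.seq_proj(seq_emb)                                           # (batch, go_dim)
        go_ids = torch.arange(self.go_embed.num_embeddings, device=seq_emb.device)
        g = self.go_encoder(self.go_embed(go_ids).unsqueeze(0)).squeeze(0)   # (terms, go_dim)
        # Combine the two modalities: one logit per (protein, GO term) pair.
        return p @ g.T                                                       # (batch, num_go_terms)

# Multi-label training: one sigmoid/BCE target per GO term.
model = MultiModalGOClassifier(num_go_terms=150)
logits = model(torch.randn(8, 1024))
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (8, 150)).float())
```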

Keywords: gene ontology; multi-modal transformer; protein function.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. The proposed multi-modal Transformer model architecture: (a) the black-box protein sequence modeling encoder; (b) the GO-term representation module using the Transformer encoder architecture.
Figure 2. Distribution of the number of labels per protein for the BP, CC, and MF classes in the train, validation, and test sets, considering cutoff values of 150, 25, and 25, respectively. The test set is generated under the TDNK split, and the validation set is generated by randomly sampling 10% of the full dataset.
Figure 3. Training binary cross-entropy loss, validation binary cross-entropy loss, and validation Fmax for BP on each of the three dataset settings.
Figure 4. Training binary cross-entropy loss, validation binary cross-entropy loss, and validation Fmax for CC on each of the three dataset settings.
Figure 5. Training binary cross-entropy loss, validation binary cross-entropy loss, and validation Fmax for MF on each of the three dataset settings.
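The validation curves above report Fmax, the protein-centric maximum F-measure. Below is a sketch following the standard CAFA-style definition; this is an assumption about how the metric is computed here, and the paper may differ in details such as threshold granularity or the handling of proteins with no predictions.

```python
# CAFA-style protein-centric Fmax (assumed definition), using NumPy.
import numpy as np

def fmax(y_true: np.ndarray, y_scores: np.ndarray, thresholds=None) -> float:
    """y_true, y_scores: (num_proteins, num_go_terms) binary labels and predicted scores."""
    if thresholds is None:
        thresholds = np.arange(0.01, 1.0, 0.01)
    best = 0.0
    for t in thresholds:
        y_pred = y_scores >= t
        # Precision is averaged over proteins with at least one prediction at this threshold.
        has_pred = y_pred.sum(axis=1) > 0
        if not has_pred.any():
            continue
        tp = (y_pred & (y_true > 0)).sum(axis=1)
        prec = (tp[has_pred] / y_pred[has_pred].sum(axis=1)).mean()
        # Recall is averaged over all proteins (each assumed to have >= 1 true term).
        rec = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
        if prec + rec > 0:
            best = max(best, 2 * prec * rec / (prec + rec))
    return best
```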

