ProteinBERT: a universal deep-learning model of protein sequence and function

Nadav Brandes et al.

Bioinformatics. 2022 Apr 12;38(8):2102-2110. doi: 10.1093/bioinformatics/btac020.

Abstract

Summary: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.
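
Where the pretraining scheme combines language modeling with GO annotation prediction, the combined objective can be pictured with a minimal sketch, assuming token-level cross-entropy for sequence recovery, multi-label binary cross-entropy for the GO annotations and equal weighting of the two terms (the weighting and exact formulation are illustrative, not taken from the paper):

```python
import tensorflow as tf

def pretraining_loss(true_tokens, pred_token_logits, true_go, pred_go_probs):
    """Sketch of a combined pretraining objective (not the authors' exact code):
    (i) per-residue language-modeling loss over the amino-acid token vocabulary,
    (ii) per-protein multi-label loss over a fixed vocabulary of GO annotations."""
    lm = tf.keras.losses.sparse_categorical_crossentropy(
        true_tokens, pred_token_logits, from_logits=True)             # (batch, seq_len)
    go = tf.keras.losses.binary_crossentropy(true_go, pred_go_probs)  # (batch,)
    # Equal weighting of the two tasks is an assumption made for this sketch.
    return tf.reduce_mean(lm) + tf.reduce_mean(go)
```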

Availability and implementation: Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
The ProteinBERT architecture. ProteinBERT’s architecture is inspired by BERT. Unlike standard Transformers, ProteinBERT supports both local (sequential) and global data. The model consists of six transformer-like blocks manipulating local (left side) and global (right side) representations. Each block transforms these representations through fully connected layers and, for the local representations, convolutional layers, with skip connections and normalization layers between them. The local representations affect the global representations through a global attention layer, and the global representations affect the local representations through a broadcast fully connected layer.
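
To make the local/global data flow of the caption concrete, here is a minimal Keras sketch of one such block. It is an illustrative sketch rather than the published implementation: the layer sizes, kernel width and the single-head attention-style pooling are placeholder choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

def proteinbert_like_block(local_in, global_in, d_local=128, d_global=512):
    """One simplified local/global block in the spirit of Figure 1 (a sketch)."""
    # Local track: convolutional and fully connected layers on the
    # per-residue representation.
    x = layers.Conv1D(d_local, kernel_size=9, padding='same', activation='gelu')(local_in)
    x = layers.Dense(d_local, activation='gelu')(x)
    # Broadcast fully connected layer: project the global representation and
    # add it to every sequence position of the local track.
    broadcast = layers.Dense(d_local, activation='gelu')(global_in)
    x = x + layers.Reshape((1, d_local))(broadcast)
    local_out = layers.LayerNormalization()(layers.Add()([local_in, x]))

    # Global track: summarize the local track with attention-like weights
    # (a single-head stand-in for the paper's global attention layer).
    scores = layers.Softmax(axis=1)(layers.Dense(1)(local_out))
    pooled = layers.Lambda(lambda ts: tf.reduce_sum(ts[0] * ts[1], axis=1))([scores, local_out])
    g = layers.Dense(d_global, activation='gelu')(layers.Concatenate()([global_in, pooled]))
    global_out = layers.LayerNormalization()(layers.Add()([global_in, g]))
    return local_out, global_out

# Six such blocks wired end to end (input sizes are placeholders).
local_inputs = layers.Input(shape=(512, 128))   # per-residue (local) representation
global_inputs = layers.Input(shape=(512,))      # per-protein (global) representation
local, glob = local_inputs, global_inputs
for _ in range(6):
    local, glob = proteinbert_like_block(local, glob)
model = tf.keras.Model([local_inputs, global_inputs], [local, glob])
```
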
Fig. 2.
Pretraining loss. Training-set loss over the two pretraining tasks: (i) protein sequence language modeling, and (ii) GO annotation recovery. Losses were evaluated with input sequence lengths of 128, 512 or 1024 tokens on the first 100 batches of the dataset.
Fig. 3.
The impact of pretraining on downstream tasks. Performance of fine-tuned ProteinBERT models over the four TAPE benchmarks as a function of the amount of pretraining (measured by the number of processed proteins). Similar plots for all nine benchmarks are shown in Supplementary Figure S1.
Fig. 4.
Performance across sequence lengths. Test-set performance of fine-tuned ProteinBERT models with different input sequence lengths. Each sequence length (e.g. 512 or 1024) encodes proteins shorter than that length (e.g. a protein of 700 residues is encoded as a 1024-long sequence). Boxplot distributions are over the 371 pretraining snapshots used in Figure 3.
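
As a concrete illustration of this encoding scheme, here is a small sketch in which a 700-residue protein occupies a 1024-slot input. The special tokens and vocabulary below are assumptions made for the illustration, not the model's actual token set.

```python
# Illustrative fixed-length encoding of a protein sequence (a sketch).
AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
SPECIALS = ['<START>', '<END>', '<PAD>', '<OTHER>']   # assumed special tokens
TOKEN_TO_ID = {t: i for i, t in enumerate(list(AMINO_ACIDS) + SPECIALS)}

def encode(seq: str, seq_len: int) -> list[int]:
    """Encode a protein shorter than seq_len - 1 into exactly seq_len token ids."""
    ids = [TOKEN_TO_ID['<START>']]
    ids += [TOKEN_TO_ID.get(aa, TOKEN_TO_ID['<OTHER>']) for aa in seq]
    ids.append(TOKEN_TO_ID['<END>'])
    ids += [TOKEN_TO_ID['<PAD>']] * (seq_len - len(ids))   # pad up to the fixed length
    return ids

protein = 'M' * 700               # a 700-residue protein does not fit in 512 tokens,
encoded = encode(protein, 1024)   # so it is encoded with the next length, 1024
assert len(encoded) == 1024
```
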
Fig. 5.
Global attention before and after fine-tuning on signal peptide prediction. Global attention values obtained for two selected proteins: Outer membrane protein P.IIC (piiC) in Neisseria gonorrhoeae (top), and Gamma carbonic anhydrase-like 2, mitochondrial protein (GAMMACAL2) in Arabidopsis (bottom). piiC has a signal peptide at positions 1–25 (ending with the amino-acid sequence SAARA). GAMMACAL2 has no signal peptide. The left panels (red colors) show the attention values obtained by the generic ProteinBERT model after pretraining it as a language model on UniRef90 (but before fine-tuning it on any specific task). The heatmap shows the global attention values at each residue of the protein by each of the 24 attention heads of the model. The bar plot shows the total attention at each residue, obtained by summing the attention values across all attention heads. The right panels show the difference in attention values after fine-tuning ProteinBERT on the signal peptide task. The heatmap shows the increase (green) or decrease (purple) of attention across all positions and attention heads. The bar plot shows the total difference in attention at each residue, obtained by summing the differences across all attention heads. Note that the attention values of each attention head necessarily sum to 100%; accordingly, the differences sum to 0%.
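
The closing remark can be checked numerically: per-head attention over residues is a normalized distribution (a softmax is used as a stand-in here), so it sums to 100%, and the per-head difference between any two such distributions sums to 0%. The values below are random stand-ins, not model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def head_attention(scores):
    """Softmax over residue positions: a stand-in for one attention head."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

before = head_attention(rng.normal(size=200))   # "pretrained" model, one head
after = head_attention(rng.normal(size=200))    # "fine-tuned" model, same head
assert np.isclose(before.sum(), 1.0) and np.isclose(after.sum(), 1.0)   # each sums to 100%
assert np.isclose((after - before).sum(), 0.0)                          # differences sum to 0%
```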
