Review

Transformer-based deep learning for predicting protein properties in the life sciences

Abel Chandra et al. eLife. 2023 Jan 18;12:e82819. doi: 10.7554/eLife.82819

Abstract

Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and the number of proteins whose properties are known from lab experiments. Language models from the field of natural language processing have gained popularity for protein property prediction and have led to a new computational revolution in biology, where previous prediction results are regularly being improved upon. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field of natural language processing is growing quickly because of developments in a class of models based on a particular architecture, the Transformer model. We review recent developments and the use of large-scale Transformer models in applications for predicting protein characteristics, and how such models can be used to predict, for example, post-translational modifications. We review shortcomings of other deep learning models and explain how Transformer models have quickly proven to be a very promising way to unravel information hidden in sequences of amino acids.

Keywords: computational biology; deep learning; life sciences; machine learning; protein property prediction; systems biology; transformers.


Conflict of interest statement

AC, LT, TL, RG: No competing interests declared.

Figures

Figure 1. Two common prediction tasks in machine learning (ML) are classification and regression.
For illustration purposes, two-dimensional plots are used, but in reality the dimensionality is much higher. (A) Binary classification tasks are for samples that can be separated into two groups, called classes. For instance, the samples can be feature vectors describing proteins, where each protein is associated with one of two classes: a protein variant could be either stable or unstable (Fang, 2020), or a lysine residue could be phosphoglycerylated or non-phosphoglycerylated (Chandra et al., 2020). The ML task is to build a model that can determine the class of a new sample. (B) The multiclass classification task is performed when the proteins belong to one of multiple classes, for instance, predicting which structural class a protein belongs to (Chou and Zhang, 1995). (C) The regression task is for applications where we want to predict real-valued outputs, for example, the brightness of a fluorescent protein (Lu et al., 2021).
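To make the two task types concrete, the following is a minimal sketch in Python, assuming scikit-learn and NumPy are available; the synthetic features and labels are invented for illustration only and are not data from the review.

# Minimal sketch of classification vs regression (Figure 1), on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)

# (A) Binary classification: e.g. stable (1) vs unstable (0) protein variants,
#     each sample described by a small numeric feature vector.
X_cls = rng.normal(size=(100, 5))                   # 100 samples, 5 features
y_cls = (X_cls[:, 0] + X_cls[:, 1] > 0).astype(int) # invented class labels
clf = LogisticRegression().fit(X_cls, y_cls)
print("predicted class:", clf.predict(X_cls[:1]))

# (C) Regression: predict a real-valued property, e.g. fluorescence brightness.
X_reg = rng.normal(size=(100, 5))
y_reg = X_reg @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X_reg, y_reg)
print("predicted value:", reg.predict(X_reg[:1]))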
Figure 2. Three well-known deep learning models.
(A) Multilayer perceptrons (MLPs) are characterized by an input layer, several hidden layers, and an output layer. (B) Convolutional neural networks (CNNs) use convolution operations in their layers and learn filters that automatically extract features from the input data (e.g. from images, audio signals, time series, or protein sequences). At some point, the learned features are strung out as a vector, a step called flattening, and are often passed on to fully connected layers at the end. (C) A recurrent neural network (RNN) is a model that processes an input sequence step by step, one element at a time.
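As a rough sketch of the three architectures, the following assumes PyTorch; the layer sizes, kernel width, and the 20-channel one-hot amino-acid encoding are arbitrary placeholder choices, not settings from the review.

# Placeholder sketches of the three model families in Figure 2, using PyTorch.
import torch
import torch.nn as nn

# (A) Multilayer perceptron: input layer -> hidden layers -> output layer.
mlp = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 2))                    # two-class output

# (B) 1-D CNN over a one-hot encoded sequence (20 amino-acid channels); the
#     learned feature maps are flattened before the fully connected layer.
cnn = nn.Sequential(nn.Conv1d(20, 32, kernel_size=5, padding=2), nn.ReLU(),
                    nn.AdaptiveMaxPool1d(1), nn.Flatten(),
                    nn.Linear(32, 2))

# (C) RNN (here an LSTM) that consumes the sequence one element at a time.
rnn = nn.LSTM(input_size=20, hidden_size=64, batch_first=True)

x = torch.randn(8, 100, 20)                  # batch of 8 sequences, length 100
print(mlp(x.mean(dim=1)).shape)              # torch.Size([8, 2])
print(cnn(x.transpose(1, 2)).shape)          # torch.Size([8, 2])
out, (h, c) = rnn(x)
print(out.shape)                             # torch.Size([8, 100, 64])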
Figure 3. Illustrations of embeddings and of next and masked token predictions.
(A) An illustration of real-valued vector representations (input embeddings) of the tokens in a sample sentence. Each square represents a numerical value in the vector representation. The vector for each word in the sentence is obtained by looking up the word's unique ID in a vocabulary (in the illustration, the vocabulary size is 10,000 words). All word embeddings have the same size, called the embedding size, and every word must be present in the vocabulary. (B) The two main training approaches for protein language models, and specifically for Transformers. The top part illustrates autoregressive language modelling (predicting the next token), and the bottom part illustrates masked language modelling (predicting a few missing, or masked, tokens).
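A minimal sketch of the embedding lookup and of masked-token prediction, assuming PyTorch; the toy vocabulary, embedding size, and the untrained linear output head are placeholders rather than any of the models discussed in the review.

# Sketch of Figure 3: token IDs looked up in an embedding table, and a masked
# token scored against the vocabulary.
import torch
import torch.nn as nn

vocab = {"[MASK]": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# (A) Each token ID maps to a real-valued vector of size embedding_dim.
ids = torch.tensor([[vocab["the"], vocab["cat"], vocab["sat"]]])
vectors = embedding(ids)                     # shape: (1, 3, 8)

# (B) Masked language modelling: hide one token and score the whole vocabulary
#     at that position; training pushes the true token's score up. (In
#     autoregressive modelling the same head scores the *next* position.)
masked = torch.tensor([[vocab["the"], vocab["[MASK]"], vocab["sat"]]])
hidden = embedding(masked)                   # a real model would contextualize this
to_vocab = nn.Linear(8, len(vocab))          # output head over the vocabulary
logits = to_vocab(hidden)                    # shape: (1, 3, len(vocab))
predicted_id = logits[0, 1].argmax().item()  # prediction for the masked slot
print(predicted_id)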
Figure 4. An illustration of sequence-to-sequence models and of how to use the internal representations for downstream machine learning tasks.
(A) The conceptual idea behind sequence-to-sequence models. The Transformer model by Vaswani et al., 2017 has a similar form: it maps the input sequence to an output sequence using an encoder and a decoder. (B) An example application of the Transformer language model for protein property prediction. The input embedding is contextualized by the encoder block, which produces an internal representation, the model's embedding of the input sequence. This internal representation is then used as features of the amino acids and can be passed, in a second step, to a machine learning model. The decoder block is not normally used after training, since it serves little purpose in protein property prediction, but it is a critical component for training in natural language processing (NLP) applications such as language translation.
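The two-step workflow in panel (B) can be sketched as follows, assuming PyTorch and scikit-learn; the small, randomly initialized encoder here merely stands in for a pretrained protein language model, and the token sequences and property labels are invented placeholders.

# Sketch of Figure 4B: an encoder produces per-residue representations, which
# are then used as features for a separate downstream classifier.
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

d_model, vocab_size, seq_len = 64, 25, 50    # ~20 amino acids + special tokens

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)   # stand-in, untrained

tokens = torch.randint(0, vocab_size, (10, seq_len))   # 10 toy "protein" sequences
with torch.no_grad():
    reps = encoder(embed(tokens))            # contextualized: (10, seq_len, d_model)

# Step 2: use the internal representations as features for a classical model,
# here mean-pooled per sequence and passed to logistic regression.
features = reps.mean(dim=1).numpy()          # (10, d_model)
labels = np.array([0, 1] * 5)                # invented binary property labels
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(features[:3]))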
Figure 5. Visualizations of the attention weights in Transformer models.
(A) A visualization of the attention weights in a BERT model. The weights are from the first attention head of the eighth layer of the model; the model has a total of 12 layers and 12 attention heads. In this example, the model connects the words 'tom' and 'he' to the word 'sick' (darker lines indicate larger weights). Visualization inspired by BertViz (Vig, 2019a; https://github.com/jessevig/bertviz; Vig, 2022). (B) An attention weight visualization showing that a protein language model learned to put more weight from one residue onto four other residues in one layer. The shades of a particular colour (in horizontal order) correspond to an attention head in that layer of the Transformer. Darker shades indicate stronger attention and are hence shown with darker lines connecting the tokens.
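The weights drawn in these plots are the softmax-normalized scaled dot products between query and key vectors, one matrix per attention head and layer. A minimal sketch, assuming PyTorch and using tiny placeholder dimensions:

# Scaled dot-product attention: the weights matrix is what Figure 5 visualizes.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 6, 16                         # 6 tokens, key/query dimension 16
Q = torch.randn(seq_len, d_k)                # queries
K = torch.randn(seq_len, d_k)                # keys
V = torch.randn(seq_len, d_k)                # values

scores = Q @ K.T / d_k ** 0.5                # (seq_len, seq_len) similarity scores
weights = F.softmax(scores, dim=-1)          # each row sums to 1: attention weights
output = weights @ V                         # weighted mix of value vectors

# weights[i, j] is how much token i attends to token j; plotting these values
# (one matrix per head and layer) gives visualizations like those in Figure 5.
print(weights[0])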
Figure 6. Yearly number of publications on Google Scholar for the years 2017–2021 and extrapolated count for the year 2022.
(A) Results for the search query 'Transformer Language Model' and (B) results for the search query 'Life Science', counting publications that have cited the original Transformer paper by Vaswani et al., 2017.
Figure 7. The article counts for the three main disciplines (medicine, pharmacology, and biology) and the percentage breakdown of their sub-categories among Google Scholar papers citing the 'Attention is all you need' paper by Vaswani et al., 2017.
The search was based on the query 'Life Science' and included all scientific research papers from 2017 to 2022 (cut-off on 2022-06-23).

References

    1. Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics. 2010;2:433–459. doi: 10.1098/rsta.2015.0202. - DOI
    1. Alanis-Lobato G, Andrade-Navarro MA, Schaefer MH. HIPPIE v2. 0: enhancing meaningfulness and reliability of protein–protein interaction networks. Nucleic Acids Research. 2016;45:D408–D414. doi: 10.1093/nar/gkw985. - DOI - PMC - PubMed
    1. Albawi S, Mohammed TA, Al-Zawi S. Understanding of a convolutional neural network. 2017 International Conference on Engineering and Technology (ICET; 2017. - DOI
    1. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped blast and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. - DOI - PMC - PubMed
    1. Ammari MG, Gresham CR, McCarthy FM, Nanduri B. HPIDB 2.0: a curated database for host–pathogen interactions. Database 2016 - PMC - PubMed

Publication types