Review

The language of proteins: NLP, machine learning & protein sequences

Dan Ofer et al. Comput Struct Biotechnol J. 2021 Mar 25;19:1750-1758. doi: 10.1016/j.csbj.2021.03.022. eCollection 2021.

Abstract

Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have made dramatic progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit for many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding protein information as text and analyzing it with NLP techniques, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges at the intersection of NLP and protein research.
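As a minimal illustration of treating a protein sequence as text, the sketch below tokenizes an amino-acid string into overlapping k-mers (the protein analogue of n-grams). The sequence and the choice of k are invented for the example and do not come from the paper.

```python
# Minimal sketch: k-mer (n-gram) tokenization of a protein string.
# The sequence and k=3 are illustrative values, not from the paper.

def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a protein string into overlapping k-mer tokens."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

protein = "MKTAYIAKQR"  # amino acids written as single letters
print(kmer_tokenize(protein))
# ['MKT', 'KTA', 'TAY', 'AYI', 'YIA', 'IAK', 'AKQ', 'KQR']
```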

Keywords: Artificial neural networks; BERT; Bag of words; Bioinformatics; Contextualized embedding; Deep learning; Language models; Natural language processing; Tokenization; Transformer; Word embedding; Word2vec.


Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Fig. 1
Computational analysis of natural language and proteins. (A) Texts and proteins can be represented as strings of letters and processed with NLP methods to study local and global properties. (B) A common preprocessing step in NLP is the tokenization of text or protein sequences into distinct tokens, which are the atomic units of information. There are many different ways to tokenize text, e.g. as letters, words, or other substrings of equal or unequal length. (C) A bag-of-words representation counts the unique tokens in a text, turning every input text into a fixed-size vector. Subsequently, these vector representations can be analyzed with any machine-learning algorithm.
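A rough sketch of the bag-of-words pipeline in panel (C), reusing the hypothetical kmer_tokenize helper from above: each sequence is reduced to token counts over a shared vocabulary, so every input becomes a vector of the same length.

```python
# Sketch of a bag-of-words encoding over k-mer tokens (panel C).
# Sequences and helper names are illustrative.
from collections import Counter

def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequences = ["MKTAYIAKQR", "MKTLLLTLVV"]
counts = [Counter(kmer_tokenize(s)) for s in sequences]

# A shared, sorted vocabulary makes every vector the same length,
# so any standard machine-learning algorithm can consume them.
vocab = sorted(set().union(*counts))
vectors = [[c[token] for token in vocab] for c in counts]
print(len(vocab), vectors[0])
```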
Fig. 2
Language models. (A) Language models are trained on self-supervised tasks over huge corpora of unlabeled text. For example, in the masked language modeling task, some fraction of the tokens in the original text is masked at random, and the language model attempts to recover the masked tokens. (B) (Pre-)trained language models are commonly fine-tuned on downstream tasks over labeled text, through a standard supervised-learning approach. Fine-tuning is typically much faster than training a model from scratch and yields better performance, especially when labeled data are scarce.
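A toy sketch of the masked-language objective in panel (A): a random fraction of tokens is replaced with a mask symbol, and the (omitted) model would be trained to predict the originals at those positions. The 15% rate follows BERT's convention, and this simplifies BERT's full masking scheme; all names here are illustrative.

```python
# Sketch: building (masked input, prediction target) pairs for
# masked language modeling. The model itself is omitted.
import random

def mask_tokens(tokens, rate=0.15, mask="<MASK>"):
    masked, targets = [], []
    for token in tokens:
        if random.random() < rate:
            masked.append(mask)    # the model sees the mask symbol...
            targets.append(token)  # ...and must predict the original
        else:
            masked.append(token)
            targets.append(None)   # no prediction needed here
    return masked, targets

tokens = list("MKTAYIAKQR")  # character-level (single amino acid) tokens
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```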

References

    1. Akhtar Malik N., Southey Bruce R., Andrén Per E., Sweedler Jonathan V., Rodriguez-Zas Sandra L. Evaluation of Database Search Programs for Accurate Detection of Neuropeptides in Tandem Mass Spectrometry Experiments. J Proteome Res. 2012;11(12):6044–6055. doi: 10.1021/pr3007123. - DOI - PMC - PubMed
    1. Allam Ahmed, Nagy Mate, Thoma George, Krauthammer Michael. Neural networks versus logistic regression for 30 days all-cause readmission prediction. Sci Rep. 2019;9(1):9277. doi: 10.1038/s41598-019-45685-z. - DOI - PMC - PubMed
    1. Alley Ethan C., Khimulya Grigory, Biswas Surojit, AlQuraishi Mohammed, Church George M. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods. 2019;16(12):1315–1322. doi: 10.1038/s41592-019-0598-1. - DOI - PMC - PubMed
    1. Almagro Armenteros, Jose Juan, Alexander Rosenberg Johansen, Ole Winther, and Henrik Nielsen. Language Modelling for Biological Sequences – Curated Datasets and Baselines. BioRxiv 2020. March, 2020.03.09.983585. 10.1101/2020.03.09.983585.
    1. Almagro Armenteros, José Juan, Casper Kaae Sønderby, Søren Kaae Sønderby, Henrik Nielsen, and Ole Winther. 2017. “DeepLoc: Prediction of Protein Subcellular Localization Using Deep Learning.” Edited by John Hancock. Bioinformatics 33 (21): 3387–95. 10.1093/bioinformatics/btx431. - PubMed