The language of proteins: NLP, machine learning & protein sequences
- PMID: 33897979
- PMCID: PMC8050421
- DOI: 10.1016/j.csbj.2021.03.022
Abstract
Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in machine learning and deep learning, NLP methods have progressed rapidly. Here, we review the successes, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit for many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges at the intersection of NLP and protein research.
Keywords: Artificial neural networks; BERT; Bag of words; Bioinformatics; Contextualized embedding; Deep learning; Language models; Natural language processing; Tokenization; Transformer; Word embedding; Word2vec.
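To make the classic encodings named in the abstract concrete, the following is a minimal Python sketch of k-mer tokenization and a bag-of-words count over a protein sequence. It is an illustration only, not the authors' implementation: the function names `kmers` and `bag_of_kmers`, the choice of k = 3, the overlapping stride-1 window and the toy sequence are all assumptions made for this example.

```python
from collections import Counter


def kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers (sliding window, stride 1).

    Returns an empty list when the sequence is shorter than k.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_kmers(sequence: str, k: int = 3) -> Counter:
    """Count k-mer occurrences, discarding their order (bag-of-words)."""
    return Counter(kmers(sequence, k))


# Hypothetical toy sequence in one-letter amino-acid code.
toy_sequence = "MKTAYIAKQR"
print(kmers(toy_sequence, 3))
print(bag_of_kmers(toy_sequence, 3).most_common(3))
```

Treating each k-mer as a "word" lets standard text methods operate on proteins, at the cost of discarding long-range order; the contextualized embeddings and attention-based models mentioned in the abstract are aimed at that limitation.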
© 2021 The Author(s).
Conflict of interest statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.