Review

Generative models for protein sequence modeling: recent advances and future directions

Mehrsa Mardikoraem et al. Brief Bioinform. 2023 Sep 22;24(6):bbad358. doi: 10.1093/bib/bbad358.

Abstract

The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the scarcity of experimental fitness annotations underscores the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for navigating the protein fitness landscape. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks and diffusion models), (ii) guiding how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences, and (iii) highlighting several successful studies that apply these techniques in protein engineering (from paratope and subcellular localization prediction to the generation of high-fitness sequences and protein design rules). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications and current challenges, this study intends to offer structured guidance, a robust framework and a prospective outlook for the ML-driven protein engineering field.

Keywords: diffusion models; generative adversarial neural networks (GANs); generative machine learning (ML) models; natural language processing (NLP); protein engineering; variational autoencoders (VAE).

Figures

Figure 1
A diverse set of protein engineering applications benefits from the generative and discriminative potential of sequence models. These applications include stability, solubility, bioluminescence, binding capacity, phylogeny, gene ontology and protein localization. Schematics of the sequence models are shown here; their detailed descriptions are elaborated in the corresponding sections. An autoregressive model forecasts future values from previous values in time-series data. A VAE is a probabilistic modeling architecture containing an encoder (E) and a decoder (D): E compresses high-dimensional input data into a hidden dimension, and D reconstructs the data from it. Combined with variational inference techniques, this architecture learns the given data distribution and generates novel instances. In GANs, G is a generative model that aims to generate realistic data from a noise input, and D is a discriminator that acts as a critic to distinguish real data from model-generated data. Diffusion models are a relatively new class of generative model that produces novel samples starting from a state of maximum randomness (X_T), which is itself obtained by iteratively adding random noise to the data distribution. These models have been demonstrated in diverse applications, ranging from antibody binding and protein localization prediction to novel protein sequence generation.
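As a concrete illustration of the diffusion process sketched in Figure 1, the following minimal Python/PyTorch snippet shows the forward noising step and the standard noise-prediction training objective on a toy continuous representation of sequence data. The linear noise schedule, the tiny denoiser network and the 20-dimensional toy inputs are illustrative assumptions, not the reviewed models themselves.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention up to step t

def forward_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0): iterative Gaussian noising collapses to a single step."""
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1)
    return torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * noise, noise

class Denoiser(nn.Module):
    """Toy network that predicts the noise added at step t (stand-in for a real architecture)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))
    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t.float().view(-1, 1) / T], dim=-1))

model = Denoiser(dim=20)                         # e.g. 20-dim amino-acid representation (assumed)
x0 = torch.randn(8, 20)                          # stand-in for embedded sequence positions
t = torch.randint(0, T, (8,))
x_t, true_noise = forward_noise(x0, t)
loss = nn.functional.mse_loss(model(x_t, t), true_noise)   # denoising training objective
```

Generation then starts from pure noise (X_T) and repeatedly applies the trained denoiser to move back toward the data distribution.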
Figure 2
The architecture of generative recurrent neural networks versus the autoregressive model. (A) An autoregressive (AR) model has a structure similar to a recurrent neural network (RNN). However, while an RNN conditions directly only on the current time step and its hidden state, an AR model explicitly uses previous time steps together with the current one to predict the next token. (B) Two important RNN architectures for resolving the vanishing gradient problem when training on sequence data are the LSTM and the GRU. These networks contain gates to control information flow: the LSTM has three gates (input, forget and output), whereas the GRU has two (reset and update). Note that C indicates the cell state and h the hidden state in the shown architectures.
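To make the autoregressive next-token idea in Figure 2 concrete, here is a minimal sketch (assuming PyTorch and a toy 21-token amino-acid vocabulary) of a gated-RNN language model trained to predict each residue from its prefix; the dimensions and random batch are placeholders.

```python
import torch
import torch.nn as nn

VOCAB = 21  # 20 amino acids + a padding/unknown token (assumed)

class ARLanguageModel(nn.Module):
    def __init__(self, vocab=VOCAB, embed=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)  # gated cell (input/forget/output gates)
        self.head = nn.Linear(hidden, vocab)                  # logits over the next residue

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))   # hidden state h carries the sequence context
        return self.head(h)                    # per-position prediction of the next token

model = ARLanguageModel()
seq = torch.randint(0, VOCAB, (4, 50))         # toy batch: 4 sequences of length 50
logits = model(seq[:, :-1])                    # condition on the prefix
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1)  # targets are the shifted tokens
)
```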
Figure 3
Visualization of attention mapping and attention computation. (A) Depending on the protein fold, amino acids at different positions have varied epistatic effects on each other. The highlighted circle marks a query amino acid in a protein active site, and the color gradient shows how attention captures the influence of other amino acids (tokens) on the queried token. (B) Attention computation requires three components: query, key and value. By calculating scaled dot-product attention scores, the model chooses which regions of the sequence to prioritize for the prediction task.
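The scaled dot-product computation in panel (B) can be written out in a few lines; this NumPy sketch uses toy dimensions and random query/key/value projections purely for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays of query, key and value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: attention map over positions
    return weights @ V, weights                               # weighted sum of values + attention map

# Toy example: 5 residue positions with 8-dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
context, attn = scaled_dot_product_attention(Q, K, V)
# attn[i, j] is how much position j contributes when computing the output for queried position i.
```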
Figure 4
Architecture overview of the transformer and two important transformer-based language models: Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-Training (GPT). The transformer uses an encoder-decoder architecture for language tasks, whereas BERT uses only encoder blocks and GPT includes only decoder blocks. The difference in architecture stems mainly from their training objectives: in pretraining, BERT takes a bidirectional approach while GPT follows an autoregressive method.
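One way to see the difference described above is through the attention masks and training targets each model family uses; the toy NumPy sketch below (with a hypothetical 6-residue sequence and a '?' stand-in for the [MASK] token) contrasts a causal, GPT-style mask with BERT-style bidirectional attention plus masked-token prediction.

```python
import numpy as np

L = 6  # toy sequence length

# Decoder-only (GPT-like): lower-triangular causal mask, so position i attends only to j <= i
# and the model is trained to predict the next token autoregressively.
causal_mask = np.tril(np.ones((L, L), dtype=bool))

# Encoder-only (BERT-like): full bidirectional attention; the training signal instead comes
# from hiding random tokens and asking the model to recover them.
bidirectional_mask = np.ones((L, L), dtype=bool)

tokens = np.array(list("MKTAYI"))      # toy amino-acid sequence (hypothetical)
masked = tokens.copy()
masked[[1, 4]] = "?"                   # '?' stands in for the [MASK] token
print(causal_mask.astype(int))
print(masked)                          # ['M' '?' 'T' 'A' '?' 'I']
```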
Figure 5
Probabilistic sequence modeling is feasible via an encoder-decoder architecture and variational inference. A parameterized distribution is fit to the given sequence data, and new sequences are generated by sampling from the learned distribution. The VAE architecture consists of an encoder q(z|x) that maps the input x to a latent variable z and a decoder p(x|z) that maps z back to x.
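The sketch below (assuming PyTorch; the sequence length, vocabulary size and network widths are illustrative) shows the components named in Figure 5: the encoder q(z|x), the reparameterization trick, the decoder p(x|z) and the reconstruction-plus-KL objective from variational inference.

```python
import torch
import torch.nn as nn

class SequenceVAE(nn.Module):
    def __init__(self, seq_len=50, vocab=21, latent=16):
        super().__init__()
        d = seq_len * vocab
        self.enc = nn.Sequential(nn.Linear(d, 256), nn.ReLU())
        self.mu, self.logvar = nn.Linear(256, latent), nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, d))
        self.seq_len, self.vocab = seq_len, vocab

    def forward(self, x_onehot):                                      # x_onehot: (batch, seq_len, vocab)
        h = self.enc(x_onehot.flatten(1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)       # reparameterization trick
        logits = self.dec(z).view(-1, self.seq_len, self.vocab)       # p(x|z) over residues
        return logits, mu, logvar

def elbo_loss(logits, x_tokens, mu, logvar):
    recon = nn.functional.cross_entropy(logits.transpose(1, 2), x_tokens, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())      # KL(q(z|x) || N(0, I))
    return recon + kl

model = SequenceVAE()
x_tokens = torch.randint(0, 21, (4, 50))                              # toy batch of sequences
logits, mu, logvar = model(nn.functional.one_hot(x_tokens, 21).float())
loss = elbo_loss(logits, x_tokens, mu, logvar)

# New sequences are generated by sampling z ~ N(0, I) and decoding:
new_logits = model.dec(torch.randn(3, 16)).view(-1, 50, 21)
new_tokens = new_logits.argmax(dim=-1)   # or sample from the per-position softmax
```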
Figure 6
GAN architecture for generating sequence data: a model that learns to sample from the given data distribution using two separate, opposed networks, a generator and a discriminator. The generator aims to produce synthetic data from noise that the discriminator cannot distinguish from real data; the discriminator, in turn, is optimized to tell synthetic data apart from real data. As the two networks evolve together, the model eventually generates samples very similar to the real training data.
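The adversarial training loop described above can be sketched in a few lines; this illustrative PyTorch snippet uses tiny fully connected networks and random stand-in data, and a real sequence GAN would additionally need a trick such as Gumbel-softmax to handle discrete residues.

```python
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 20 * 50   # e.g. flattened one-hot of a length-50 sequence (assumed)
G = nn.Sequential(nn.Linear(noise_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))  # generator
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))          # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, data_dim)                 # stand-in for real training data
z = torch.randn(32, noise_dim)

# Discriminator step: label real samples 1 and generated samples 0.
fake = G(z).detach()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: try to make D label generated samples as real.
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```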
