CodonBERT large language model for mRNA vaccines

Sizhen Li et al. Genome Res. 2024 Aug 20;34(7):1027-1035. doi: 10.1101/gr.278870.123.
Abstract

mRNA-based vaccines and therapeutics are gaining popularity and usage across a wide range of conditions. One of the critical issues when designing such mRNAs is sequence optimization. Even small proteins or peptides can be encoded by an enormously large number of mRNAs. The actual mRNA sequence can have a large impact on several properties, including expression, stability, immunogenicity, and more. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained using more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts. CodonBERT can also be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods, including on a new flu vaccine data set.
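A defining design choice of CodonBERT is tokenizing coding sequences at the codon level rather than the nucleotide level. As a minimal illustration of that idea, the Python sketch below splits a coding sequence into non-overlapping 3-nt tokens; the function name, the RNA-to-DNA normalization, and the error handling are assumptions for illustration, not the published CodonBERT tokenizer.

# Minimal sketch of codon-level tokenization (assumed helper, not the
# published CodonBERT tokenizer; special tokens and vocabulary are omitted).
from typing import List

def codon_tokenize(cds: str) -> List[str]:
    """Split a coding sequence into non-overlapping codons (3-nt tokens)."""
    cds = cds.upper().replace("U", "T")  # normalize RNA to a DNA alphabet
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

print(codon_tokenize("AUGGCCAAAUAA"))  # ['ATG', 'GCC', 'AAA', 'TAA']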


Figures

Figure 1. Pretraining data distribution and CodonBERT model architecture. (A) Hierarchically classified mRNA sequences used for pretraining. All 14 leaf-level classes are shown (those annotated with an asterisk are numbered). The angle of each segment is proportional to the number of sequences in that group. (B) Model architecture and training scheme deployed for the two CodonBERT tasks. (C) The stack of 12 transformer blocks employed in the CodonBERT model.
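Figure 1C describes a stack of 12 transformer blocks over codon tokens. A minimal PyTorch sketch of such an encoder stack is shown below; the vocabulary size, hidden size, and head count are assumptions for illustration, not the published CodonBERT hyperparameters.

# Minimal sketch of a 12-block transformer encoder over codon tokens
# (assumed hyperparameters; not the published CodonBERT configuration).
import torch
import torch.nn as nn

vocab_size, hidden, heads, layers = 69, 768, 12, 12  # 64 codons + special tokens (assumed)
embed = nn.Embedding(vocab_size, hidden)
block = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=layers)

tokens = torch.randint(0, vocab_size, (1, 50))  # one sequence of 50 codon tokens
hidden_states = encoder(embed(tokens))          # shape (1, 50, hidden)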
Figure 2. Genetic code and evolutionary taxonomy information learned by the pretrained, unsupervised CodonBERT model. High-dimensional embeddings were projected into two-dimensional space using UMAP (McInnes et al. 2018). (A,B) Projected codon embeddings from the pretrained CodonBERT model. Each point represents a codon in a different context, and its color corresponds to the codon (A) or its amino acid (B). (C) Projected sequence embeddings from the pretrained CodonBERT model. Each point is an mRNA sequence, and its color represents the sequence label. (D) Projected codon embeddings from the pretrained Codon2vec model. Each point shows a codon, and its color is the corresponding amino acid.
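The projections in Figure 2 use UMAP to reduce high-dimensional embeddings to two dimensions. A minimal sketch with the umap-learn package (McInnes et al. 2018) is shown below, assuming embeddings have already been extracted from the model; the array shape and random seed are placeholders, not values from the paper.

# Minimal sketch of the 2D projection used in Figure 2, assuming embeddings
# have already been extracted from the pretrained model; shapes are placeholders.
import numpy as np
import umap  # umap-learn package (McInnes et al. 2018)

embeddings = np.random.rand(1000, 768)      # stand-in for codon or sequence embeddings
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)  # (1000, 2) coordinates for plotting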
Figure 3. Comparison to prior methods (TF-IDF, Codon2vec, RNABERT, and RNA-FM) and fine-tuning of CodonBERT on downstream data sets. (A) Given an input corpus of m mRNA sequences, TF-IDF is used to construct a feature matrix, which is fed to a random forest regression model. (B) A TextCNN model learns task-specific nucleotide or codon representations. The model can fine-tune pretrained representations by initializing its embedding layer with stacked codon or nucleotide embeddings extracted from pretrained language models (Codon2vec, RNABERT, and RNA-FM); n is the number of codons in the input sequence, and d is the dimension of the token embedding. As a baseline, plain TextCNN initializes the embedding layer from a standard normal distribution. (C) The pretrained CodonBERT model is fine-tuned directly on a given downstream task, keeping all parameters trainable.
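The TF-IDF baseline in Figure 3A can be approximated with standard scikit-learn components. The sketch below treats each codon as a term and fits a random forest regressor on the resulting feature matrix; the toy corpus, labels, and hyperparameters are assumptions for illustration, not the settings used in the paper.

# Minimal sketch of the TF-IDF + random forest baseline in Figure 3A
# (toy data and hyperparameters are assumed, not those used in the paper).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor

corpus = ["ATG GCC AAA TAA", "ATG GCA AAG TAA"]  # toy codon-tokenized mRNAs
labels = [0.7, 0.3]                              # toy property values (e.g., expression)

vectorizer = TfidfVectorizer(token_pattern=r"\S+")  # treat each codon as a term
X = vectorizer.fit_transform(corpus)                # m x vocabulary feature matrix
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, labels)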

