CodonBERT large language model for mRNA vaccines

Sizhen Li et al. Genome Res. 2024 Aug 20;34(7):1027-1035. doi: 10.1101/gr.278870.123.
Abstract

mRNA-based vaccines and therapeutics are gaining popularity and usage across a wide range of conditions. One of the critical issues when designing such mRNAs is sequence optimization. Even small proteins or peptides can be encoded by an enormously large number of mRNAs. The actual mRNA sequence can have a large impact on several properties, including expression, stability, immunogenicity, and more. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained using more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts. CodonBERT can also be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods, including on a new flu vaccine data set.
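A defining design choice of CodonBERT is tokenizing coding sequences at the codon level rather than the nucleotide level. As a minimal illustration of that idea, the Python sketch below splits a coding sequence into non-overlapping 3-nt tokens; the function name, the RNA-to-DNA normalization, and the error handling are assumptions for illustration, not the published CodonBERT tokenizer.

# Minimal sketch of codon-level tokenization (assumed helper, not the
# published CodonBERT tokenizer; special tokens and vocabulary are omitted).
from typing import List

def codon_tokenize(cds: str) -> List[str]:
    """Split a coding sequence into non-overlapping codons (3-nt tokens)."""
    cds = cds.upper().replace("U", "T")  # normalize RNA to a DNA alphabet
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

print(codon_tokenize("AUGGCCAAAUAA"))  # ['ATG', 'GCC', 'AAA', 'TAA']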


Figures

Figure 1. Pretraining data distribution and CodonBERT model architecture. (A) Hierarchically classified mRNA sequences used for pretraining. All 14 leaf-level classes are shown (those annotated with an asterisk are numbered). The angle of each segment is proportional to the number of sequences in that group. (B) Model architecture and training scheme deployed for the two CodonBERT tasks. (C) The stack of 12 transformer blocks employed in the CodonBERT model.
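Figure 1C describes a stack of 12 transformer blocks over codon tokens. A minimal PyTorch sketch of such an encoder stack is shown below; the vocabulary size, hidden size, and head count are assumptions for illustration, not the published CodonBERT hyperparameters.

# Minimal sketch of a 12-block transformer encoder over codon tokens
# (assumed hyperparameters; not the published CodonBERT configuration).
import torch
import torch.nn as nn

vocab_size, hidden, heads, layers = 69, 768, 12, 12  # 64 codons + special tokens (assumed)
embed = nn.Embedding(vocab_size, hidden)
block = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=layers)

tokens = torch.randint(0, vocab_size, (1, 50))  # one sequence of 50 codon tokens
hidden_states = encoder(embed(tokens))          # shape (1, 50, hidden)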
Figure 2. Genetic code and evolutionary taxonomy information learned by the pretrained, unsupervised CodonBERT model. High-dimensional embeddings were projected into two-dimensional space using UMAP (McInnes et al. 2018). (A,B) Projected codon embeddings from the pretrained CodonBERT model. Each point represents a codon in a different context, and its color corresponds to the codon (A) or its amino acid (B). (C) Projected sequence embeddings from the pretrained CodonBERT model. Each point is an mRNA sequence, and its color represents the sequence label. (D) Projected codon embeddings from the pretrained Codon2vec model. Each point shows a codon, and its color is the corresponding amino acid.
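The projections in Figure 2 use UMAP to reduce high-dimensional embeddings to two dimensions. A minimal sketch with the umap-learn package (McInnes et al. 2018) is shown below, assuming embeddings have already been extracted from the model; the array shape and random seed are placeholders, not values from the paper.

# Minimal sketch of the 2D projection used in Figure 2, assuming embeddings
# have already been extracted from the pretrained model; shapes are placeholders.
import numpy as np
import umap  # umap-learn package (McInnes et al. 2018)

embeddings = np.random.rand(1000, 768)      # stand-in for codon or sequence embeddings
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)  # (1000, 2) coordinates for plotting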
Figure 3. Comparison to prior methods (TF-IDF, Codon2vec, RNABERT, and RNA-FM) and fine-tuning of CodonBERT on downstream data sets. (A) Given an input corpus of m mRNA sequences, TF-IDF is used to construct a feature matrix, which is fed to a random forest regression model. (B) A TextCNN model learns task-specific nucleotide or codon representations. The model can fine-tune pretrained representations by initializing its embedding layer with stacked codon or nucleotide embeddings extracted from pretrained language models (Codon2vec, RNABERT, and RNA-FM); n is the number of codons in the input sequence, and d is the dimension of the token embedding. As a baseline, plain TextCNN initializes the embedding layer from a standard normal distribution. (C) The pretrained CodonBERT model is fine-tuned directly on a given downstream task, keeping all parameters trainable.
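The TF-IDF baseline in Figure 3A can be approximated with standard scikit-learn components. The sketch below treats each codon as a term and fits a random forest regressor on the resulting feature matrix; the toy corpus, labels, and hyperparameters are assumptions for illustration, not the settings used in the paper.

# Minimal sketch of the TF-IDF + random forest baseline in Figure 3A
# (toy data and hyperparameters are assumed, not those used in the paper).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestRegressor

corpus = ["ATG GCC AAA TAA", "ATG GCA AAG TAA"]  # toy codon-tokenized mRNAs
labels = [0.7, 0.3]                              # toy property values (e.g., expression)

vectorizer = TfidfVectorizer(token_pattern=r"\S+")  # treat each codon as a term
X = vectorizer.fit_transform(corpus)                # m x vocabulary feature matrix
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, labels)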

