[Preprint]. 2022 Nov 23:2022.10.10.511571.
doi: 10.1101/2022.10.10.511571.

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics


Maxim Zvyagin et al. bioRxiv.

Abstract

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators, utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and a peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, paving the way toward realizing this approach on large biological data.

Keywords: AI; COVID-19; HPC; Large language models; SARS-CoV-2; whole genome analyses.


Figures

Figure 1:
Overview of GenSLM models for predictive modeling of SARS-CoV-2 evolution. The inputs to GenSLM are nucleotide sequences encoded at the codon level (every three nucleotides represent a codon; hence the language of the 20 natural amino acids is described by 64 codons). These inputs are fed through successive transformer blocks (referred to as layers, Li), which ultimately learn a semantic embedding space z from which one may obtain the probability of any given sequence token, p(X_i | X_[N]\{i}), where N represents the sequence length and i represents a particular position in the entire genome.
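The codon-level encoding described above can be sketched in a few lines. This is an illustrative reimplementation of the idea (triplets of nucleotides mapped to a 64-token vocabulary), not the paper's actual tokenizer; the names `CODON_VOCAB` and `tokenize_codons` are assumptions.

```python
from itertools import product

# 64 possible codons built from the 4 nucleotides (A, C, G, T),
# each assigned an integer token id in lexicographic order.
CODON_VOCAB = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3))}

def tokenize_codons(sequence: str) -> list[int]:
    """Split a nucleotide sequence into non-overlapping triplets (codons)
    and map each codon to its integer token id."""
    sequence = sequence.upper()
    usable = len(sequence) - len(sequence) % 3  # drop a trailing partial codon
    return [CODON_VOCAB[sequence[i:i + 3]] for i in range(0, usable, 3)]

tokens = tokenize_codons("ATGGCTTAA")  # start codon, Ala, stop codon
```

A transformer would then consume these integer ids exactly as it consumes word-piece ids in natural-language modeling.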
Figure 2:
GenSLMs' learned latent space describes biologically meaningful properties of SARS-CoV-2 genomes. (A) The embeddings from GenSLMs are visualized with t-distributed stochastic neighbor embedding (t-SNE), with each gene sequence represented as a dot in the 2D plot. We color each sequence by its variant ID; although more than 515 PANGO (Rambaut et al., 2020) lineages are represented in the data, we only show those with WHO-designated labels. (B) The latent space can also be colored by the MAFFT-determined alignment score (Yamada et al., 2016) with respect to an Omicron genome; clustering in the distance measures is clearly visible. Visualizing the sequence log-likelihood (blue bars) and the cross-protein attention (orange lines) for the (C) Delta and (D) Omicron SARS-CoV-2 strains highlights how different the co-evolutionary patterns are in these lineages. It is interesting to note that while the Spike protein from the Delta strain shows coupling to nsp3, nsp5, and other proteins, these couplings are not observed in the Omicron strain.
Figure 3:
Illustration of diffusion-based hierarchical modeling. To predict a codon (such as TAA), we use both the previous codons within the context window (we use size 3 shown in green for illustration) and the high-level representations z.
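The hierarchical prediction in Figure 3, conditioning the next codon on both a short local context window and a global high-level representation z, can be sketched as a simple linear combination of the two signals. Everything here (the weight matrices, the one-hot encoding, the additive combination) is an illustrative assumption, not the paper's diffusion-based model.

```python
import numpy as np

VOCAB = 64    # codon vocabulary size
WINDOW = 3    # local context window (size 3, as in the figure)
Z_DIM = 8     # dimensionality of the high-level representation z

rng = np.random.default_rng(0)
W_local = rng.normal(size=(WINDOW * VOCAB, VOCAB))  # local-context weights
W_z = rng.normal(size=(Z_DIM, VOCAB))               # global-latent weights

def next_codon_logits(context: list, z: np.ndarray) -> np.ndarray:
    """Score all 64 candidate codons from the one-hot local context
    plus a contribution from the global latent z."""
    onehot = np.zeros(WINDOW * VOCAB)
    for slot, tok in enumerate(context[-WINDOW:]):
        onehot[slot * VOCAB + tok] = 1.0
    return onehot @ W_local + z @ W_z

z = rng.normal(size=Z_DIM)
logits = next_codon_logits([14, 39, 48], z)
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over the 64 codons
```

The point of the sketch is the factorization: the local term sees only the last WINDOW codons, while z carries global, genome-level information that a bounded context window cannot.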
Figure 4:
Diffusion-based hierarchical modeling of SARS-CoV-2 genomes results in the generation of sequences that capture the correct context of the various open reading frames (ORFs). (A) Comparison of statistics measured on generated sequences and on real data for the ORFs. The diffusion-based hierarchical LM has a global high-level plan, whereas the baseline can only take into account the previous 1023 codons. (B) Generated sequences (light blue) from the model overlaid on the phylogenetic tree demonstrate that these sequences are similar to observed strains.
Figure 5:
Conceptual overview of our workflow. A “Thinker” orchestrates data flow between two applications, the sequence generator and the Bayesian optimizer, to drive the generated sequences toward a target property using reward-guided beam search, where μ represents the mixing constant used to balance the reward function against the log-likelihood of generating the next token.
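The reward-guided beam search described above scores each candidate extension by its log-likelihood plus μ times a reward. A minimal sketch follows, assuming toy stand-ins for the language model and the reward (the real workflow uses GenSLM likelihoods and a learned target property; the functions below are purely illustrative).

```python
VOCAB = ["A", "C", "G", "T"]

def log_likelihood(seq: str) -> float:
    """Toy LM: rewards repeating the previous symbol (stand-in for GenSLM)."""
    return sum(0.0 if a == b else -1.0 for a, b in zip(seq, seq[1:]))

def reward(seq: str) -> float:
    """Toy target property: GC content of the sequence."""
    return sum(s in "GC" for s in seq) / max(len(seq), 1)

def beam_search(start: str, steps: int, beam_width: int = 2,
                mu: float = 0.5) -> list:
    """Keep the beam_width extensions maximizing log_likelihood + mu*reward."""
    beams = [start]
    for _ in range(steps):
        candidates = [seq + tok for seq in beams for tok in VOCAB]
        candidates.sort(key=lambda s: log_likelihood(s) + mu * reward(s),
                        reverse=True)
        beams = candidates[:beam_width]
    return beams

best = beam_search("G", steps=4, mu=0.5)  # best[0] == "GGGGG"
```

Raising μ pushes the search toward the target property at the cost of sequence plausibility; μ = 0 recovers plain likelihood-driven beam search.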
Figure 6:
(A) Scaling results on the Polaris and Selene systems for MSL = 2048; (B) scaling behavior of DDP vs. DeepSpeed runs on Selene; (C) scaling results on the Polaris and Selene systems for MSL = 10240.
Figure 7:
Workflow utilization measured by the number of active workers (applications actively serving requests) as a function of workflow runtime measured on 224 nodes of Polaris (896 A100 GPUs). The warm-able application design realizes 97% utilization, enabling 1.9X more sequences to be generated compared to a cold start baseline.

