This is a preprint.
GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics
- PMID: 36451881
- PMCID: PMC9709791
- DOI: 10.1101/2022.10.10.511571
Abstract
We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. To our knowledge, GenSLMs thus represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators, utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and a peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, paving the way to applying this approach to large biological datasets.
Keywords: AI; COVID-19; HPC; Large language models; SARS-CoV-2; whole genome analyses.