Generative prediction of real-world prevalent SARS-CoV-2 mutation with in silico virus evolution

Xudong Liu¹, Zhiwei Nie^{1,2}, Haorui Si^{3,4,5}, Xurui Shen⁵, Yutian Liu^{2,6}, Xiansong Huang², Tianyi Dong^{5,7,8}, Fan Xu², Zhixiang Ren², Peng Zhou^{3,5}, Jie Chen^{1,2}

Affiliations

¹ School of Electronic and Computer Engineering, Peking University, Shenzhen, China.
² Pengcheng Laboratory, Shenzhen, China.
³ Guangzhou Medical University, Guangzhou, China.
⁴ State Key Laboratory of Respiratory Disease, Guangzhou, China.
⁵ Guangzhou National Laboratory, Guangzhou, China.
⁶ School of Computer Science, Peking University, Beijing, China.
⁷ State Key Laboratory of Virology and Biosafety, Wuhan Institute of Virology, Chinese Academy of Sciences, Wuhan, China.
⁸ University of Chinese Academy of Sciences, Beijing, China.

PMID: 40532108
PMCID: PMC12204194
DOI: 10.1093/bib/bbaf276

Xudong Liu et al. Brief Bioinform. 2025 May 1;26(3):bbaf276. doi: 10.1093/bib/bbaf276.

Abstract

Predicting the mutation prevalence trends of emerging viruses in the real world is an efficient means to update vaccines or drugs in advance. It is crucial to develop a computational method for the prediction of real-world prevalent SARS-CoV-2 mutations considering the impact of multiple selective pressures within and between hosts. Here, a deep-learning generative framework for real-world prevalent SARS-CoV-2 mutation prediction, named ViralForesight, is developed on top of protein language models and in silico virus evolution. Through the paradigm of host-to-herd in silico virus evolution, ViralForesight reproduced previous real-world prevalent SARS-CoV-2 mutations for multiple lineages with superior performance. More importantly, ViralForesight correctly predicted the future prevalent mutations that dominated the COVID-19 pandemic in the real world more than half a year in advance with in vitro experimental validation. Overall, ViralForesight demonstrates a proactive approach to the prevention of emerging viral infections, accelerating the process of discovering future prevalent mutations with the power of generative deep learning.

Keywords: generative deep learning; in silico virus evolution; mutation prediction; protein language model.

Figures

**Figure 1**
Our motivation and methodology. (a) Illustration of SARS-CoV-2 evolution from the host level to the herd level. After undergoing intra-host evolution, the variants spread through the transmission bottleneck in a single transmission event (left panel), thus forming a large number of transmission chains, in which lineages with selective advantages become dominant (right panel). (b) Our host-to-herd selective pressure simulation strategy. Intra-host adaptive evolution of SARS-CoV-2 involves the main host-level selective pressures, including ACE2 binding affinity, expression, and antibody escape, among which host-level antibody escape is upgraded to a herd-level immunity barrier to integrate the herd-level selective pressures on SARS-CoV-2. (c) The methodology of our deep-learning generative prediction framework for potential prevalent SARS-CoV-2 mutations. The large number of variants produced by the variant generator are subjected to host-to-herd selective pressure simulation, which recommends potential future real-world prevalent mutations from the vast evolutionary fitness landscape.

**Figure 2**
Module details. (a) PLM fine-tuning module, in which variant sequences of specific SARS-CoV-2 lineages carrying real-time evolutionary information are used to fine-tune the pretrained PLM. (b) Mutated-site-guided variant generation module, where the probability of each site being mutated in the real-time SARS-CoV-2 evolutionary trajectory is used as the probability of that site being masked (where to mutate), and the PLM then predicts the residue type to mutate to (how to mutate). (c) Host-to-herd selective pressure screening module, in which the *in silico* generated variant sequences are screened through the selective pressures on SARS-CoV-2 at the host level (expression prediction model) and the herd level (quantified antibody barrier model). (d) Illustration of the proposed quantified antibody barrier model. Based on the DMS data, monoclonal antibodies isolated from COVID-19 convalescent individuals are divided into multiple groups according to antigenic epitope, and the herd escape score of a variant is obtained as the average escape score of each of its mutations for each group.
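The site-guided masking in panel (b) and the herd escape score in panel (d) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the per-site probabilities, the escape values, the antibody and group names, and the helper functions `sample_mask_sites`, `group_escape`, and `herd_escape_score` are all hypothetical. It also assumes that sites are masked independently and that the caption's averaging means "average within each epitope group, then across groups", which the caption does not spell out.

```python
import random
from statistics import mean

# Hypothetical per-site mutation probabilities (site -> probability of being masked).
SITE_PROBS = {346: 1.0, 478: 0.5, 501: 0.0}

# Hypothetical DMS-style escape scores: antibody -> {mutation: escape score}.
ESCAPE = {
    "mAb_A1": {"R346T": 0.8, "T478E": 0.1},
    "mAb_A2": {"R346T": 0.6, "T478E": 0.3},
    "mAb_B1": {"R346T": 0.0, "T478E": 0.9},
}

# Hypothetical grouping of antibodies by antigenic epitope.
GROUPS = {"group_A": ["mAb_A1", "mAb_A2"], "group_B": ["mAb_B1"]}


def sample_mask_sites(site_probs, rng):
    """Pick sites to mask, each independently with its own probability (assumption)."""
    return sorted(site for site, p in site_probs.items() if rng.random() < p)


def group_escape(mutation, antibodies, escape):
    """Mean escape score of one mutation over the antibodies of one epitope group."""
    return mean(escape[ab].get(mutation, 0.0) for ab in antibodies)


def herd_escape_score(mutations, groups, escape):
    """Average the per-mutation group scores within each group, then across groups."""
    if not mutations:
        return 0.0
    per_group = [
        mean(group_escape(m, antibodies, escape) for m in mutations)
        for antibodies in groups.values()
    ]
    return mean(per_group)


if __name__ == "__main__":
    rng = random.Random(0)
    print("masked sites:", sample_mask_sites(SITE_PROBS, rng))
    print("herd escape:", herd_escape_score(["R346T", "T478E"], GROUPS, ESCAPE))
```

In the framework described by the caption, the masked sites would be passed to the fine-tuned PLM to predict replacement residues; that step is omitted here because it depends on the specific model.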

**Figure 3**
Ablation experiments. (a) Ablation experiments for variant generation scale under two types of quantified antibody barrier models. The x-axis represents the number of generated variants with an interval of 50 000 and the y-axis represents the escape capability increment referring to the difference in the average herd escape score of the top K () variants sorted by scores in two consecutive generation experiments. (b) The remaining number of variants at different screening stages of three repeated experiments under two types of quantified antibody barrier models. “Stage I” refers to the stage of initial variant generation, “Stage II” refers to the stage where the variants are screened by the expression prediction model, and “Stage III” refers to the stage where the variants are further screened by the quantified antibody barrier model. (c) The PCC in mutation rankings across three repeated experiments. (d) The ranking trends of previous real-world prevalent mutations (target mutations) for BA.2.1 (left panel) and BA.5.1 (right panel) across different screening stages.
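The two quantities in panels (a) and (c) can be sketched as follows. This is a hedged illustration only: the function names `top_k_mean`, `escape_increments`, and `pearson` are hypothetical, the score lists are made up, and K is left as a free parameter because its value is not recoverable from the caption.

```python
from math import sqrt


def top_k_mean(scores, k):
    """Average herd escape score of the k highest-scoring variants."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


def escape_increments(score_sets, k):
    """Difference in top-k mean score between consecutive generation experiments."""
    means = [top_k_mean(scores, k) for scores in score_sets]
    return [later - earlier for earlier, later in zip(means, means[1:])]


def pearson(x, y):
    """Pearson correlation coefficient (PCC) between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    # Hypothetical herd escape scores from three increasingly large generation runs.
    runs = [[0.1, 0.5, 0.3], [0.1, 0.5, 0.3, 0.7], [0.1, 0.5, 0.3, 0.7, 0.2]]
    print(escape_increments(runs, k=2))
    # Hypothetical ranks of the same mutations in two repeated experiments.
    print(pearson([1, 2, 3, 4], [1, 3, 2, 4]))
```

An increment that shrinks toward zero as the generation scale grows would suggest that producing more variants no longer changes the top-ranked set, which is the kind of saturation an ablation over generation scale is meant to detect.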

**Figure 4**
Reproduction of previous real-world prevalent SARS-CoV-2 mutations. (a and b) Sequence logos of the mutations predicted by our ViralForesight (a) and the state-of-the-art method MLAEP (b) at the sites carrying previous real-world prevalent mutations, with BA.2.1 (left panel) or BA.5.1 (right panel) as the starting lineage. A previous prevalent mutation ranked within the top 100 by ViralForesight or MLAEP is considered correctly predicted and is colored purple. For MLAEP, the sequence logos at sites other than R346 are enlarged for clarity. (c) Ranking improvement of correctly predicted previous real-world prevalent mutations with BA.2.1 (left panel) or BA.5.1 (right panel) as the starting lineage, in which mutations ranked after 100 are uniformly represented by “>100.” (d) Annotated phylogenetic tree, where previous real-world prevalent mutations correctly predicted by our ViralForesight are annotated on the corresponding lineages. The “x” suffix represents a collection of related lineages.
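The top-100 criterion used in panels (a) to (c) amounts to a simple rank cutoff. The sketch below shows one way to express it; the function names and the example ranking are hypothetical, and only the cutoff of 100 and the “>100” label come from the caption.

```python
def rank_label(rank, cutoff=100):
    """Display a 1-based rank, collapsing anything past the cutoff to '>cutoff'."""
    return str(rank) if rank <= cutoff else f">{cutoff}"


def correctly_predicted(ranking, targets, cutoff=100):
    """Map each target mutation to True if it appears within the top `cutoff` ranks."""
    top = set(ranking[:cutoff])
    return {mutation: mutation in top for mutation in targets}


if __name__ == "__main__":
    # Hypothetical ranked list of predicted mutations, best first.
    ranking = ["R346T", "T478E", "N460K"]
    print(correctly_predicted(ranking, ["R346T", "N460K"], cutoff=2))
    print(rank_label(7), rank_label(250))
```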

**Figure 5**
*In vitro* validation and real-world prevalence analysis of recommended SARS-CoV-2 mutations. (a) VSV-FLuc-based pseudotyped virus entry. Error bars show the standard deviation of the means from six biological repeats, and the data were analyzed by Student’s t-test. (b) Neutralization assay with convalescent sera. The neutralizing IC50 of serum samples against different spike-pseudotyped viruses was calculated by measuring FLuc activity. (c) Summary of entry efficiency and serum dilution. Entry efficiency (upper panel) and serum dilution required to reach IC50 (lower panel) of the predicted mutated spike pseudoviruses are compared with the original XBB.1.5 spike pseudovirus. For the entry assay, yellow indicates higher viral infectivity; for the neutralization assay, yellow indicates stronger immune evasion. (d) Real-world prevalence analysis of the recommended representative T478E mutation, including the prevalence of T478E from outbreak.info and GISAID in June 2024 (left panel) and the proportion of detected variants from the CDC COVID Data Tracker for the collection week of June 24, 2024 (right panel).
