Nat Commun. 2025 May 13;16(1):4236.

doi: 10.1038/s41467-025-59422-w.

A protein language model for exploring viral fitness landscapes

Jumpei Ito^{1,2}, Adam Strange³, Wei Liu^#^{3,4,5}, Gustav Joas^#^{3,6}, Spyros Lytras^{3,7}; Genotype to Phenotype Japan (G2P-Japan) Consortium; Kei Sato^{8,9,10,11,12,13}

Collaborators

Genotype to Phenotype Japan (G2P-Japan) Consortium:
Keita Matsuno, Naganori Nao, Hirofumi Sawa, Keita Mizuma, Isshu Kojima, Jingshu Li, Tomoya Tsubo, Shinya Tanaka, Masumi Tsuda, Lei Wang, Yoshikata Oda, Zannatul Ferdous, Kenji Shishido, Takasuke Fukuhara, Tomokazu Tamura, Rigel Suzuki, Saori Suzuki, Shuhei Tsujino, Hayato Ito, Yu Kaku, Naoko Misawa, Arnon Plianchaisuk, Ziyi Guo, Alfredo A Hinay Jr, Kaoru Usui, Wilaiporn Saikruang, Keiya Uriu, Yusuke Kosugi, Shigeru Fujita, Jarel Elgin M Tolentino, Luo Chen, Lin Pan, Wenye Li, Mai Suganami, Mika Chiba, Ryo Yoshimura, Kyoko Yasuda, Keiko Iida, Naomi Ohsumi, Shiho Tanaka, Kaho Okumura, Kazuhisa Yoshimura, Kenji Sadamas, Mami Nagashima, Hiroyuki Asakura, Isao Yoshida, So Nakagawa, Akifumi Takaori-Kondo, Kotaro Shirakawa, Kayoko Nagata, Ryosuke Nomura, Yoshihito Horisawa, Yusuke Tashiro, Yugo Kawai, Kazuo Takayama, Rina Hashimoto, Sayaka Deguchi, Yukio Watanabe, Yoshitaka Nakata, Hiroki Futatsusako, Ayaka Sakamoto, Naoko Yasuhara, Takao Hashiguchi, Tateki Suzuki, Kanako Kimura, Jiei Sasaki, Yukari Nakajima, Hisano Yajima, Takashi Irie, Ryoko Kawabata, Kaori Sasaki-Tabata, Terumasa Ikeda, Hesham Nasse, Ryo Shimizu, Mst Monira Begum, Michael Jonathan, Yuka Mugita, Sharee Leong, Otowa Takahashi, Kimiko Ichihara, Takamasa Ueno, Chihiro Motozono, Mako Toyoda, Akatsuki Saito, Maya Shofa, Yuki Shibatani, Tomoko Nishiuchi, Jiri Zahradni, Prokopios Andrikopoulos, Miguel Padilla-Blanco, Aditi Konar

Affiliations

¹ Division of Systems Virology, Department of Microbiology and Immunology, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan. jampei@g.ecc.u-tokyo.ac.jp.
² International Research Center for Infectious Diseases, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan. jampei@g.ecc.u-tokyo.ac.jp.
³ Division of Systems Virology, Department of Microbiology and Immunology, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan.
⁴ Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland.
⁵ Swiss Institute of Bioinformatics, Geneva, Switzerland.
⁶ Division of Immunology and Respiratory Medicine, Department of Medicine, Karolinska Institutet, Stockholm, Sweden.
⁷ MRC-University of Glasgow Centre for Virus Research, Glasgow, UK.
⁸ Division of Systems Virology, Department of Microbiology and Immunology, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan. KeiSato@g.ecc.u-tokyo.ac.jp.
⁹ International Research Center for Infectious Diseases, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan. KeiSato@g.ecc.u-tokyo.ac.jp.
¹⁰ MRC-University of Glasgow Centre for Virus Research, Glasgow, UK. KeiSato@g.ecc.u-tokyo.ac.jp.
¹¹ Graduate School of Medicine, The University of Tokyo, Tokyo, Japan. KeiSato@g.ecc.u-tokyo.ac.jp.
¹² International Vaccine Design Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan. KeiSato@g.ecc.u-tokyo.ac.jp.
¹³ Collaboration Unit for Infection, Joint Research Center for Human Retrovirus Infection, Kumamoto University, Kumamoto, Japan. KeiSato@g.ecc.u-tokyo.ac.jp.

^# Contributed equally.

PMID: 40360496
PMCID: PMC12075601
DOI: 10.1038/s41467-025-59422-w

Jumpei Ito et al. Nat Commun. 2025.

Abstract

Successively emerging SARS-CoV-2 variants lead to repeated epidemic surges through escalated fitness (i.e., relative effective reproduction number between variants). Modeling the genotype-fitness relationship enables us to pinpoint the mutations boosting viral fitness and flag high-risk variants immediately after their detection. Here, we present CoVFit, a protein language model adapted from ESM-2, designed to predict variant fitness based solely on spike protein sequences. CoVFit was trained on genotype-fitness data derived from viral genome surveillance and functional mutation assays related to immune evasion. CoVFit successfully ranked the fitness of unknown future variants harboring nearly 15 mutations with informative accuracy. CoVFit identified 959 fitness elevation events throughout SARS-CoV-2 evolution until late 2023. Furthermore, we show that CoVFit is applicable for predicting viral evolution through single amino acid mutations. Our study gives insight into the SARS-CoV-2 fitness landscape and provides a tool for efficiently identifying SARS-CoV-2 variants with higher epidemic risk.

Conflict of interest statement

Competing interests: J.I. has consulting fees and honoraria for lectures from Takeda Pharmaceutical Co. Ltd. Spyros Lytras has consulting fees from EcoHealth Alliance. K.S. has consulting fees from Moderna Japan Co., Ltd and Takeda Pharmaceutical Co. Ltd, and honoraria for lectures from Gilead Sciences, Inc., Moderna Japan Co., Ltd, and Shionogi & Co., Ltd. The other authors declare no competing interests.

Figures

**Fig. 1. Overview of CoVFit.**
a Conceptual framework of CoVFit. CoVFit is a protein language model designed to predict the fitness (relative R_e) of SARS-CoV-2 variants based on their S protein sequences. b Outline of the training process used to develop CoVFit model instances.

**Fig. 2. Prediction performance of CoVFit.**
a Spearman’s correlation scores for predicted relative fitness values and mAb neutralization escape scores. Scores from five cross-validation folds are shown as dots, with the mean represented by a bar and the standard deviation by an error bar. The correlation for mAbs was calculated in each epitope group. b Scatter plot for fitness prediction, aggregating results from five-fold cross-validation. Each dot denotes the result for one viral genotype in a specific country. Dots are colored by Nextclade clade. The relative fitness value was scaled so that the 0.1 percentile and 99.9 percentile points fall between 0 and 1. A dashed line with a slope of 1 and an intercept of 0 is shown. c Scatter plot inherited from (b) but colored by the emergence date of each genotype. Source data are provided as a Source Data file.
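
The percentile scaling in panel b and the rank correlation in panel a can be sketched as below. This is a minimal illustration, not the authors' code: it assumes the 0.1 and 99.9 percentiles are mapped linearly to 0 and 1, and the function names and the percentile interpolation rule are choices made here for illustration.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def scale_by_percentiles(values, q_low=0.1, q_high=99.9):
    """Rescale so the q_low percentile maps to 0 and the q_high percentile to 1.

    Assumption: a plain linear rescaling; values outside the two
    percentiles land slightly below 0 or above 1.
    """
    low = percentile(values, q_low)
    high = percentile(values, q_high)
    if high == low:
        raise ValueError("percentiles coincide; cannot rescale")
    return [(v - low) / (high - low) for v in values]


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because Spearman's correlation depends only on ranks, it is unchanged by the monotone rescaling above, so the two steps can be applied in either order.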

**Fig. 3. Prediction performance of CoVFit for unknown, future variants.**
a Strategy for evaluating prediction performance on future variants. Model instances, referred to as CoVFit_Past, were trained on variant data prior to a specified cutoff date (e.g., January 31, 2022). Prediction performance for future variants was then assessed using data from variants that emerged after this date. b Number of sequences from each clade in the past datasets with specific cutoff dates. c Fitness predictions for future (gray) and past (light gray) variants in the dataset with a cutoff date of February 28, 2022. Points represent results for each genotype, calculated as average values across countries and five-fold predictions. A dashed line with a slope of 1 and an intercept of 0 is included. d Fitness predictions for future variants, with colors indicating Nextclade clade classifications. In addition to the dashed line with a slope of 1 and intercept 0, a gray estimated regression line, based on mean prediction values, is displayed. e Scatter plot based on (d) but colored according to the minimum amino acid distance from variants in the past data. f Predicted fitness of genotypes within each Nextclade clade. Each clade’s distribution (violin) and median value (dot) are shown. Individual panels display results for datasets with different cutoff dates. Clades present in the past data are separated by a dashed vertical line from those absent in the past data. Additionally, the median observed fitness value of each clade is represented by a heatmap on the left side. g Comparison of prediction performance metrics across methods, including Spearman’s correlation score, R-squared value, mean absolute error (MAE), and estimated regression slope. Source data are provided as a Source Data file.
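
The evaluation strategy in panel a, together with two of the metrics listed in panel g, can be sketched as below. This is a hedged illustration only: the record layout, the function names, the choice to count variants emerging on the cutoff date as "past", and the direction of the regression (predicted on observed) are assumptions, not details taken from the paper.

```python
from datetime import date


def split_by_cutoff(records, cutoff):
    """Split records into (past, future) by emergence date.

    Assumption: a variant emerging on the cutoff date itself counts as past.
    """
    past = [r for r in records if r["emerged"] <= cutoff]
    future = [r for r in records if r["emerged"] > cutoff]
    return past, future


def mean_absolute_error(observed, predicted):
    """Mean of |observed - predicted| over paired values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def regression_slope(observed, predicted):
    """Ordinary least-squares slope of predicted regressed on observed."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var = sum((o - mo) ** 2 for o in observed)
    return cov / var


# Hypothetical records; the names and dates are illustrative only.
records = [
    {"name": "variant_A", "emerged": date(2021, 11, 20)},
    {"name": "variant_B", "emerged": date(2022, 1, 31)},
    {"name": "variant_C", "emerged": date(2022, 6, 1)},
]
past, future = split_by_cutoff(records, date(2022, 1, 31))
```

A slope below 1 on the future set would indicate that predictions are compressed relative to the observed values even when the ranking is preserved, which is why the slope is reported alongside the rank correlation.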

**Fig. 4. Detection of fitness elevation events during Omicron diversification.**
a Scheme to detect phylogenetic branches with fitness elevation utilizing CoVFit models. b Inference of change in fitness through Omicron’s evolution. The maximum likelihood (ML) tree of Omicron lineages is shown. Branch color indicates an inferred fitness value for each phylogenetic node, including both observed and reconstructed ancestral genotypes of S proteins in the phylogenetic tree. c Detection of fitness elevation events during Omicron’s evolution. Dot color indicates inferred fitness gain in each branch, calculated as the difference in predicted fitness between a node and its parental node. d Mean fitness gain over a specific mutation during Omicron evolution. Since some mutations have been acquired multiple times, the mean value of fitness gain among acquisition events was used as the “fitness gain [per mutation]” score. The top 20 mutations regarding this score are shown with the protein domain information. e Enrichment of fitness-associated mutations in the RBD, particularly in its RBM. The negative score is clipped to 0. f Mapping the site-wise fitness gain score on the 3D structure of the ancestral D614G S protein (PDB: 7BNN). If multiple mutation types are present in a specific site, the maximum value is shown as the “fitness gain [per site]” score. Amino acid side chains for the top 15 sites regarding this score are shown as spheres. The plot was generated using ChimeraX. g Association of fitness gain rank with the mean mAb escape score. This escape score was calculated as the mean of the escape score across mAbs over a mutation. The ND group includes mutations not observed in our phylogenetic analysis. The categories 1–50, 51–100, 101–, and ND include 39, 24, 75, and 1964 entries, respectively. The box represents the interquartile range (IQR; 25th to 75th percentile), with the horizontal line indicating the median (50th percentile). The whiskers extend to the smallest and largest values within 1.5 times the IQR from the lower and upper quartiles, respectively. h Association of the fitness gain [per mutation] score with the inferred acquisition count. The estimated regression curve (line) with standard error (ribbon) by Poisson regression using all mutations is shown. In addition, Nagelkerke’s pseudo R² values for Poisson regression analyses using all mutations, RBD mutations, and non-RBD mutations are shown. Source data are provided as a Source Data file.
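
The scores defined in panels c, d, and f can be sketched as below: the gain on a branch is the child's predicted fitness minus the parent's, the per-mutation score is the mean gain over acquisition events, and the per-site score is the maximum over mutation types at that site. This is an illustration under simplifying assumptions, not the authors' pipeline: each event is assumed to carry a single mutation, and the event tuples, fitness values, and function names are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean


def gain_per_mutation(events):
    """Mean fitness gain per mutation over all of its acquisition events.

    events: iterable of (mutation, parent_fitness, child_fitness) tuples.
    """
    gains = defaultdict(list)
    for mutation, parent_fit, child_fit in events:
        gains[mutation].append(child_fit - parent_fit)
    return {m: mean(g) for m, g in gains.items()}


def gain_per_site(per_mutation):
    """Maximum per-mutation gain among the mutation types at each site."""
    per_site = {}
    for mutation, gain in per_mutation.items():
        # The site is taken as the first run of digits, e.g. 456 in "F456L".
        site = int(re.search(r"\d+", mutation).group())
        per_site[site] = max(gain, per_site.get(site, float("-inf")))
    return per_site


# Hypothetical acquisition events (illustrative numbers only).
events = [
    ("F456L", 1.00, 1.10),
    ("F456L", 1.20, 1.24),
    ("F456V", 1.00, 0.95),
    ("L455S", 1.10, 1.30),
]
per_mutation = gain_per_mutation(events)
per_site = gain_per_site(per_mutation)
```

Taking the mean across events keeps a convergently acquired mutation from being scored by a single favorable background, while taking the maximum across mutation types at a site highlights the best available substitution there.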

**Fig. 5. Context-specific effect of the F456L substitution.**
a Examples of convergent acquisitions of specific substitutions. A node indicates the acquisition events, and node color denotes fitness gain at the acquisition events. Branch color denotes the presence (gray) or absence (light gray) of specific substitutions in the reconstructed ancestral S protein sequences. b Fitness gain upon F456L in each backbone S protein sequence, inferred by in silico mutational scanning using CoVFit. Variants with available DMS data (shown in (d)) were included in this analysis. c Site-wise immune escape score for the ancestral D614G strain, BA.2, and XBB variants, estimated by the mAb escape estimator based on Cao’s DMS data. The top 5 sites regarding the escape score are annotated. d Effect of F456L on the S protein’s expression (stability) and ACE2-binding affinity, extracted from publicly available DMS data from Taylor and Starr. The dot color indicates inferred fitness gain shown in (b). Higher values indicate enhanced expression and ACE2-binding affinity. Source data are provided as a Source Data file.

**Fig. 6. CoVFit-based in silico DMS on the BA.2.86.1 lineage.**
a Association between the fitness gain [per site] score and the mutation frequency at each site in the BA.2.86.1 lineage. Points represent amino acid sites, while dashed lines indicate the 98th percentile (top 2%) for both the fitness gain score and mutation frequency. Statistical measures quantifying the degree of overlap between data points within the top 2% for these two metrics are shown. The p value was calculated using a two-sided Fisher’s exact test. b Temporal trend in mutation frequency at individual amino acid sites within the BA.2.86.1 population. The genome surveillance data from October 1, 2023, to July 31, 2024, was used. Frequencies were calculated using 7-day bins. c Temporal trends in viral lineage frequencies within the BA.2.86.1 population. Each viral lineage category includes its descendant lineages unless those descendant lineages are explicitly defined as separate categories. Mutations in the S protein relative to BA.2.86.1 are indicated, with emphasis on those with higher fitness gain [per site] scores. Source data are provided as a Source Data file.
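
The overlap test described in panel a can be sketched as below: sites are classified by whether they fall in the top 2% for each metric, and the resulting 2×2 table is tested with a two-sided Fisher's exact test. This is a minimal sketch, not the authors' code. It assumes the top 2% is taken as the k highest-scoring sites rather than by an interpolated percentile threshold, and it uses the common two-sided definition that sums the probabilities of all tables no more likely than the observed one. All function names are hypothetical.

```python
from math import comb


def top_fraction(scores, fraction=0.02):
    """Keys of the highest-scoring `fraction` of entries (at least one)."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def overlap_table(gain, freq, fraction=0.02):
    """2x2 counts (a, b, c, d) over sites present in both dicts.

    a: top in both, b: top in gain only, c: top in freq only, d: neither.
    """
    sites = set(gain) & set(freq)
    top_g = top_fraction({s: gain[s] for s in sites}, fraction)
    top_f = top_fraction({s: freq[s] for s in sites}, fraction)
    a = len(top_g & top_f)
    b = len(top_g - top_f)
    c = len(top_f - top_g)
    d = len(sites) - a - b - c
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at most as probable as the observed one;
    # the small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)
```

With only a handful of sites in each top-2% set, the expected cell counts are very small, which is the situation where an exact test is preferable to a chi-squared approximation.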
