Nat Commun. 2025 May 13;16(1):4236.
doi: 10.1038/s41467-025-59422-w.

A protein language model for exploring viral fitness landscapes

Jumpei Ito et al. Nat Commun. 2025.

Abstract

Successively emerging SARS-CoV-2 variants lead to repeated epidemic surges through escalated fitness (i.e., relative effective reproduction number between variants). Modeling the genotype-fitness relationship enables us to pinpoint the mutations boosting viral fitness and flag high-risk variants immediately after their detection. Here, we present CoVFit, a protein language model adapted from ESM-2, designed to predict variant fitness based solely on spike protein sequences. CoVFit was trained on genotype-fitness data derived from viral genome surveillance and functional mutation assays related to immune evasion. CoVFit successfully ranked the fitness of unknown future variants harboring nearly 15 mutations with informative accuracy. CoVFit identified 959 fitness elevation events throughout SARS-CoV-2 evolution until late 2023. Furthermore, we show that CoVFit is applicable for predicting viral evolution through single amino acid mutations. Our study gives insight into the SARS-CoV-2 fitness landscape and provides a tool for efficiently identifying SARS-CoV-2 variants with higher epidemic risk.

Conflict of interest statement

Competing interests: J.I. has consulting fees and honoraria for lectures from Takeda Pharmaceutical Co. Ltd. Spyros Lytras has consulting fees from EcoHealth Alliance. K.S. has consulting fees from Moderna Japan Co., Ltd and Takeda Pharmaceutical Co. Ltd, and honoraria for lectures from Gilead Sciences, Inc., Moderna Japan Co., Ltd, and Shionogi & Co., Ltd. The other authors declare no competing interests.

Figures

Fig. 1. Overview of CoVFit.
a Conceptual framework of CoVFit. CoVFit is a protein language model designed to predict the fitness (relative Re) of SARS-CoV-2 variants based on their S protein sequences. b Outline of the training process used to develop CoVFit model instances.
Fig. 2. Prediction performance of CoVFit.
a Spearman’s correlation scores for predicted relative fitness values and mAb neutralization escape scores. Scores from five cross-validation folds are shown as dots, with the mean represented by a bar and the standard deviation by an error bar. The correlation for mAbs was calculated within each epitope group. b Scatter plot for fitness prediction, aggregating results from five-fold cross-validation. Each dot denotes the result for one viral genotype in a specific country and is colored by its Nextclade clade. The relative fitness value was scaled so that the 0.1 percentile and 99.9 percentile points fall between 0 and 1. A dashed line with a slope of 1 and an intercept of 0 is shown. c Scatter plot inherited from (b) but colored by the emergence date of each genotype. Source data are provided as a Source Data file.
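The two quantities named in this caption, Spearman's correlation and percentile-based scaling, can be illustrated with a minimal standard-library sketch. It is not the authors' code. The function names are hypothetical, and the scaling assumes one reading of the caption: the 0.1th percentile is mapped to 0 and the 99.9th percentile to 1, without clipping.

```python
from statistics import mean

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def scale_by_percentiles(values, q_low=0.1, q_high=99.9):
    """Assumed reading: map the q_low percentile to 0 and q_high to 1 (no clipping)."""
    lo, hi = percentile(values, q_low), percentile(values, q_high)
    return [(v - lo) / (hi - lo) for v in values]

def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the two rank vectors.

    Constant inputs (zero rank variance) are not handled in this sketch.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Because Spearman's correlation depends only on ranks, applying the monotonic scaling above to either variable leaves the correlation unchanged.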
Fig. 3. Prediction performance of CoVFit for unknown, future variants.
a Strategy for evaluating prediction performance on future variants. Model instances, referred to as CoVFitPast, were trained on variant data prior to a specified cutoff date (e.g., January 31, 2022). Prediction performance for future variants was then assessed using data from variants that emerged after this date. b Number of sequences from each clade in the past datasets with specific cutoff dates. c Fitness predictions for future (gray) and past (light gray) variants in the dataset with a cutoff date of February 28, 2022. Points represent results for each genotype, calculated as average values across countries and five-fold predictions. A dashed line with a slope of 1 and an intercept of 0 is included. d Fitness predictions for future variants, with colors indicating Nextclade clade classifications. In addition to the dashed line with a slope of 1 and intercept 0, a gray estimated regression line, based on mean prediction values, is displayed. e Scatter plot based on (d) but colored according to the minimum amino acid distance from variants in the past data. f Predicted fitness of genotypes within each Nextclade clade. Each clade’s distribution (violin) and median value (dot) are shown. Individual panels display results for datasets with different cutoff dates. Clades present in the past data are separated by a dashed vertical line from those absent in the past data. Additionally, the median observed fitness value of each clade is represented by a heatmap on the left side. g Comparison of prediction performance metrics across methods, including Spearman’s correlation score, R-squared value, mean absolute error (MAE), and estimated regression slope. Source data are provided as a Source Data file.
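The evaluation strategy in panel a and the metrics in panel g can be sketched as follows. This is a minimal illustration, not the authors' pipeline. The record layout `(emergence_date, observed, predicted)` and all function names are hypothetical. Treating the cutoff date as inclusive for the past set, using the coefficient of determination for R-squared, and regressing predicted on observed values for the slope are assumptions.

```python
from datetime import date

def split_by_cutoff(records, cutoff):
    """Split (emergence_date, observed, predicted) tuples into past and future sets.

    Assumption: genotypes that emerged on the cutoff date count as 'past'.
    """
    past = [r for r in records if r[0] <= cutoff]
    future = [r for r in records if r[0] > cutoff]
    return past, future

def mae(observed, predicted):
    """Mean absolute error between observed and predicted fitness."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def ols_slope(x, y):
    """Slope of the least-squares line of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Toy records, invented for illustration only.
records = [
    (date(2022, 1, 15), 0.0, 0.0),
    (date(2022, 3, 1), 1.0, 0.5),
    (date(2022, 4, 1), 2.0, 1.0),
    (date(2022, 5, 1), 0.0, 0.0),
]
past, future = split_by_cutoff(records, date(2022, 1, 31))
obs = [r[1] for r in future]
pred = [r[2] for r in future]
print(mae(obs, pred), r_squared(obs, pred), ols_slope(obs, pred))
```

A slope below 1, as in the toy data, would indicate that predictions for future variants are compressed relative to the observed values even when their ranking is preserved.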
Fig. 4. Detection of fitness elevation events during Omicron diversification.
a Scheme to detect phylogenetic branches with fitness elevation utilizing CoVFit models. b Inference of change in fitness through Omicron’s evolution. The maximum likelihood (ML) tree of Omicron lineages is shown. Branch color indicates an inferred fitness value for each phylogenetic node, including both observed and reconstructed ancestral genotypes of S proteins in the phylogenetic tree. c Detection of fitness elevation events during Omicron’s evolution. Dot color indicates inferred fitness gain in each branch, calculated as the difference in predicted fitness between a node and its parental node. d Mean fitness gain for specific mutations during Omicron evolution. Since some mutations have been acquired multiple times, the mean fitness gain across acquisition events was used as the “fitness gain [per mutation]” score. The top 20 mutations by this score are shown with the protein domain information. e Enrichment of fitness-associated mutations in the RBD, particularly in its RBM. Negative scores are clipped to 0. f Mapping of the site-wise fitness gain score on the 3D structure of the ancestral D614G S protein (PDB: 7BNN). If multiple mutation types are present at a specific site, the maximum value is shown as the “fitness gain [per site]” score. Amino acid side chains for the top 15 sites by this score are shown as spheres. The plot was generated using ChimeraX. g Association of fitness gain rank with the mean mAb escape score. This escape score was calculated for each mutation as the mean of the escape scores across mAbs. The ND group includes mutations not observed in our phylogenetic analysis. The categories 1–50, 51–100, 101–, and ND include 39, 24, 75, and 1964 entries, respectively. The box represents the interquartile range (IQR; 25th to 75th percentile), with the horizontal line indicating the median (50th percentile). The whiskers extend to the smallest and largest values within 1.5 times the IQR from the lower and upper quartiles, respectively. h Association of the fitness gain [per mutation] score with the inferred acquisition count. The estimated regression curve (line) with standard error (ribbon) by Poisson regression using all mutations is shown. In addition, Nagelkerke’s pseudo R² values for Poisson regression analyses using all mutations, RBD mutations, and non-RBD mutations are shown. Source data are provided as a Source Data file.
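The scores defined in panels c, d, and f can be written down as a small sketch: branch-wise gain as the difference between a node and its parent, a per-mutation mean over acquisition events, and a per-site maximum over mutation types. The tree encoding, the names, and the toy values below are hypothetical. Attributing the whole branch gain to every substitution on that branch is an assumption, and only simple substitutions such as `F456L` are parsed.

```python
import re
from collections import defaultdict

def branch_gains(parent_of, predicted):
    """Fitness gain on each branch: predicted fitness of a node minus its parent's."""
    return {node: predicted[node] - predicted[parent]
            for node, parent in parent_of.items()}

def gain_per_mutation(gains, branch_mutations):
    """Mean gain over all acquisition events (branches) of each mutation."""
    events = defaultdict(list)
    for node, mutations in branch_mutations.items():
        for mutation in mutations:
            events[mutation].append(gains[node])
    return {m: sum(g) / len(g) for m, g in events.items()}

def gain_per_site(per_mutation):
    """Maximum per-mutation score among the mutation types seen at each site."""
    per_site = {}
    for mutation, score in per_mutation.items():
        site = int(re.fullmatch(r"[A-Z](\d+)[A-Z]", mutation).group(1))
        per_site[site] = max(score, per_site.get(site, score))
    return per_site

# Toy tree (child -> parent) with invented predicted fitness values.
parent_of = {"B": "A", "C": "A", "D": "B"}
predicted = {"A": 0.2, "B": 0.5, "C": 0.3, "D": 0.4}
branch_mutations = {"B": ["F456L"], "C": ["F456L"], "D": ["F456V"]}

gains = branch_gains(parent_of, predicted)
per_mutation = gain_per_mutation(gains, branch_mutations)
per_site = gain_per_site(per_mutation)
print(gains, per_mutation, per_site)
```

In the toy tree, F456L is acquired twice (branches to B and C), so its score is the mean of the two branch gains, while site 456 takes the larger of the F456L and F456V scores.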
Fig. 5. Context-specific effect of the F456L substitution.
a Examples of convergent acquisitions of specific substitutions. A node indicates an acquisition event, and node color denotes the fitness gain at that event. Branch color denotes the presence (gray) or absence (light gray) of specific substitutions in the reconstructed ancestral S protein sequences. b Fitness gain upon F456L in each backbone S protein sequence, inferred by in silico mutational scanning using CoVFit. Variants with available DMS data (shown in (d)) were included in this analysis. c Site-wise immune escape score for the ancestral D614G strain, BA.2, and XBB variants, estimated by the mAb escape estimator based on Cao’s DMS data. The top 5 sites by escape score are annotated. d Effect of F456L on the S protein’s expression (stability) and ACE2-binding affinity, extracted from publicly available DMS data from Taylor and Starr. The dot color indicates the inferred fitness gain shown in (b). Higher values indicate enhanced expression and ACE2-binding affinity. Source data are provided as a Source Data file.
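The in silico mutational scanning in panel b amounts to scoring a backbone sequence with and without a substitution and taking the difference. The sketch below shows that step under stated assumptions: `apply_substitution` and `fitness_gain` are hypothetical helpers, and `toy_predict` is an invented stand-in for a trained predictor, included only so the example runs. It is not CoVFit and has no biological meaning.

```python
import re

def apply_substitution(sequence, mutation):
    """Apply a substitution written as <ref><1-based position><alt>, e.g. 'F456L'."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unsupported mutation format: {mutation}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if pos > len(sequence) or sequence[pos - 1] != ref:
        raise ValueError(f"{mutation} does not match the backbone sequence")
    return sequence[:pos - 1] + alt + sequence[pos:]

def fitness_gain(predict, backbone, mutation):
    """Predicted fitness of the mutant minus that of the unmutated backbone."""
    return predict(apply_substitution(backbone, mutation)) - predict(backbone)

def toy_predict(sequence):
    """Invented scorer: rewards 'L' at position 2 only when position 1 is 'K'."""
    return 1.0 if sequence[0] == "K" and sequence[1] == "L" else 0.0

# The same substitution gives different gains on different backbones.
print(fitness_gain(toy_predict, "KFC", "F2L"))
print(fitness_gain(toy_predict, "AFC", "F2L"))
```

The toy scorer makes the gain depend on a second position, which is the kind of backbone dependence the caption refers to as a context-specific effect.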
Fig. 6. CoVFit-based in silico DMS on the BA.2.86.1 lineage.
a Association between the fitness gain [per site] score and the mutation frequency at each site in the BA.2.86.1 lineage. Points represent amino acid sites, while dashed lines indicate the 98th percentile (top 2%) for both the fitness gain score and mutation frequency. Statistical measures quantifying the degree of overlap between data points within the top 2% for these two metrics are shown. The p value was calculated using a two-sided Fisher’s exact test. b Temporal trend in mutation frequency at individual amino acid sites within the BA.2.86.1 population. Genome surveillance data from October 1, 2023, to July 31, 2024, were used. Frequencies were calculated using 7-day bins. c Temporal trends in viral lineage frequencies within the BA.2.86.1 population. Each viral lineage category includes its descendant lineages unless those descendant lineages are explicitly defined as separate categories. Mutations in the S protein relative to BA.2.86.1 are indicated, with emphasis on those with higher fitness gain [per site] scores. Source data are provided as a Source Data file.
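The overlap analysis in panel a can be sketched as a 2×2 contingency table of sites (top 2% by fitness gain versus top 2% by mutation frequency) followed by a two-sided Fisher's exact test. The version below uses only the standard library and hypothetical function names. The rank-based top-2% rule and the two-sided convention (summing all tables no more probable than the observed one) are assumptions, not details confirmed by the source.

```python
from math import comb

def top_fraction_flags(values, fraction=0.02):
    """Flag values in the top `fraction` by rank (ties at the threshold are all flagged)."""
    k = max(1, round(len(values) * fraction))
    threshold = sorted(values, reverse=True)[k - 1]
    return [v >= threshold for v in values]

def overlap_table(flags_x, flags_y):
    """2x2 counts (both, x only, y only, neither) from two boolean lists."""
    a = sum(1 for x, y in zip(flags_x, flags_y) if x and y)
    b = sum(1 for x, y in zip(flags_x, flags_y) if x and not y)
    c = sum(1 for x, y in zip(flags_x, flags_y) if not x and y)
    d = sum(1 for x, y in zip(flags_x, flags_y) if not x and not y)
    return a, b, c, d

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

# Classic small example table [[3, 1], [1, 3]].
print(fisher_exact_two_sided(3, 1, 1, 3))
```

With per-site score and frequency lists in hand, the three functions chain as `fisher_exact_two_sided(*overlap_table(top_fraction_flags(scores), top_fraction_flags(freqs)))`.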
