[Preprint]. 2025 Jul 15:2024.10.11.617911.
doi: 10.1101/2024.10.11.617911.

Learning Biophysical Dynamics with Protein Language Models


Chao Hou et al. bioRxiv.

Abstract

Structural dynamics are fundamental to protein function and mutation effects. Current protein deep learning models are predominantly trained on sequence and/or static structure data and often fail to capture the dynamic nature of proteins. To address this, we introduce SeqDance and ESMDance, two protein language models trained on dynamic biophysical properties derived from molecular dynamics simulations and normal mode analyses of over 64,000 proteins. SeqDance, trained from scratch, learns both local dynamic interactions and global conformational properties for ordered and disordered proteins. SeqDance-predicted changes in dynamic properties reflect mutation effects on protein folding stability. ESMDance, built upon ESM2 outputs, substantially outperforms ESM2 in zero-shot prediction of mutation effects for designed and viral proteins, which lack evolutionary information. Together, SeqDance and ESMDance offer a new framework for integrating protein dynamics into language models, enabling more generalizable predictions of protein behavior and mutation effects.

Keywords: molecular dynamics; mutation effects; normal mode analysis; protein language model.


Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Figure 1. Information flow in protein studies, representative protein language models, and model pre-training.
A. Illustration of the “sequence - structure ensemble - function - evolution” paradigm. Sequences are the basic elements of proteins, which fold into structural ensembles to perform functions; functionally important regions exhibit conserved patterns across homologs. B. Representative protein language models (pLMs) and their information sources. ESM1 and ESM2 were trained to predict masked residues and implicitly learned evolutionary information. ProSE was trained to predict masked residues, pairwise contacts in static structures, and structure similarity. METL was trained to predict biophysical terms calculated from static structures. SeqDance and ESMDance were trained on protein dynamic properties from molecular dynamics (MD) simulations, experimental data, and normal mode analysis (NMA) of static structures. C. Diagram of the pre-training process. SeqDance and ESMDance take a protein sequence as input and predict residue-level and pairwise dynamic properties, which are extracted from structure ensembles and NMA. Both models use a Transformer encoder architecture identical to ESM2-35M: 12 layers with 20 heads per layer and an embedding dimension of 480. Linear layers are applied to the residue embeddings to predict residue-level properties. For pairwise property prediction, pairwise embeddings constructed from residue embeddings are concatenated with attention maps, and a linear layer is then applied to predict pairwise properties.
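The pairwise-property head described in Figure 1C can be sketched in a few lines. The shapes below follow the caption (480-dim embeddings, 12 × 20 = 240 attention maps); the sequence length, number of predicted properties, and random weights are illustrative placeholders, not the trained model:

```python
import numpy as np

# Hypothetical sketch of the pairwise head in Figure 1C.
# 12 layers x 20 heads = 240 attention maps, embedding dimension 480.
L, D, H = 50, 480, 12 * 20          # sequence length, embed dim, total heads
rng = np.random.default_rng(0)

residue_emb = rng.standard_normal((L, D))     # per-residue embeddings
attn_maps = rng.standard_normal((H, L, L))    # stacked attention maps

# Pairwise embeddings: concatenate the embeddings of residues i and j.
pair_emb = np.concatenate(
    [np.repeat(residue_emb[:, None, :], L, axis=1),   # residue i, tiled over j
     np.repeat(residue_emb[None, :, :], L, axis=0)],  # residue j, tiled over i
    axis=-1)                                          # (L, L, 2*D)

# Concatenate with the attention maps, then apply a single linear layer
# to predict n_props pairwise dynamic properties per residue pair.
features = np.concatenate([pair_emb, np.transpose(attn_maps, (1, 2, 0))], axis=-1)
n_props = 10                                  # illustrative number of properties
W = rng.standard_normal((features.shape[-1], n_props)) * 0.01
pairwise_pred = features @ W                  # (L, L, n_props)
```

The feature vector for each residue pair thus has 2 × 480 + 240 = 1200 dimensions before the linear projection.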
Figure 2. SeqDance’s attention mechanism captures dynamic interactions in test sets.
For each attention head, the Spearman correlation was calculated between attention values and pairwise dynamic properties; the top five heads of SeqDance and ESM2-35M are shown. D-I show results for positively correlated residue pairs. Boxplots show the distribution of Spearman correlations on the test data: the box extends from the first quartile to the third quartile, with a line at the median, and the whiskers extend to the farthest data point within 1.5 times the interquartile range from the box.
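The per-head analysis described above reduces to a Spearman correlation between one head’s attention values and the matching pairwise dynamic property. A minimal self-contained version (the numbers below are toy data, not values from the paper):

```python
# Average ranks (1-based), with ties sharing the mean rank.
def rankdata(x):
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1                      # extend over the tie group
        avg = (i + j) / 2 + 1           # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

# Spearman correlation = Pearson correlation of the ranks.
def spearman(x, y):
    rx, ry = rankdata(x), rankdata(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

attention = [0.1, 0.4, 0.2, 0.9, 0.7]   # one head's attention for 5 pairs
dyn_prop = [0.0, 0.5, 0.1, 1.2, 0.8]    # matching pairwise dynamic property
print(round(spearman(attention, dyn_prop), 3))   # monotone relation -> 1.0
```

In the actual analysis this correlation would be computed per head over all residue pairs of each test protein, then summarized across proteins as in the boxplots.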
Figure 3. SeqDance’s embeddings encode global protein conformational properties.
Performance comparison of SeqDance (35M parameters), METL, ProSE, and ESM2 (the ESMDance embedding is identical to ESM2-35M) in predicting the normalized end-to-end distance of disordered proteins (A), the asphericity of disordered proteins (B), and the normalized radius of gyration (Rg) of disordered proteins (C) and ordered proteins (D). The training/test split was 6:4 with a 20% sequence identity cutoff. A linear regression model was trained to predict conformational properties from the first 200 principal components of mean-pooled embeddings from each method. Results shown are the distributions of test loss (mean squared error) across 20 independent repeats.
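The evaluation protocol in Figure 3 (PCA on mean-pooled embeddings, then linear regression to a conformational property) can be sketched as follows. All data here are synthetic stand-ins, and the dimensions are scaled down for illustration; the paper uses the first 200 principal components and real embeddings:

```python
import numpy as np

# Synthetic stand-in for mean-pooled embeddings and a target property.
rng = np.random.default_rng(1)
n_prot, emb_dim, n_pc = 300, 480, 20

emb = rng.standard_normal((n_prot, emb_dim))  # fake mean-pooled embeddings
emb[:, 0] *= 5.0                              # plant one high-variance direction
prop = 0.5 * emb[:, 0] + 0.1 * rng.standard_normal(n_prot)  # e.g. normalized Rg

# PCA via SVD of the centered embedding matrix; keep the top components.
centered = emb - emb.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = centered @ Vt[:n_pc].T                  # (n_prot, n_pc) PC scores

# 6:4 train/test split; ordinary least squares with an intercept column.
n_train = int(0.6 * n_prot)
X = np.hstack([pcs, np.ones((n_prot, 1))])
coef, *_ = np.linalg.lstsq(X[:n_train], prop[:n_train], rcond=None)
mse = np.mean((X[n_train:] @ coef - prop[n_train:]) ** 2)  # test MSE
```

In the paper the split additionally enforces a 20% sequence identity cutoff between train and test, and the whole procedure is repeated 20 times to obtain the loss distributions shown.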
Figure 4. Zero-shot prediction of mutation effects on protein folding stability.
A. Framework for using SeqDance (35M parameters) or ESMDance (35M parameters) for zero-shot prediction of mutation effects. B. Distribution of zero-shot performance (Spearman correlation) on 412 proteins for different dynamic properties; the random model (grey) is the randomly initialized model prior to pre-training. SASA mean, std: mean and standard deviation of solvent-accessible surface area. NMA properties 1, 2, 3: properties calculated from low-, medium-, and high-frequency normal modes. vdw: van der Waals interactions; hbbb: backbone-to-backbone hydrogen bonds; hbsb: side-chain-to-backbone hydrogen bonds; hbss: side-chain-to-side-chain hydrogen bonds; hp: hydrophobic interactions; sb: salt bridges; pc: Pi-cation interactions; ps: Pi-stacking interactions; ts: T-stacking interactions. C. SeqDance’s performance split by the most similar protein in the SeqDance training set, using cutoffs of 95% sequence identity with 95% coverage, 50% sequence identity with 80% coverage, and 20% sequence identity with 50% coverage. D-F. Relationship between zero-shot performance and the number of homologs (20% sequence identity, 50% coverage) in the training set. The line is a second-order polynomial regression, with the shaded area indicating the 95% confidence interval. The x-axis shows the log-scaled number of homologs, with small random noise added to the x values to reduce overlap.
Figure 5. Zero-shot prediction of mutation effects for designed and viral proteins.
A-D. Performance comparison between ESM2, SeqDance (35M parameters), and ESMDance (35M parameters) for 135 designed proteins with no homologs in UniRef50 or the SeqDance training set. Two highlighted proteins are analyzed further in E-F. E-F. Structures and the relationship between predictions and ΔΔG values for two designed proteins. The kernel density estimate plots show the distributions of experimentally measured folding ΔΔG values and zero-shot prediction values for the three methods; the line is a linear regression, with the shaded area indicating the 95% confidence interval. G. Zero-shot performance of ESM2, SeqDance, and ESMDance on 23 viral proteins shorter than 1024 residues in ProteinGym. H. Zero-shot Spearman correlations for the stability mutational scan of porcine sapovirus VPg (POLG_PESV_Tsuboyama_2023_2MXD in ProteinGym). The structure shown is based on ten random frames from mdCATH: 2mxdA00.

