Nat Commun. 2025 Oct 13;16(1):9074.
doi: 10.1038/s41467-025-64103-9.

ProFlex as a linguistic bridge for decoding protein dynamics in normal mode analysis

Damian J Magill et al. Nat Commun.

Abstract

Artificial intelligence is revolutionizing structural bioinformatics, with AlphaFold arguably being the most impactful development to date. The structural atlases generated by these methods present significant opportunities for unraveling biological mysteries but also pose challenges in leveraging such massive datasets effectively. In this work, we explore the dynamic landscape of hundreds of thousands of AlphaFold-predicted structures using normal mode analysis. The resulting data serve to empirically define an alphabet summarizing relative protein flexibility, termed ProFlex. Leveraging ProFlex, we describe the flexibility information space occupied by this massive dataset. We believe leveraging the data compression offered by ProFlex-like approaches opens opportunities for understanding protein function, refining structural predictions, and rendering analyses computationally tractable.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Distribution of raw and scaled RMSF values with empirically defined ProFlex embeddings.
The distribution of min, max, mean, and median absolute RMSF values for the raw NMA dataset on a logarithmic scale (left). The distribution of min, max, mean, and median min/max scaled RMSF values for the NMA dataset (middle). Plot of the scaled RMSF empirically defined bin ranges represented in order by each of the ProFlex letters (right). Range bars represent the delineation of the scaled RMSF values for each of the ProFlex bins.
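
The scaling and binning summarized in this caption can be illustrated with a minimal sketch. It assumes that each RMSF profile is min/max scaled to the range 0 to 1 and that bin edges are taken as equal-count percentiles of the values passed in. The five-letter alphabet, the function names, and the edge rule are all hypothetical choices for illustration and are not taken from the paper.

```python
from bisect import bisect_right


def min_max_scale(rmsf):
    """Scale one RMSF profile to the range [0, 1]."""
    lo, hi = min(rmsf), max(rmsf)
    if hi == lo:
        # A flat profile carries no relative flexibility information.
        return [0.0 for _ in rmsf]
    return [(v - lo) / (hi - lo) for v in rmsf]


def percentile_edges(values, n_bins):
    """Interior bin edges giving roughly equal-count bins (assumed rule)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def encode(scaled, edges, alphabet):
    """Map each scaled value to the letter of the bin it falls in."""
    if len(alphabet) != len(edges) + 1:
        raise ValueError("alphabet needs one letter per bin")
    return "".join(alphabet[bisect_right(edges, v)] for v in scaled)


# Hypothetical toy profile and alphabet, for illustration only.
rmsf = [0.5, 1.0, 1.5, 2.0, 2.5]
scaled = min_max_scale(rmsf)
edges = percentile_edges(scaled, n_bins=5)
sequence = encode(scaled, edges, "abcde")
```

Passing the pooled values of many proteins to `percentile_edges` gives one shared set of edges, while passing a single protein's values gives edges specific to that sequence.
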
Fig. 2. Comparison of equal, global, and sequence-specific binning approaches to ProFlex alphabet determination.
ProFlex alphabets were back-translated to scaled RMSF values using mid-point percentile values and compared with the original values using two-sided Wilcoxon tests, followed by log transformation and application of an empirical distribution function (left). Examples of RMSF curves for original and back-translated values are given for each binning approach (bottom). Significant differences are observed across the entire dataset when a global binning approach is employed, highlighting that this is an ineffective means of representing the flexibility data.
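
As a rough illustration of back-translation, the sketch below maps each letter to the midpoint of its bin and measures how far the reconstructed values lie from the originals. The midpoint rule is one reading of "mid-point percentile values" and is an assumption. The four-letter alphabet and the bin edges are hypothetical, and a mean absolute error stands in for the Wilcoxon test used in the paper.

```python
def bin_midpoints(edges):
    """Midpoint of every bin on the [0, 1] scale, given interior edges."""
    bounds = [0.0] + list(edges) + [1.0]
    return [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]


def decode(sequence, edges, alphabet):
    """Back-translate letters to the midpoint of their bin (assumed rule)."""
    mids = bin_midpoints(edges)
    return [mids[alphabet.index(letter)] for letter in sequence]


def mean_abs_error(original, reconstructed):
    """Average absolute gap between original and back-translated values."""
    gaps = [abs(o - r) for o, r in zip(original, reconstructed)]
    return sum(gaps) / len(gaps)


# Hypothetical edges and alphabet, for illustration only.
edges = [0.25, 0.5, 0.75]
alphabet = "abcd"
reconstructed = decode("adb", edges, alphabet)
error = mean_abs_error([0.0, 1.0, 0.25], reconstructed)
```

With bins this coarse the reconstruction error is bounded by half a bin width, so a finer alphabet would track the original curve more closely.
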
Fig. 3. ProFlex robustness analysis employing train-test approach.
The ProFlex alphabet percentile bins were recalculated using random subsets of the dataset ranging from 10% to 90%, and the result was used to define the alphabet for the remaining unused portion of the dataset. These sequences were compared with those of the native dataset and the percentage similarity was calculated. Graphs show the proportion of sequences with a given level of similarity (from 100% to 0%), with the standard deviation across the 5 replicates conducted for each percentage presented as error bars (a). At lower percentages more variation is observed but a good representation is still captured, whereas at higher percentages the graphs show markedly less variation. The overlapping distribution of the percentiles defined across the training sets highlights a high level of similarity between them (b). Indeed, an almost perfect superimposition is observed at all levels, highlighting again that ProFlex is highly representative of the dataset.
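
The similarity comparison described here can be sketched as a position-wise identity between the sequence obtained with subset-derived bins and the one obtained with the full-dataset bins. The caption does not say how similarity was computed, so position-wise identity is an assumption, and both function names are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions carrying the same letter in two sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 100.0
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def proportion_at_least(similarities, threshold):
    """Fraction of sequences whose similarity reaches the threshold."""
    return sum(s >= threshold for s in similarities) / len(similarities)


# Hypothetical sequences: one encoded with subset-derived bins,
# the other with bins from the full dataset.
similarity = percent_identity("abcd", "abcf")
share = proportion_at_least([100.0, 75.0, 50.0, 100.0], 75.0)
```

Both encodings of a protein have one letter per residue, so no alignment step is needed before counting matches.
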
Fig. 4. ProFlex percentiles compared to those determined using ANM and SDENM forcefields.
ProFlex percentile bins were calculated using C-alpha, ANM, and SDENM forcefields for the entire SWISS-PROT AF dataset using a global dynamic binning approach. Lines show the distribution of the bins across the min-max scaled RMSF values. Points on the lines indicate the bin boundaries. ANM and C-alpha show similar distributions, with some variation observed in the elbow of the SDENM graph, which would translate to ProFlex sequence variations for these flexibility ranges. This highlights the need for consistency in the approach used for modeling and simulation.
Fig. 5. Global trends in the dataset with increasing sequence size.
Plot of average RMSF value against sequence size with example structures for the least and most flexible representatives (a), distribution of secondary structure elements with increasing sequence size, with a coolwarm color scheme corresponding to the proportion of secondary structure on a scale from 0–1 (b), and distribution of pLDDT score ranges (min/max), interquartile range, and mean value with respect to ProFlex letter, showing a clear trend with letters representing higher flexibility (c). We observe a notable increase in coiled regions as sequence size increases, with the exception of the smallest sequences, likely highlighting increases in disordered domains and linker regions.
Fig. 6. ProFlex variation across similar protein folds.
Percentage identity distributions for ten randomly selected 3Di clusters and their ProFlex counterparts, highlighting a much greater variation in the latter. Boxplots show the mean, interquartile range, and minimum and maximum percentages for each cluster. Example of a ProFlex sequence alignment for cluster 187181, colored sequentially by ProFlex letter using a coolwarm scheme where dark blue represents the least flexible letter (a) and dark red the most flexible (Z), to better highlight the nature of mismatches. Subtle differences in color indicate substitution of one letter by a closely related one, revealing a significant contributor to the increased number of mismatches observed.
Fig. 7. ProFlex K-mer based analysis of flexibility peaks.
The number of sequences was plotted against the number of flexibility peaks detected across different peak flexibility thresholds, ranging from mean and median ProFlex flexibility letters to larger flexibility peaks representing 25% and 50% of the total flexibility of the protein (top). The number of peaks was correlated with increasing sequence size for each of the flexibility thresholds defined (bottom). Clear trends are observed, with a reduction in large-scale motions as sequence size increases, highlighting greater incorporation of structures into defined domains.
Fig. 8. Global comparisons of amino acid, secondary structure, and ProFlex alphabets.
Occupancy of ProFlex states divided by amino acid groups (a), secondary structure content divided by amino acids (b), and proportion of ProFlex states found within specific secondary structure elements (c). All figures use a coolwarm color scheme with lowest and highest proportions represented by dark blue and dark red, respectively.
Fig. 9. Phylogenetic and similarity analysis of selected Tevenvirinae major capsid protein sequences and structures.
Similarity matrices for all alphabets were generated on the basis of all-vs-all Needleman-Wunsch global alignment scores, with the exception of structural comparisons (PDB), which are interpreted from inverted TM-Align scores. Matrices are represented as heatmaps with higher similarity scores represented by darker colors. All heatmaps share the same topology, with phage labels provided for the amino acid matrix on the bottom left. Phylogenetic trees were constructed on the basis of amino acid alignment using the neighbor-joining method. The Mantel correlation analysis provided for each of the datasets (top right) highlights the similarities in the various distance matrices generated, including a combination of structural and ProFlex information.
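
A Mantel correlation between two distance matrices can be sketched as the Pearson correlation of their upper triangles, with significance estimated by jointly permuting the rows and columns of one matrix. This is a generic textbook version written for illustration. The permutation count, the one-sided p-value, and all function names are assumptions and do not describe the implementation used in the paper.

```python
import random
from itertools import combinations


def upper_triangle(matrix):
    """Off-diagonal entries above the diagonal, in row-major order."""
    n = len(matrix)
    return [matrix[i][j] for i, j in combinations(range(n), 2)]


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mantel(d1, d2, permutations=999, seed=0):
    """Return (correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(d1)
    target = upper_triangle(d2)
    observed = pearson(upper_triangle(d1), target)
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of d1 together to keep it symmetric.
        shuffled = [[d1[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(shuffled), target) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical symmetric distance matrix compared with itself.
toy = [
    [0.0, 1.0, 2.0, 4.0],
    [1.0, 0.0, 3.0, 5.0],
    [2.0, 3.0, 0.0, 6.0],
    [4.0, 5.0, 6.0, 0.0],
]
r, p = mantel(toy, toy, permutations=99)
```

Permuting whole rows and columns, rather than individual entries, preserves the dependence between distances that share a taxon, which is the reason a Mantel test is used instead of a plain correlation test.
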
Fig. 10. Predicted vs actual RMSF values from deep learning models.
Distributions are provided for mean, median, minimum, and maximum RMSF values for test set predictions using individual deep learning models trained on amino acid and ProFlex alphabet derived features. Significant relationships were observed in all cases with the exception of maximum RMSF values, where the diversity of values proved problematic to predict.
