Review
Pharmaceuticals (Basel). 2020 Sep 18;13(9):253.
doi: 10.3390/ph13090253.

Data-Driven Molecular Dynamics: A Multifaceted Challenge

Mattia Bernetti et al. Pharmaceuticals (Basel). 2020.

Abstract

The big data concept is currently revolutionizing several fields of science including drug discovery and development. While opening up new perspectives for better drug design and related strategies, big data analysis strongly challenges our current ability to manage and exploit an extraordinarily large and possibly diverse amount of information. The recent renewal of machine learning (ML)-based algorithms is key in providing the proper framework for addressing this issue. In this respect, the impact on the exploitation of molecular dynamics (MD) simulations, which have recently reached mainstream status in computational drug discovery, can be remarkable. Here, we review the recent progress in the use of ML methods coupled to biomolecular simulations with potentially relevant implications for drug design. Specifically, we show how different ML-based strategies can be applied to the outcome of MD simulations for gaining knowledge and enhancing sampling. Finally, we discuss how intrinsic limitations of MD in accurately modeling biomolecular systems can be alleviated by including information coming from experimental data.

Keywords: Markov state models; collective variables; dimensionality reduction; experimental data; machine learning; maximum entropy principle; reaction coordinates.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Pictorial representation of data exploitation in molecular dynamics (MD) simulations. Note that the source of data can be either computational (the very output of MD simulations, paths “1” and “2”) or experimental (path “3”). Path “1” refers to the use of machine learning (ML) methods for the conventional analysis step performed a posteriori once the MD data have been generated. Path “2” depicts a loop where ML methods enter during the simulations to inform subsequent MD runs (specifically consisting of simulation runs, data generation, and ML-based data analysis). This loop can be either discontinuous (MD/ML resampling) or seamless (on-the-fly MD/ML).
Figure 2
Pictorial representation of the unsupervised learning class of methods: cluster analysis (panel (A)) and dimensionality reduction (panel (B), principal component analysis (PCA) is displayed as a representative example).
Figure 3
Schematic representation of the difference between the Euclidean and geodesic distances (green solid and red dashed lines, respectively) evaluated on a curved manifold. The network-based nearest-neighbor approximation of the geodesic distance provided by Isomap is also shown (red solid lines).
Figure 4
Difference between the components extracted through PCA (panel (A)) and a generic independent component analysis (ICA) method (panel (B)). In specific cases, ICA provides a better description of the high-dimensional data structure, as the components it identifies are not constrained to be mutually orthogonal.
Figure 5
Schematic representation of an autoencoder. Based on the conformations sampled through the MD simulations (protein in blue ribbons with surrounding water molecules), a latent space can be learned (blue dots in the bottom plot) so as to best reproduce the original input data structure (blurred blue protein on the right). The latent space information can also be used to generate previously unexplored conformations (red dots in the bottom plot and red protein on the right).
Figure 6
Pictorial representation of the supervised learning class of methods: classification (panel (A), linear discriminant analysis (LDA) is displayed as an example) and regression (panel (B), linear regression is displayed as an example).
Figure 7
Using MD simulations in combination with experimental information. (A) Through a validation procedure, it is possible to estimate the agreement between computed quantities (the average observable s in the figure) and reference experimental data (s_exp). (B) Correcting the sampled data through a reweighting procedure can improve the agreement between predicted (MD trajectory) and measured (experiment) quantities. (C) By enforcing the experimental information on the fly, the sampled ensemble is restrained to best match the experimental one.
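
The reweighting idea in Figure 7, panel (B), can be illustrated with a small numerical sketch. This is not taken from the review: it assumes a single scalar observable, uniform initial frame weights, and an experimental value without error, and it uses the standard maximum entropy form w_i ∝ exp(-λ s_i), solving for λ by bisection. The function name `maxent_reweight` and the synthetic data are hypothetical.

```python
import numpy as np

def maxent_reweight(s, s_exp, tol=1e-10, max_iter=500):
    """Reweight frames so the weighted mean of `s` matches `s_exp`.

    Assumes uniform prior weights and a noise-free target (illustrative only).
    Returns (weights, lam), with weights proportional to exp(-lam * s).
    """
    s = np.asarray(s, dtype=float)
    if not (s.min() < s_exp < s.max()):
        raise ValueError("s_exp must lie strictly inside the sampled range")

    def weights(lam):
        logw = -lam * s
        logw -= logw.max()          # stabilise the exponentials
        w = np.exp(logw)
        return w / w.sum()

    def avg(lam):
        return float(weights(lam) @ s)

    # The weighted mean decreases monotonically with lam (its slope is minus
    # the weighted variance), so first widen a bracket, then bisect.
    lam_lo, lam_hi = -1.0, 1.0
    while avg(lam_lo) < s_exp:
        lam_lo *= 2.0
    while avg(lam_hi) > s_exp:
        lam_hi *= 2.0
    for _ in range(max_iter):
        lam = 0.5 * (lam_lo + lam_hi)
        if avg(lam) > s_exp:
            lam_lo = lam
        else:
            lam_hi = lam
        if lam_hi - lam_lo < tol:
            break
    lam = 0.5 * (lam_lo + lam_hi)
    return weights(lam), lam

# Synthetic "trajectory" of an observable whose mean is slightly off target.
rng = np.random.default_rng(0)
s_md = rng.normal(loc=1.0, scale=0.2, size=5000)
w, lam = maxent_reweight(s_md, s_exp=1.1)
n_eff = 1.0 / np.sum(w ** 2)        # Kish effective sample size
print(f"reweighted mean = {w @ s_md:.4f}, effective frames = {n_eff:.0f}")
```

The effective sample size printed at the end is a common diagnostic: the further the target lies from the simulated average, the fewer frames carry meaningful weight, which is one reason on-the-fly restraining (panel (C)) can be preferable when the discrepancy is large.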

