Review

Differentiable biology: using deep learning for biophysics-based and data-driven modeling of molecular mechanisms

Mohammed AlQuraishi et al. Nat Methods. 2021 Oct;18(10):1169-1180. doi: 10.1038/s41592-021-01283-4. Epub 2021 Oct 4.

Abstract

Deep learning using neural networks relies on a class of machine-learnable models constructed using 'differentiable programs'. These programs can combine mathematical equations specific to a particular domain of natural science with general-purpose, machine-learnable components trained on experimental data. Such programs are having a growing impact on molecular and cellular biology. In this Perspective, we describe an emerging 'differentiable biology' in which phenomena ranging from the small and specific (for example, one experimental assay) to the broad and complex (for example, protein folding) can be modeled effectively and efficiently, often by exploiting knowledge about basic natural phenomena to overcome the limitations of sparse, incomplete and noisy data. By distilling differentiable biology into a small set of conceptual primitives and illustrative vignettes, we show how it can help to address long-standing challenges in integrating multimodal data from diverse experiments across biological scales. This promises to benefit fields as diverse as biophysics and functional genomics.
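
To make the central idea concrete, the following minimal sketch (illustrative only, not taken from the paper) shows a differentiable program in miniature: a textbook Michaelis-Menten rate equation written as ordinary PyTorch code, with its two parameters fit to toy data by automatic differentiation. The equation, data values, and optimizer settings are all assumptions for illustration; the same mechanics allow domain equations to be freely combined with neural network components.

    import torch

    # A differentiable program in miniature: a domain equation (Michaelis-Menten
    # enzyme kinetics) written as ordinary code, with its parameters fit to toy
    # data by automatic differentiation. Parameters are learned in log space to
    # keep them positive.
    log_vmax = torch.zeros((), requires_grad=True)
    log_km = torch.zeros((), requires_grad=True)

    S = torch.tensor([0.1, 0.5, 1.0, 2.0, 5.0])           # toy substrate concentrations
    v_obs = torch.tensor([0.18, 0.55, 0.72, 0.86, 0.95])  # toy measured reaction rates

    opt = torch.optim.Adam([log_vmax, log_km], lr=0.05)
    for step in range(500):
        v_pred = torch.exp(log_vmax) * S / (torch.exp(log_km) + S)  # the "physics"
        loss = ((v_pred - v_obs) ** 2).mean()                       # the "data" term
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(torch.exp(log_vmax).item(), torch.exp(log_km).item())  # fitted Vmax, Km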


Figures

Figure 1: Deep Learning Revolution.
Improvements in prediction accuracy driven by deep learning over the last decade in (a) image recognition tasks, (b) speech recognition, (c) quantum chemical calculations, and (d) protein structure prediction. Human baselines based on expert curators are shown as dashed blue lines.
Figure 2: Neural Network Primitives.
A powerful set of neural network building blocks makes it possible to build learnable models that encode a variety of inductive priors. Convolutional networks model regular grids such as images or sequences, inducing local structure and limited forms of spatial invariance such as indifference to shifts in images. They are generalized by group-equivariant networks, which operate on arbitrary point clouds and induce local and global structure as well as more general spatial invariances, including rotational and translational shifts, that are important in molecular applications. Recurrent networks model sequences with repeating dynamics, such as time series, music, or the actions of a computational agent. Relational or graph networks reflect highly structured objects with rich interrelationships, such as phylogenetic trees. Attention networks, on the other hand, assume essentially no underlying structure and are capable of inferring arbitrarily complex relationships, including long-range interactions that have historically been difficult to capture with conventional mathematical models. This ability has been crucial to the development of accurate methods for protein structure prediction. These primitives can be combined to yield still more expressive architectures, for example group-equivariant attention networks.
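
As an illustration of combining primitives (a hypothetical sketch, not an architecture from the paper), the PyTorch module below composes a 1D convolution, which induces local structure along a sequence, with multi-head self-attention, which can capture arbitrary long-range relationships; all layer sizes are placeholder choices.

    import torch
    import torch.nn as nn

    class ConvAttentionBlock(nn.Module):
        """Composes two of the primitives above: a 1D convolution (local
        structure along a sequence) followed by multi-head self-attention
        (arbitrary long-range relationships)."""

        def __init__(self, channels: int = 64, kernel_size: int = 5, heads: int = 4):
            super().__init__()
            # Convolution induces locality; padding keeps the sequence length fixed.
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)
            # Attention lets every position attend to every other position.
            self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
            self.norm = nn.LayerNorm(channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, length, channels), e.g. an embedded protein sequence.
            h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local features
            attn_out, _ = self.attn(h, h, h)                  # long-range features
            return self.norm(h + attn_out)                    # residual + norm

    # Toy usage: a batch of 2 sequences of length 100 with 64 channels.
    x = torch.randn(2, 100, 64)
    print(ConvAttentionBlock()(x).shape)  # torch.Size([2, 100, 64])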
Figure 3: Differentiable Programming Fuses Principles-based and Data-driven Modeling.
(a) Three types of primitives underlie the emerging field of differentiable biology: (i) biological pattern recognizers that perform mappings too complex to be interpretable, such as predicting the DNA binding motif of a transcription factor from its structure; (ii) phenomenological priors that encode existing biological knowledge, such as known signaling pathways; and (iii) data priors that capture the data acquisition process, for example the physical process underlying mass spectrometry. (b) In conventional modeling, principles-based and data-driven approaches are used largely independently. Differentiable programming makes it possible to build bespoke systems that intermingle the two types of approaches in a manner that best reflects the desired modeling task.
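
A minimal sketch of primitive types (i) and (ii) working together, under assumed names and constants: a small neural network (the pattern recognizer) predicts a binding free energy, and a fixed thermodynamic relation (the principles-based prior) converts it into an observable dissociation constant on which the loss is computed. Gradients flow through the fixed equation back into the learnable component.

    import torch
    import torch.nn as nn

    R, T = 1.987e-3, 298.0  # gas constant (kcal/mol/K) and temperature (K)

    # Primitive (i), a learnable pattern recognizer: maps a feature vector
    # describing a protein pair to a predicted binding free energy dG.
    dg_net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

    def predicted_kd(features):
        # Primitive (ii), a fixed principles-based prior: the thermodynamic
        # relation Kd = exp(dG / RT) turns the learned dG into an observable.
        # It has no parameters, but gradients flow through it during training.
        dg = dg_net(features).squeeze(-1)
        return torch.exp(dg / (R * T))

    # Training step: the loss is computed on the measured observable (Kd),
    # and backpropagation updates only the learnable dG network.
    features = torch.randn(8, 32)       # toy descriptors for 8 protein pairs
    measured_kd = torch.rand(8) * 1e-6  # toy "experimental" Kd values (molar)
    loss = nn.functional.mse_loss(torch.log(predicted_kd(features)),
                                  torch.log(measured_kd))
    loss.backward()                     # gradients reach dg_net through exp()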
Figure 4: Protein Structure Prediction Vignette.
(a) A minimal end-to-end differentiable system for protein structure prediction accepts a variable-length protein sequence and processes it recurrently, implicitly learning sequence-torsion patterns (pink underlines indicate purely data-driven processes that do not rely on prior knowledge). These learned patterns are then converted sequentially into 3D coordinates using known (fixed) equations for converting sequences of torsion angles to Cartesian coordinates (blue underlines indicate purely knowledge-based processes that do not utilize learning). After the final structure is produced, a rotationally and translationally invariant error metric computes its deviation from an experimental structure, feeding this information back into the learning loop. (b) A more advanced system for protein structure prediction, based on reported features of AlphaFold2, would accept multiple sequence alignments of protein sequences, using attention to reason over individual sequences and residues in the alignment. Based on learned sequence-structure patterns, an initial set of 3D coordinates is predicted and then refined using attention mechanisms that operate directly on the 3D structure and that are equivariant to both translations and rotations. The predicted structure is then assessed using multiple error metrics, which are fed back into the learning loop.
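
The fixed torsion-to-Cartesian step in panel (a) can be sketched with the standard natural-extension-of-reference-frame (NeRF) recurrence shown below. This is one common way to implement such a layer; the bond lengths, angles, and toy loss are placeholder values, not those of any published system.

    import torch
    import torch.nn.functional as F

    def place_atom(a, b, c, bond_length, bond_angle, torsion):
        """NeRF step: given three previously placed atoms a, b, c, return the
        next atom's Cartesian coordinates from internal coordinates. Every
        operation is differentiable, so structural losses on the output send
        gradients back into the torsion angles."""
        bc = F.normalize(c - b, dim=-1)
        n = F.normalize(torch.cross(b - a, bc, dim=-1), dim=-1)
        m = torch.cross(n, bc, dim=-1)
        d = torch.stack([-torch.cos(bond_angle),
                         torch.sin(bond_angle) * torch.cos(torsion),
                         torch.sin(bond_angle) * torch.sin(torsion)]) * bond_length
        return c + d[0] * bc + d[1] * m + d[2] * n

    # Toy usage: extend a 3-atom seed with torsions standing in for a network's
    # output; ~1.5 A bonds and ~109.5 degree angles are placeholder geometry.
    coords = [torch.tensor(p) for p in ([0., 0., 0.], [1.5, 0., 0.], [2.0, 1.4, 0.])]
    torsions = torch.randn(10, requires_grad=True)
    for phi in torsions:
        coords.append(place_atom(coords[-3], coords[-2], coords[-1],
                                 torch.tensor(1.5), torch.tensor(1.91), phi))

    # Toy scalar loss just to show gradients flow; a real system would use a
    # rotation- and translation-invariant error metric such as dRMSD.
    loss = coords[-1].norm()
    loss.backward()  # gradients reach the torsion angles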
Figure 5: Protein-Protein Interaction Vignette.
An integrated system for data homogenization and prediction of protein-protein binding affinity is illustrated. The system accepts the sequences of two proteins (top), which are fed to a learned energy model to quantitatively predict their dissociation rate. To train the model, multiple data types with varying degrees of precision, directness, and physical characterization are used (bottom). Depending on the data type, a different data homogenizer (D.H.) is used to bring all data modalities into congruence. For quantitative data, conventional double-sided loss functions are used to train the model whenever its predictions deviate from the ground truth. For binary data, one-sided and potentially learnable loss functions are used (see main text) to penalize only those predictions that are clearly in conflict with the ground truth. The entire model, including the parameters of the energy model and the data homogenizers, is trained jointly, using an inner loop for the energy model and an outer loop for the data homogenizers to ensure correct training behavior. A key assumption of the model is that the number of distinct experimental conditions and assays is substantially smaller than the number of distinct data points (right); otherwise, the model is non-identifiable. Throughout the illustration, green indicates raw data, blue indicates terms coming from principles-based modeling, and pink indicates learnable quantities.
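
A one-sided loss of the kind described for binary data can be sketched as a hinge penalty. The log-Kd parameterization, threshold, and margin below are illustrative assumptions; the learnable variant discussed in the main text would replace these fixed constants with trained parameters.

    import torch

    def one_sided_binary_loss(pred_log_kd, is_binder, threshold=-6.0, margin=1.0):
        """Hinge-style one-sided loss for binary binding data. A reported
        binder is penalized only when its predicted log10(Kd) is clearly above
        the threshold (by more than the margin); a reported non-binder only
        when it is clearly below. Borderline predictions incur zero loss."""
        binder_violation = torch.relu(pred_log_kd - (threshold + margin))
        nonbinder_violation = torch.relu((threshold - margin) - pred_log_kd)
        return torch.where(is_binder, binder_violation, nonbinder_violation).mean()

    # Toy usage: two reported binders and one reported non-binder; only the
    # second prediction clearly conflicts with its label and gets a gradient.
    pred = torch.tensor([-8.0, -3.0, -4.0], requires_grad=True)
    labels = torch.tensor([True, True, False])
    loss = one_sided_binary_loss(pred, labels)
    loss.backward()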

