BMC Bioinformatics. 2021 Oct 20;22(1):510.
doi: 10.1186/s12859-021-04437-5.

Fast activation maximization for molecular sequence design

Johannes Linder et al. BMC Bioinformatics.

Abstract

Background: Optimization of DNA and protein sequences based on machine learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot coded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies, leading to poor convergence.
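The basic activation maximization loop described above can be illustrated with a minimal sketch. It assumes a toy linear oracle (`toy_score`, with a hand-written weight matrix) in place of a real deep learning predictor, and a plain softmax as the continuous relaxation; the function names, learning rate and step count are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
import random

ALPHABET = "ACGT"

def softmax(row):
    """Numerically stable softmax over one position's logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def toy_score(probs, weights):
    """Hypothetical linear oracle: sum of per-position, per-letter weights."""
    return sum(p * w for p_row, w_row in zip(probs, weights)
               for p, w in zip(p_row, w_row))

def activation_maximization(weights, steps=200, lr=0.5, seed=0):
    """Gradient ascent on the logits of a softmax-relaxed sequence."""
    rng = random.Random(seed)
    logits = [[rng.uniform(-1.0, 1.0) for _ in ALPHABET] for _ in weights]
    for _ in range(steps):
        for row, w_row in zip(logits, weights):
            p = softmax(row)
            expected = sum(pk * wk for pk, wk in zip(p, w_row))
            for k in range(len(row)):
                # Chain rule through the softmax:
                # d score / d logit_k = p_k * (w_k - E_p[w]).
                row[k] += lr * p[k] * (w_row[k] - expected)
    return logits

def decode(logits):
    """Read out the highest-scoring letter at each position."""
    return "".join(ALPHABET[row.index(max(row))] for row in logits)

# Toy target (an assumption): the oracle rewards A, G, T, C at positions 0..3.
weights = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
logits = activation_maximization(weights)
designed = decode(logits)
final_score = toy_score([softmax(r) for r in logits], weights)
```

Note that every gradient term is multiplied by a softmax probability `p[k]`. When the logits become skewed, these factors approach zero, which is one way to see the vanishing-gradient issue mentioned in the Background.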

Results: Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp's capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor.
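The combination of normalization and straight-through approximation can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the published implementation: the normalization is taken over all logits at once and has no learnable scale or offset, the oracle is a hypothetical toy function whose gradient is written by hand (`oracle_grad`), and the update is plain gradient ascent with an arbitrary step size.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over one position's logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(logits, eps=1e-8):
    """Shift and scale all logits to zero mean and (almost) unit variance."""
    flat = [v for row in logits for v in row]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((v - mean) ** 2 for v in flat) / len(flat) + eps)
    return [[(v - mean) / std for v in row] for row in logits], std

def sample_one_hot(p, rng):
    """Draw one letter from the categorical distribution p."""
    k = rng.choices(range(len(p)), weights=p)[0]
    return [1.0 if j == k else 0.0 for j in range(len(p))]

def oracle_grad(x, weights):
    """Gradient of a toy oracle score(x) = s + s^2 / 2, s = sum(weights * x)."""
    s = sum(xv * wv for x_row, w_row in zip(x, weights)
            for xv, wv in zip(x_row, w_row))
    return [[(1.0 + s) * wv for wv in w_row] for w_row in weights]

def straight_through_grad(logits, weights, rng):
    """One forward/backward pass: returns (sampled one-hot, d score / d logits)."""
    z, std = normalize(logits)
    probs = [softmax(row) for row in z]
    x = [sample_one_hot(p, rng) for p in probs]  # forward: discrete sample
    gx = oracle_grad(x, weights)                 # gradient evaluated at the sample
    # Straight-through: backpropagate as if x had been the softmax output.
    gz = []
    for p, g_row in zip(probs, gx):
        expected = sum(pk * gk for pk, gk in zip(p, g_row))
        gz.append([pk * (gk - expected) for pk, gk in zip(p, g_row)])
    # Backward pass through the normalization.
    n = sum(len(row) for row in z)
    mean_g = sum(g for g_row in gz for g in g_row) / n
    mean_gz = sum(g * v for g_row, z_row in zip(gz, z)
                  for g, v in zip(g_row, z_row)) / n
    grad = [[(g - mean_g - v * mean_gz) / std for g, v in zip(g_row, z_row)]
            for g_row, z_row in zip(gz, z)]
    return x, grad

rng = random.Random(0)
weights = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 0, 0]]  # toy target
logits = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in weights]
for _ in range(100):
    x, grad = straight_through_grad(logits, weights, rng)
    logits = [[v + 0.1 * g for v, g in zip(l_row, g_row)]
              for l_row, g_row in zip(logits, grad)]
```

In this sketch the oracle only ever sees discrete one-hot samples, while the normalization keeps the inputs to the softmax at a fixed scale so that they cannot drift into a saturated regime.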

Conclusions: Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable, and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.
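One way such a regularizer could enter the objective is as a penalty on sequences whose estimated likelihood is too low. The sketch below is an assumption made for illustration only: the hinge form, the function name `regularized_objective`, and all numeric values are hypothetical, and the likelihood estimate would come from a separately trained model (for example the VAE mentioned in Fig. 3).

```python
def regularized_objective(fitness, log_likelihood, min_log_likelihood, weight=1.0):
    """Fitness minus a hinge penalty that is active only when the estimated
    log likelihood falls below a chosen floor (hypothetical formulation)."""
    shortfall = max(0.0, min_log_likelihood - log_likelihood)
    return fitness - weight * shortfall

# A plausible design (above the floor) keeps its full fitness.
plausible = regularized_objective(
    fitness=2.0, log_likelihood=-40.0, min_log_likelihood=-50.0)

# A higher-fitness but implausible design is penalized for its shortfall.
implausible = regularized_objective(
    fitness=3.0, log_likelihood=-80.0, min_log_likelihood=-50.0, weight=0.1)
```

With these illustrative numbers, the second design ends up with a lower objective than the first despite its higher raw fitness, which is the intended effect of the penalty.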

Keywords: Activation maximization; DNA; Deep learning; Design; Gradient ascent; Neural network; Optimization; Protein; RNA; Sequence design.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Fast activation maximization for sequence design. a The Fast SeqProp pipeline. A normalization layer is prepended to a softmax layer, which is used as parameters to a sampling layer. b Maximizing the predictors DragoNN (SPI1), DeepSEA (CTCF Dnd41), MPRA-DragoNN (SV40), Optimus 5’ and APARENT
Fig. 2
Example designed sequences. Softmax sequences (PSSMs) generated by the PWM and Fast SeqProp methods after 20,000 gradient ascent updates with default optimizer parameters (Adam). The logit matrices l were initialized uniformly at random prior to optimization. Identified cis-regulatory motifs are annotated above each sequence
Fig. 3
Regularized sequence design. a Top: VAE-regularized Fast SeqProp. A variational autoencoder (VAE) is used to control the estimated likelihood of designed sequences during gradient ascent optimization. Bottom: Estimated VAE log likelihood distribution of random sequences (green), test sequences from the MPRA-DragoNN dataset (orange) and designed sequences (red), using Fast SeqProp without and with VAE regularization (top and bottom histogram respectively). b Oracle fitness score trajectories (APARENT, MPRA-DragoNN and Optimus 5’) and validation model score trajectories (DeeReCT-APA, iEnhancer-2L and retrained Optimus 5’) as a function of the cumulative number of predictor calls made during the sequence design phase. Shown are the median scores across 10 samples per design method, for three repeats. c Example designed sequences for APARENT, MPRA-DragoNN and Optimus 5’, using Fast SeqProp with and without VAE-regularization. Oracle and validation model scores are annotated on the right
Fig. 4
Protein structure optimization. a Protein sequences are designed to minimize the KL-divergence between predicted and target distance and angle distributions. The one-hot pattern is used for two of the trRosetta inputs. b Generating sequences that conform to the target predicted structure of a Sensor Histidine Kinase. Simulated Annealing was tested at several initial temperatures, with 1 substitution per step. Similarly, SeqProp and Fast SeqProp were tested at several combinations of learning rate and momentum. c Predicted residue distance distributions after 200 iterations
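The simulated annealing baseline from Fig. 4 (one substitution per step, minimizing a KL-divergence loss) can be sketched as below. The sketch makes several assumptions: a stand-in "predictor" that simply returns smoothed letter frequencies takes the place of trRosetta, the divergence is computed as KL(target || predicted), and the linear cooling schedule, temperature and step count are arbitrary. All function names are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def toy_predicted_distribution(seq, pseudocount=0.5):
    """Stand-in predictor (an assumption): smoothed letter frequencies."""
    total = len(seq) + pseudocount * len(AMINO_ACIDS)
    return [(seq.count(a) + pseudocount) / total for a in AMINO_ACIDS]

def simulated_annealing(target, length=30, steps=2000, t0=0.1, seed=0):
    """Minimize KL(target || predicted) with at most one substitution per step."""
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    loss = kl_divergence(target, toy_predicted_distribution(seq))
    initial_loss = best_loss = loss
    best_seq = list(seq)
    for step in range(steps):
        # Linear cooling; the small constant avoids division by zero.
        temperature = t0 * (1.0 - step / steps) + 1e-9
        pos = rng.randrange(length)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)  # propose a single substitution
        new_loss = kl_divergence(target, toy_predicted_distribution(seq))
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if new_loss <= loss or rng.random() < math.exp((loss - new_loss) / temperature):
            loss = new_loss
            if loss < best_loss:
                best_loss, best_seq = loss, list(seq)
        else:
            seq[pos] = old  # reject: undo the substitution
    return "".join(best_seq), initial_loss, best_loss

# Toy target distribution (an assumption): half alanine, half leucine.
target = [0.5 if a in "AL" else 0.0 for a in AMINO_ACIDS]
designed, initial_loss, best_loss = simulated_annealing(target)
```

Each step of this baseline costs one predictor call and changes a single residue, whereas a gradient-based method can update every position of the sequence in one step.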
