Fast activation maximization for molecular sequence design

Johannes Linder¹, Georg Seelig²,³

Affiliations

¹ Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA. jlinder2@cs.washington.edu.
² Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA.
³ Department of Electrical and Computer Engineering, University of Washington, Seattle, USA.

PMID: 34670493
PMCID: PMC8527647
DOI: 10.1186/s12859-021-04437-5

Johannes Linder et al. BMC Bioinformatics. 2021 Oct 20;22(1):510. doi: 10.1186/s12859-021-04437-5.

Abstract

Background: Optimization of DNA and protein sequences based on machine learning models is becoming a powerful tool for molecular design. Activation maximization offers a simple design strategy for differentiable models: one-hot encoded sequences are first approximated by a continuous representation, which is then iteratively optimized with respect to the predictor oracle by gradient ascent. While elegant, the current version of the method suffers from vanishing gradients and may cause predictor pathologies, leading to poor convergence.
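
As a rough illustration of the baseline strategy described above, the following minimal sketch relaxes a one-hot sequence into per-position softmax probabilities and runs gradient ascent on the underlying logits. The predictor here is a hypothetical linear scoring matrix (`weights`), not one of the deep learning oracles from the paper, and the function names, learning rate, and step count are illustrative assumptions.

```python
import math
import random

ALPHABET = "ACGT"


def softmax(row):
    """Numerically stable softmax over one sequence position."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def oracle_score(x, weights):
    """Hypothetical linear oracle: sum over positions of weight * probability."""
    return sum(w * p for wrow, xrow in zip(weights, x) for w, p in zip(wrow, xrow))


def softmax_backward(probs, grad_probs):
    """Vector-Jacobian product of softmax: gradient with respect to the logits."""
    dot = sum(p * g for p, g in zip(probs, grad_probs))
    return [p * (g - dot) for p, g in zip(probs, grad_probs)]


def activation_maximization(weights, steps=500, lr=0.5, seed=0):
    """Plain gradient ascent on logits through a softmax relaxation."""
    rng = random.Random(seed)
    logits = [[rng.uniform(-1, 1) for _ in ALPHABET] for _ in weights]
    for _ in range(steps):
        x = [softmax(row) for row in logits]
        for i, (probs, wrow) in enumerate(zip(x, weights)):
            # For a linear oracle, d score / d x is simply the weight row.
            grad = softmax_backward(probs, wrow)
            logits[i] = [l + lr * g for l, g in zip(logits[i], grad)]
    return logits


def decode(logits):
    """Read out the most likely letter at each position."""
    return "".join(ALPHABET[max(range(len(row)), key=lambda j: row[j])] for row in logits)


if __name__ == "__main__":
    # Toy target: position i prefers letter i, so the optimum is "ACGT".
    target = [[1.0 if j == i else 0.0 for j in range(4)] for i in range(4)]
    print(decode(activation_maximization(target)))  # prints "ACGT"
```

Note that the logit gradient is scaled by the softmax probabilities themselves, so it shrinks as a position saturates or when the preferred letter starts with low probability. This is one simple way to see the vanishing-gradient behavior mentioned above.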

Results: Here, we introduce Fast SeqProp, an improved activation maximization method that combines straight-through approximation with normalization across the parameters of the input sequence distribution. Fast SeqProp overcomes bottlenecks in earlier methods arising from input parameters becoming skewed during optimization. Compared to prior methods, Fast SeqProp results in up to 100-fold faster convergence while also finding improved fitness optima for many applications. We demonstrate Fast SeqProp's capabilities by designing DNA and protein sequences for six deep learning predictors, including a protein structure predictor.
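
The combination of logit normalization and a straight-through estimator can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the logits are standardized jointly over the whole matrix with a single learnable scale `gamma` (the abstract does not specify the exact normalization), the predictor is a hypothetical `toy_oracle` with non-negative weights, and plain gradient ascent stands in for whatever optimizer is used in practice.

```python
import math
import random

ALPHABET = "ACGT"


def normalize(logits, eps=1e-5):
    """Jointly standardize all logits to zero mean and unit variance (assumed scheme)."""
    flat = [v for row in logits for v in row]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    sigma = math.sqrt(var + eps)
    return [[(v - mu) / sigma for v in row] for row in logits], sigma


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sample_one_hot(probs, rng):
    """Draw one discrete letter per position from the softmax distribution."""
    idx = rng.choices(range(len(probs)), weights=probs)[0]
    return [1.0 if j == idx else 0.0 for j in range(len(probs))]


def toy_oracle(x, weights):
    """Hypothetical predictor (needs non-negative weights): returns score and d score / d x."""
    s = sum(w * v for wr, xr in zip(weights, x) for w, v in zip(wr, xr))
    return math.log1p(s), [[w / (1.0 + s) for w in wr] for wr in weights]


def fast_seqprop_step(logits, gamma, weights, lr, rng):
    z, sigma = normalize(logits)
    probs = [softmax([gamma * v for v in row]) for row in z]
    x = [sample_one_hot(p, rng) for p in probs]  # discrete forward pass
    score, grad_x = toy_oracle(x, weights)
    # Straight-through: pass d score / d x to the softmax as if sampling were identity.
    grad_y = []
    for p, g in zip(probs, grad_x):
        dot = sum(pj * gj for pj, gj in zip(p, g))
        grad_y.append([pj * (gj - dot) for pj, gj in zip(p, g)])
    # Backward pass through y = gamma * z and the joint standardization.
    n = sum(len(row) for row in z)
    grad_gamma = sum(g * v for gr, zr in zip(grad_y, z) for g, v in zip(gr, zr))
    mean_g = sum(g for gr in grad_y for g in gr) / n
    mean_gz = grad_gamma / n
    scale = lr * gamma / sigma
    new_logits = [
        [l + scale * (g - mean_g - v * mean_gz) for l, g, v in zip(lrow, gr, zr)]
        for lrow, gr, zr in zip(logits, grad_y, z)
    ]
    return new_logits, gamma + lr * grad_gamma, x, score


def design(weights, steps=300, lr=0.5, seed=0):
    """Run the sketch and return the best sampled sequence and its oracle score."""
    rng = random.Random(seed)
    logits = [[rng.uniform(-1, 1) for _ in ALPHABET] for _ in weights]
    gamma, best_seq, best_score = 1.0, None, float("-inf")
    for _ in range(steps):
        logits, gamma, x, score = fast_seqprop_step(logits, gamma, weights, lr, rng)
        if score > best_score:
            best_seq = "".join(ALPHABET[row.index(1.0)] for row in x)
            best_score = score
    return best_seq, best_score
```

Because the oracle only ever sees sampled one-hot sequences, it is evaluated on inputs of the kind it was trained on, while the normalization keeps the logits on a fixed scale and lets `gamma` control how sharp the sampling distribution becomes.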

Conclusions: Fast SeqProp offers a reliable and efficient method for general-purpose sequence optimization through a differentiable fitness predictor. As demonstrated on a variety of deep learning models, the method is widely applicable, and can incorporate various regularization techniques to maintain confidence in the sequence designs. As a design tool, Fast SeqProp may aid in the development of novel molecules, drug therapies and vaccines.

Keywords: Activation maximization; DNA; Deep learning; Design; Gradient ascent; Neural network; Optimization; Protein; RNA; Sequence design.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Fast activation maximization for sequence design. a The Fast SeqProp pipeline. A normalization layer is prepended to a softmax layer, which is used as parameters to a sampling layer. b Maximizing the predictors DragoNN (SPI1), DeepSEA (CTCF Dnd41), MPRA-DragoNN (SV40), Optimus 5’ and APARENT

**Fig. 2**
Example designed sequences. Softmax sequences (PSSMs) generated by the PWM and Fast SeqProp methods after 20,000 gradient ascent updates with default optimizer parameters (Adam). The logit matrices $l$ were uniformly randomly initialized prior to optimization. Identified cis-regulatory motifs are annotated above each sequence

**Fig. 3**
Regularized sequence design. a Top: VAE-regularized Fast SeqProp. A variational autoencoder (VAE) is used to control the estimated likelihood of designed sequences during gradient ascent optimization. Bottom: Estimated VAE log likelihood distribution of random sequences (green), test sequences from the MPRA-DragoNN dataset (orange) and designed sequences (red), using Fast SeqProp without and with VAE regularization (top and bottom histogram respectively). b Oracle fitness score trajectories (APARENT, MPRA-DragoNN and Optimus 5’) and validation model score trajectories (DeeReCT-APA, iEnhancer-2L and retrained Optimus 5’) as a function of the cumulative number of predictor calls made during the sequence design phase. Shown are the median scores across 10 samples per design method, for three repeats. c Example designed sequences for APARENT, MPRA-DragoNN and Optimus 5’, using Fast SeqProp with and without VAE-regularization. Oracle and validation model scores are annotated on the right

**Fig. 4**
Protein structure optimization. a Protein sequences are designed to minimize the KL-divergence between predicted and target distance and angle distributions. The one-hot pattern is used for two of the trRosetta inputs. b Generating sequences which conform to the target predicted structure of a Sensor Histidine Kinase. Simulated Annealing was tested at several initial temperatures, with 1 substitution per step. Similarly, SeqProp and Fast SeqProp were tested at several combinations of learning rate and momentum. c Predicted residue distance distributions after 200 iterations
