A universal deep-learning model for zinc finger design enables transcription factor reprogramming

David M Ichikawa^#^{1,2}, Osama Abdin^#³, Nader Alerasool⁴, Manjunatha Kogenaru¹, April L Mueller¹, Han Wen⁴, David O Giganti¹, Gregory W Goldberg¹, Samantha Adams¹, Jeffrey M Spencer¹, Rozita Razavi^{3,4}, Satra Nim⁴, Hong Zheng^{3,4}, Courtney Gionco¹, Finnegan T Clark¹, Alexey Strokach⁵, Timothy R Hughes^{3,4}, Timothee Lionnet¹, Mikko Taipale^{3,4}, Philip M Kim^{6,7,8}, Marcus B Noyes^{9,10}

Affiliations

¹ Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA.
² Department of Biochemistry and Molecular Pharmacology, NYU Grossman School of Medicine, New York, NY, USA.
³ Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada.
⁴ Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada.
⁵ Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.
⁶ Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada. pi@kimlab.org.
⁷ Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada. pi@kimlab.org.
⁸ Department of Computer Science, University of Toronto, Toronto, Ontario, Canada. pi@kimlab.org.
⁹ Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA. marcus.noyes@nyulangone.org.
¹⁰ Department of Biochemistry and Molecular Pharmacology, NYU Grossman School of Medicine, New York, NY, USA. marcus.noyes@nyulangone.org.

^# Contributed equally.

PMID: 36702896
PMCID: PMC10421740
DOI: 10.1038/s41587-022-01624-4

David M Ichikawa et al. Nat Biotechnol. 2023 Aug;41(8):1117-1129. doi: 10.1038/s41587-022-01624-4. Epub 2023 Jan 26.

Abstract

Cys₂His₂ zinc finger (ZF) domains engineered to bind specific target sequences in the genome provide an effective strategy for programmable regulation of gene expression, with many potential therapeutic applications. However, the structurally intricate engagement of ZF domains with DNA has made their design challenging. Here we describe the screening of 49 billion protein-DNA interactions and the development of a deep-learning model, ZFDesign, that solves ZF design for any genomic target. ZFDesign is a modern machine learning method that models global and target-specific differences induced by a range of library environments and specifically takes into account compatibility of neighboring fingers using a novel hierarchical transformer architecture. We demonstrate the versatility of designed ZFs as nucleases as well as activators and repressors by seamless reprogramming of human transcription factors. These factors could be used to upregulate an allele of haploinsufficiency, downregulate a gain-of-function mutation or test the consequence of regulation of a single gene as opposed to the many genes that a transcription factor would normally influence.

Conflict of interest statement

M.T., P.M.K. and M.B.N. are founders of TBG Therapeutics. Intellectual property has been filed on the method for generation of ZFs, the design model and the method to reprogram TFs. The remaining authors declare no competing interests.

Figures

**Fig. 1. Overview of interface-focused ZF screens.**
a, Structure of adjacent ZF domains showing their close proximity. Helical position 6 of domain 1 (red) and position −1 (blue) of domain 2 are outlined. b, Cartoon of interactions between adjacent helices and DNA. The six helical positions of the three domains are shown as circles, with the common contacts made by positions −1, 2, 3 and 6 indicated by arrows. The overlap environment, which includes the base adjacent to the library interaction and the amino acid used to specify that base, is highlighted in green. This environment is unique for each library. c, Cartoon of B1H selections. The three-fingered protein is expressed as a C-terminal fusion to the omega subunit of RNA polymerase. For each library, ZF domain 2 is randomized at six helical positions and screened for amino acid combinations able to specify each of the 64 possible NNN targets. This is done in 64 independent screens. Domains 0 and 1 bind to their known, preferred targets and thereby anchor the protein adjacent to the NNN target sequence and present an overlap environment unique to that library. Only helices able to bind the target in the unique library overlap environment will recruit the polymerase, activate the reporter and survive on selective media. d, Left, helical residues for domains 0, 1 and 2 are shown for each library screened. Domain 2 contains all possible combinations of the six helical residues whereas domain 1 is fixed in the selections but varied by library. The sixth residue of domain 1 is the side chain that will be exposed at the interface between domains 1 and 2. Domain 0 is the same in all libraries except library 1. Right, there are 64 DNA targets for domain 2 to be screened against in 64 independent selections. The fixed targets for domain 1 of each library are shown, with the overlap base color coded by nucleotide. e, Left, to assay the success of each selection we determined clusters from the data for each selection. 
Here we show the maximum information content at one position of the strongest cluster to provide a relative measure of enrichment across all selections. Right, molecular dynamic simulations were performed on all domain 1 helices in their previously characterized contexts. The number of suggested contacts between domain 1 and the DNA is shown for each library.
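
The screening layout and the enrichment measure in panels c and e can be illustrated with a minimal sketch. It enumerates the 64 NNN targets and computes a per-position information content for a cluster of six-residue helices. The formula (log2 of a 20-letter alphabet minus Shannon entropy, with no background or small-sample correction), the function names and the example helices are all assumptions for illustration, not the authors' pipeline.

```python
import math
from collections import Counter
from itertools import product

# All 64 possible 3-base (NNN) targets, one per independent selection.
TARGETS = ["".join(bases) for bases in product("ACGT", repeat=3)]


def position_information(helices, position, alphabet_size=20):
    """Information content in bits at one helix position.

    Assumed definition: log2(alphabet size) minus the Shannon entropy of
    the residues observed at that position (uniform background).
    """
    counts = Counter(helix[position] for helix in helices)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def max_information(helices):
    """Largest per-position information content across the helix."""
    return max(position_information(helices, i) for i in range(len(helices[0])))


# Hypothetical cluster of enriched six-residue helices (not real data).
cluster = ["QSRYTT", "QSRHTT", "QARYTT", "QSRYTA"]
print(len(TARGETS), round(max_information(cluster), 2))
```

In this toy cluster the first position is invariant, so the maximum equals log2(20), about 4.32 bits; a position with mixed residues scores lower.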

**Fig. 2. Specificity solutions are library specific.**
a, Top, dot plot of 1 − Hamming distance, comparing the similarity of helical strategies enriched in libraries 1–9 for three G-rich targets (right) and three G-poor targets (left). The darkness of the dot represents the similarity of the enriched populations, with darker dots being more similar. Empty spots indicate a failed target selection for one or both of the libraries compared. Bottom, normalized Hamming distance for all libraries across all targets, listed from least similar (left) to most similar (right). The targets compared above are underlined in yellow for G-poor targets and in blue for G-rich targets. b, Clusters were determined by MUSI from the enriched helices in each library selection. Three clusters are shown for four different binding sites (CCA, TTT, CCG and GAG). If a cluster was enriched in a library selection, the corresponding box is filled black in the table. c, Schematic illustration (top) and molecular dynamics snapshot (bottom) of hydrogen bonds between the arginine at position 2 of the domain 2 helix QsRYtt with the G* of the CCG* target when an asparagine is at position 6 of the adjacent finger (library 2 environment), or when an arginine is at position 6 of the adjacent finger (library 3, 9 environment). d, Left, paired format for two-finger selections using the base-skipping linker to encourage modularity, allowing test pairs (yellow) to function independently from the fixed pair (blue). Right, cartoon of B1H two-finger selections. e, The number of helices enriched in two-finger selections is shown as a factor of the number of single-finger libraries in which they originated. f, Comparison of helices enriched in the two-finger selections showing average number of single-finger libraries in which a helix originated, by binding site.
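
As a rough illustration of the similarity measure in panel a, the sketch below computes the Hamming distance between two six-residue helices and a similarity score taken as one minus the length-normalized distance. The caption does not state how whole enriched populations are aggregated, so the mean over all cross-library pairs used here is an assumption, as are the function names and example helices.

```python
def hamming(a, b):
    """Number of positions at which two equal-length helices differ."""
    if len(a) != len(b):
        raise ValueError("helices must have the same length")
    return sum(x != y for x, y in zip(a, b))


def similarity(a, b):
    """One minus the Hamming distance normalized by helix length."""
    return 1 - hamming(a, b) / len(a)


def mean_pairwise_similarity(pop_a, pop_b):
    """Assumed aggregation: average similarity over all cross-population pairs."""
    scores = [similarity(a, b) for a in pop_a for b in pop_b]
    return sum(scores) / len(scores)


# Hypothetical enriched helices from two libraries (not real data).
library_x = ["QSRYTT", "QSRHTT"]
library_y = ["QSRYTT", "DSRYTA"]
print(hamming("QSRYTT", "QSRHTT"), mean_pairwise_similarity(library_x, library_y))
```

A score of 1 means the two helices are identical and 0 means they differ at every position, which matches the caption's convention that darker dots mark more similar populations.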

**Fig. 3. An interface-focused ZF design model.**
a, The model comprises two modules trained on single-helix B1H selections to predict residues in partially masked helices that bind 4-mer nucleotide sequences. b, The residue embeddings generated from these modules are fed into a third module that learns interhelix compatibility. The full model is trained on two-helix B1H selection data to predict residues in partially masked helix pairs that bind 7-mer nucleotide sequences. In the model architecture schematic, layer normalization is abbreviated to “layer norm.” and concatenation is abbreviated to “concat”.

**Fig. 4. Performance of two-helix design model.**
a, Training and validation accuracy during pretraining step. b, Training and validation accuracy during fine-tuning step. c, Helix sequence reconstruction accuracy with different numbers of masked residues. d, Comparison of differences between predicted and real selection logos using the developed model and ZFPred based on the mean-square error (MSE) of predicted position weight matrices (PWMs) to ground-truth PWMs. e, Comparison of differences between predicted and real selection logos using the two-helix model and concatenated logos from the single-helix design model. f, Comparison of differences between predicted and real selection logos using the two-helix model and concatenated logos from single-helix B1H selections. g, Predicted logos, real B1H logos and concatenated single-helix B1H logos for test set sequences.
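
The PWM comparison in panel d can be sketched as follows, assuming each PWM is a list of per-position probability rows and the MSE is averaged over every cell. The function name, the averaging convention and the toy four-letter matrices are illustrative assumptions; real helix PWMs would have one column per amino acid.

```python
def pwm_mse(predicted, truth):
    """Mean-square error between two PWMs of identical shape.

    Each PWM is a list of rows (one per position); each row is a list of
    probabilities (one per letter). The error is averaged over all cells.
    """
    if len(predicted) != len(truth):
        raise ValueError("PWMs must have the same number of positions")
    squared = []
    for pred_row, true_row in zip(predicted, truth):
        if len(pred_row) != len(true_row):
            raise ValueError("PWM rows must have the same alphabet size")
        squared.extend((p - t) ** 2 for p, t in zip(pred_row, true_row))
    return sum(squared) / len(squared)


# Toy two-position PWMs over a four-letter alphabet (not real data).
predicted = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
truth = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
print(pwm_mse(predicted, truth))
```

Here the first position matches exactly and the second contributes two squared errors of 0.25, giving 0.5 / 8 = 0.0625.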

**Fig. 5. RTFs.**
a, Left, the ZFs of KLF6 are seamlessly replaced by designed ZFs. The consensus ZF motif, listed below, is used to guide the seamless replacement of parent ZFs. Right, sequence of the KLF6 TF and precise location of ZF replacement. b, A GFP reporter is activated with four ZF arrays designed to bind the tetO sequence. Array 3 is used to show that TFs other than KLF6 can also be reprogrammed to bind the tetO sequence and regulate the target. c, A GFP reporter is repressed by ZIM3 reprogrammed with tetO-binding array 3. This array can also be used to reprogram other repressing TFs in addition to ZIM3. d, Relative expression of *CDKN1C* by KLF6 reprogrammed with seven ZF arrays designed to bind sequences upstream of the TSS. e, Relative expression of *DPH1* repressed by ZIM3 reprogrammed with 11 ZF arrays designed to bind sequences downstream of the TSS.

**Fig. 6. ZF specificity and genome-wide activity.**
a, Genome-wide RNA-seq results for CDKN1C arrays 125 and 200 and DPH1 array 15, and comparison with an array that binds the reverse complement of CDKN1C array 125. b, Left, structure of a ZF bound to DNA highlighting two potential phosphate contacts. Right, the human ZF consensus with phosphate-contacting positions highlighted in yellow (−5) and blue (9). c, qPCR comparison for activation of the on-target CDKN1C gene as well as two off-target sequences with CDKN1C array 200 with between none and eight modifications at phosphate-contacting position −5. d, RNA-seq results for CDKN1C array 200 with arginines or glutamines at the −5 position of each ZF. e, On-target qPCR results for arrays with the N-terminal (F3–8) or C-terminal (F1–6) ZF pairs removed compared to an empty vector negative control (neg.). f, Specificity of CDKN1C array 200 with glutamine at position −5 as determined by ChIP–seq, B1H selection at low (5 mM) and high (10 mM) stringency and specificity as predicted by ZFDesign. B1H specificity is a concatenation of the specificities determined for each of the two-finger pairs. ChIP–seq peaks contained two independent motifs, suggesting that the base-skipping linker allows modular, independent binding.
