Nat Biotechnol. 2023 Aug;41(8):1117-1129.
doi: 10.1038/s41587-022-01624-4. Epub 2023 Jan 26.

A universal deep-learning model for zinc finger design enables transcription factor reprogramming

David M Ichikawa et al. Nat Biotechnol. 2023 Aug.

Abstract

Cys2His2 zinc finger (ZF) domains engineered to bind specific target sequences in the genome provide an effective strategy for programmable regulation of gene expression, with many potential therapeutic applications. However, the structurally intricate engagement of ZF domains with DNA has made their design challenging. Here we describe the screening of 49 billion protein-DNA interactions and the development of a deep-learning model, ZFDesign, that solves ZF design for any genomic target. ZFDesign is a modern machine learning method that models global and target-specific differences induced by a range of library environments and specifically takes into account compatibility of neighboring fingers using a novel hierarchical transformer architecture. We demonstrate the versatility of designed ZFs as nucleases as well as activators and repressors by seamless reprogramming of human transcription factors. These factors could be used to upregulate an allele of haploinsufficiency, downregulate a gain-of-function mutation or test the consequence of regulation of a single gene as opposed to the many genes that a transcription factor would normally influence.

Conflict of interest statement

M.T., P.M.K. and M.B.N. are founders of TBG Therapeutics. Intellectual property has been filed on the method for generation of ZFs, the design model and the method to reprogram TFs. The remaining authors declare no competing interests.

Figures

Fig. 1. Overview of interface-focused ZF screens.
a, Structure of adjacent ZF domains showing their close proximity. Helical position 6 of domain 1 (red) and position −1 (blue) of domain 2 are outlined. b, Cartoon of interactions between adjacent helices and DNA. The six helical positions of the three domains are shown as circles, with the common contacts made by positions −1, 2, 3 and 6 indicated by arrows. The overlap environment, which includes the base adjacent to the library interaction and the amino acid used to specify that base, is highlighted in green. This environment is unique for each library. c, Cartoon of B1H selections. The three-fingered protein is expressed as a C-terminal fusion to the omega subunit of RNA polymerase. For each library, ZF domain 2 is randomized at six helical positions and screened for amino acid combinations able to specify each of the 64 possible NNN targets. This is done in 64 independent screens. Domains 0 and 1 bind to their known, preferred targets and thereby anchor the protein adjacent to the NNN target sequence and present an overlap environment unique to that library. Only helices able to bind the target in the unique library overlap environment will recruit the polymerase, activate the reporter and survive on selective media. d, Left, helical residues for domains 0, 1 and 2 are shown for each library screened. Domain 2 contains all possible combinations of the six helical residues whereas domain 1 is fixed in the selections but varied by library. The sixth residue of domain 1 is the side chain that will be exposed at the interface between domains 1 and 2. Domain 0 is the same in all libraries except library 1. Right, there are 64 DNA targets for domain 2 to be screened against in 64 independent selections. The fixed targets for domain 1 of each library are shown, with the overlap base color coded by nucleotide. e, Left, to assay the success of each selection, we determined clusters from the data for each selection. Here we show the maximum information content at one position of the strongest cluster to provide a relative measure of enrichment across all selections. Right, molecular dynamics simulations were performed on all domain 1 helices in their previously characterized contexts. The number of suggested contacts between domain 1 and the DNA is shown for each library.
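The screening space described in panels c and d can be made concrete with a small sketch. It enumerates the 64 NNN targets and counts the six-position helix combinations for domain 2. The count assumes all 20 standard amino acids are allowed at each randomized position, which the caption implies ("all possible combinations") but does not state explicitly; the names `nnn_targets` and `N_AMINO_ACIDS` are illustrative only.

```python
from itertools import product

BASES = "ACGT"
# Assumption: each of the six randomized helical positions may hold any of
# the 20 standard amino acids.
N_AMINO_ACIDS = 20
N_RANDOMIZED_POSITIONS = 6


def nnn_targets():
    """Return every 3-base DNA target, in alphabetical order."""
    return ["".join(triplet) for triplet in product(BASES, repeat=3)]


targets = nnn_targets()
helix_space = N_AMINO_ACIDS ** N_RANDOMIZED_POSITIONS

print(len(targets))   # number of independent selections per library
print(helix_space)    # helix variants per library under the assumption above
```

Under that assumption, each library pairs every one of its helix variants with each of the 64 targets, one target per selection.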
Fig. 2. Specificity solutions are library specific.
a, Top, dot plot based on Hamming distance comparing the similarity of helical strategies enriched in libraries 1–9 for three G-rich targets (right) and three G-poor targets (left). The darkness of the dot represents the similarity of the enriched populations, with darker dots being more similar. Empty spots indicate a failed target selection for one or both of the libraries compared. Bottom, normalized Hamming distance for all libraries across all targets, listed from least similar (left) to most similar (right). The targets compared above are underlined in yellow for G-poor targets and in blue for G-rich targets. b, Clusters were determined by MUSI from the enriched helices in each library selection. Three clusters are shown for four different binding sites (CCA, TTT, CCG and GAG). If a cluster was enriched in a library selection, the corresponding box is filled black in the table. c, Schematic illustration (top) and molecular dynamics snapshot (bottom) of hydrogen bonds between the arginine at position 2 of the domain 2 helix QsRYtt with the G* of the CCG* target when an asparagine is at position 6 of the adjacent finger (library 2 environment), or when an arginine is at position 6 of the adjacent finger (library 3 and 9 environments). d, Left, paired format for two-finger selections using the base-skipping linker to encourage modularity, allowing test pairs (yellow) to function independently from the fixed pair (blue). Right, cartoon of B1H two-finger selections. e, The number of helices enriched in two-finger selections is shown as a function of the number of single-finger libraries in which they originated. f, Comparison of helices enriched in the two-finger selections showing average number of single-finger libraries in which a helix originated, by binding site.
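For readers unfamiliar with the metric in panel a, the following is a minimal sketch of Hamming distance between two six-residue helices, plus a length-normalized variant. How the authors aggregate distances over whole enriched populations is not described in this caption, so the sketch covers only the per-pair calculation. The helix `QSRYTT` is the caption's QsRYtt written in upper case; `QSRHTT` is a hypothetical variant invented for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def normalized_hamming(a: str, b: str) -> float:
    """Hamming distance divided by sequence length (0 = identical, 1 = all differ)."""
    if not a:
        raise ValueError("sequences must be non-empty")
    return hamming(a, b) / len(a)


# QSRHTT is a made-up helix differing from QSRYTT at one position.
print(hamming("QSRYTT", "QSRHTT"))
print(normalized_hamming("QSRYTT", "QSRHTT"))
```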
Fig. 3. An interface-focused ZF design model.
a, The model comprises two modules trained on single-helix B1H selections to predict residues in partially masked helices that bind 4-mer nucleotide sequences. b, The residue embeddings generated from these modules are fed into a third module that learns interhelix compatibility. The full model is trained on two-helix B1H selection data to predict residues in partially masked helix pairs that bind 7-mer nucleotide sequences. In the model architecture schematic, layer normalization is abbreviated to “layer norm.” and concatenation is abbreviated to “concat”.
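The "partially masked helix" inputs mentioned in this caption can be illustrated with a short sketch of how a masked training example might be prepared. This is not the authors' preprocessing code: the function name `mask_helix`, the `?` mask token and the zero-based position indexing are all assumptions made for illustration, and the model itself is not reproduced here.

```python
def mask_helix(helix: str, positions, mask_token: str = "?"):
    """Hide the residues at the given zero-based positions.

    Returns the masked helix and a dict mapping each hidden position to the
    residue a model would be asked to predict there.
    """
    residues = list(helix)
    hidden = {}
    for i in sorted(set(positions)):
        hidden[i] = residues[i]      # remember the true residue as the label
        residues[i] = mask_token     # replace it in the model input
    return "".join(residues), hidden


# Hypothetical example: mask two of the six helical residues.
masked, labels = mask_helix("QSRYTT", [1, 4])
print(masked, labels)
```

In a setup like the one the caption describes, the masked helix would be paired with its target nucleotide sequence (a 4-mer for single helices, a 7-mer for helix pairs) and the model trained to recover the hidden residues.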
Fig. 4. Performance of two-helix design model.
a, Training and validation accuracy during pretraining step. b, Training and validation accuracy during fine-tuning step. c, Helix sequence reconstruction accuracy with different numbers of masked residues. d, Comparison of differences between predicted and real selection logos using the developed model and ZFPred based on the mean-square error (MSE) of predicted position weight matrices (PWMs) to ground-truth PWMs. e, Comparison of differences between predicted and real selection logos using the two-helix model and concatenated logos from the single-helix design model. f, Comparison of differences between predicted and real selection logos using the two-helix model and concatenated logos from single-helix B1H selections. g, Predicted logos, real B1H logos and concatenated single-helix B1H logos for test set sequences.
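The MSE comparison in panel d can be sketched as follows. Each PWM is represented here as a list of rows, one row per helix position and one column per residue. Averaging over every cell of the matrix is an assumption; the caption does not say how the authors normalize the error. The toy matrices use a two-letter alphabet purely to keep the arithmetic easy to check.

```python
def pwm_mse(predicted, observed):
    """Mean squared error between two PWMs of identical shape.

    Each PWM is a list of rows (positions), each row a list of weights.
    """
    if len(predicted) != len(observed):
        raise ValueError("PWMs must have the same number of positions")
    total = 0.0
    count = 0
    for p_row, o_row in zip(predicted, observed):
        if len(p_row) != len(o_row):
            raise ValueError("PWM rows must have the same alphabet size")
        for p, o in zip(p_row, o_row):
            total += (p - o) ** 2
            count += 1
    if count == 0:
        raise ValueError("PWMs must be non-empty")
    return total / count


# Toy example: two positions, two-letter alphabet (not real selection data).
predicted = [[1.0, 0.0], [0.5, 0.5]]
observed = [[0.5, 0.5], [0.5, 0.5]]
print(pwm_mse(predicted, observed))  # (0.25 + 0.25 + 0 + 0) / 4 = 0.125
```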
Fig. 5. Reprogrammed transcription factors (RTFs).
a, Left, the ZFs of KLF6 are seamlessly replaced by designed ZFs. The consensus ZF motif, listed below, is used to guide the seamless replacement of parent ZFs. Right, sequence of the KLF6 TF and precise location of ZF replacement. b, A GFP reporter is activated with four ZF arrays designed to bind the tetO sequence. Array 3 is used to show that TFs other than KLF6 can also be reprogrammed to bind the tetO sequence and regulate the target. c, A GFP reporter is repressed by ZIM3 reprogrammed with tetO-binding array 3. This array can also be used to reprogram other repressing TFs in addition to ZIM3. d, Relative expression of CDKN1C by KLF6 reprogrammed with seven ZF arrays designed to bind sequences upstream of the TSS. e, Relative expression of DPH1 repressed by ZIM3 reprogrammed with 11 ZF arrays designed to bind sequences downstream of the TSS.
Fig. 6. ZF specificity and genome-wide activity.
a, Genome-wide RNA-seq results for CDKN1C arrays 125 and 200 and DPH1 array 15, and comparison with an array that binds the reverse complement of CDKN1C array 125. b, Left, structure of a ZF bound to DNA highlighting two potential phosphate contacts. Right, the human ZF consensus with phosphate-contacting positions highlighted in yellow (−5) and blue (9). c, qPCR comparison for activation of the on-target CDKN1C gene as well as two off-target sequences with CDKN1C array 200 with zero to eight modifications at phosphate-contacting position −5. d, RNA-seq results for CDKN1C array 200 with arginines or glutamines at the −5 position of each ZF. e, On-target qPCR results for arrays with the N-terminal (F3–8) or C-terminal (F1–6) ZF pairs removed compared to an empty vector negative control (neg.). f, Specificity of CDKN1C array 200 with glutamine at position −5 as determined by ChIP–seq, B1H selection at low (5 mM) and high (10 mM) stringency and specificity as predicted by ZFDesign. B1H specificity is a concatenation of the specificities determined for each of the two-finger pairs. ChIP–seq peaks contained two independent motifs, suggesting that the base-skipping linker allows modular, independent binding.
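Panel a compares an array with one that binds the reverse complement of its target. As a reminder of what that operation does to a DNA sequence, here is a minimal sketch; the function name is illustrative and the example uses the CCG triplet mentioned in Fig. 2 rather than any actual array target, which this page does not list.

```python
# Watson-Crick pairing for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    try:
        return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))
    except KeyError as err:
        raise ValueError(f"unexpected base: {err.args[0]!r}") from None


print(reverse_complement("CCG"))
```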
