Nucleic Acids Res. 2015 Feb 18;43(3):1965-84.
doi: 10.1093/nar/gku1395. Epub 2015 Jan 15.

A systematic survey of the Cys2His2 zinc finger DNA-binding landscape

Anton V Persikov et al. Nucleic Acids Res.

Abstract

Cys2His2 zinc fingers (C2H2-ZFs) comprise the largest class of metazoan DNA-binding domains. Despite this domain's well-defined DNA-recognition interface, and its successful use in the design of chimeric proteins capable of targeting genomic regions of interest, much remains unknown about its DNA-binding landscape. To help bridge this gap in fundamental knowledge and to provide a resource for design-oriented applications, we screened large synthetic protein libraries to select binding C2H2-ZF domains for each possible three base pair target. The resulting data consist of >160 000 unique domain-DNA interactions and comprise the most comprehensive investigation of C2H2-ZF DNA-binding interactions to date. An integrated analysis of these independent screens yielded DNA-binding profiles for tens of thousands of domains and led to the successful design and prediction of C2H2-ZF DNA-binding specificities. Computational analyses uncovered important aspects of C2H2-ZF domain-DNA interactions, including the roles of within-finger context and domain position on base recognition. We observed the existence of numerous distinct binding strategies for each possible three base pair target and an apparent balance between affinity and specificity of binding. In sum, our comprehensive data help elucidate the complex binding landscape of C2H2-ZF domains and provide a foundation for efforts to determine, predict and engineer their DNA-binding specificities.

Figures

Figure 1.
Schematic of bacterial one-hybrid protein selections. (A) Schematic of F2 (top) and F3 (bottom) protein selections. Individual C2H2-ZF domains are selected in the context of a protein containing an array of three domains. The fixed C2H2-ZF domains are shown as solid colors while the variable C2H2-ZF domain is shown as a rainbow. Primary contacts with the bases are shown with arrows. The individual selections place a unique 3bp target in the appropriate position, noted as yellow bases (b1, b2 and b3), to assay the interaction of the variable C2H2-ZF domain. Underneath, the bases of the primary strand shown 5′ to 3′ are noted. Above each C2H2-ZF domain, the sequence of the recognition helix is shown N to C, with each variable position shown as a red ‘X’. (B) Schematic of the C2H2-ZF selection. (Top) Proteins are expressed as a 3-fingered protein-direct fusion to the omega subunit of RNA polymerase. C2H2-ZF domains are selected to bind the target sequence placed 10bp upstream of the promoter that drives the reporter genes, HIS3 and URA3, as described in Supplemental Methods 1b. In the example shown, C2H2-ZF domains would be selected from the F3 library to bind the 5′-ACC-3′ (shown in green). (Bottom) Two plasmids, the protein expression vector (here shown from the F3 library) and the target reporter vector, are transformed into the bacterial strain. Double transformants are plated on selective media. DNA is recovered from the cells and the region of the library vector that codes for the variable region is sequenced. Enriched amino acid sequences are shown as a sequence logo.
Figure 2.
Comprehensive protein selections across all 3bp targets. (A) The total numbers of distinct interfaces (i.e. protein-target pairs), proteins, core interfaces (i.e. core sequence-target pairs) and core sequences are listed. These are further separated into sequences recovered per finger position (F2 or F3) and stringency of selection (low or high). Three combinations of these primary data sets are also considered: F2 union, F3 union and the combined F2+F3 data sets. (B) For each variable amino acid position of the selected proteins, we display a boxplot of the Shannon entropy of the distribution of amino acids selected, computed individually across each of the 64 possible DNA targets in the F2+F3 data set. Shown are the median and the interquartile range, with whiskers on the top and bottom representing the maximum and minimum data points within 1.5 times the interquartile range. The higher entropy of the two amino acid positions not included in the canonical binding model (1 and 5) indicates that these positions are more variable than the core positions of the recognition helix (-1,2,3,6) with respect to a particular 3bp DNA target and suggests that amino acids in these non-canonical positions are less important for DNA-binding specificity. Entropy scores are significantly higher for positions 1 and 5 than for position 6, as judged by one-tailed Mann–Whitney U-tests (P < 0.003, P < 2.2e-16, respectively). (C) The total number of distinct sequences recovered from protein selections for each 3bp target, shown 5′ to 3′, for the combined F2+F3 data set. Blue is with respect to only the core positions of the C2H2-ZF domain, while blue plus gray is with respect to all six varied amino acid positions. (D) The normalized mutual information between base and amino acid positions computed on the set of distinct domain-DNA interfaces in the F2+F3 data set.
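The per-position entropy in panel (B) can be illustrated with a minimal sketch. It assumes entropy is measured in bits (log base 2), which the caption does not state for this panel, and it uses a hypothetical toy set of six-residue helix sequences standing in for the proteins selected against a single 3bp target. The function names are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def positional_entropies(sequences):
    """Entropy of the amino acid distribution at each aligned position."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    return [shannon_entropy([seq[i] for seq in sequences]) for i in range(length)]


# Hypothetical sequences for one target (assumed order: positions -1, 1, 2, 3, 5, 6).
toy_helices = ["RSDHLT", "RSDNLT", "RKDHLA", "RADNLT"]
entropies = positional_entropies(toy_helices)

for index, value in enumerate(entropies):
    print(f"column {index}: {value:.3f} bits")
```

In the figure, one such entropy value would be computed per variable position for each of the 64 targets, and the boxplot summarizes those 64 values per position. A column that never varies scores 0 bits, and more variable columns score higher.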
Figure 3.
Visualization of C2H2-ZF domain–DNA interactions. (A) The F2 union data set in a network representation. Each core sequence is shown as a blue circle, with the size of the circle proportional to the frequency with which this sequence appeared across the data set. Each 3bp target is shown as an orange hexagon, with a connection between a target and each core sequence that is bound to it. Core sequences that bound multiple targets are shown on the inside of the circle. For visualization purposes, only core sequences that occur with frequencies greater than 1% are shown. The transparency level of an edge between a core sequence and a 3bp target corresponds to the frequency with which that core sequence was observed in the selection for that 3bp target. (B) The induced sub-network consisting of the targets AAC, CAC, GAC and TAC, and the core sequences bound to them. (C) The frequency weighted overlap (Supplemental Methods 2b) of the core sequences binding each pair of targets for the F2 union data set shown as a heat map, with evident patterns illustrating higher levels of overlap between targets that differ by one nucleotide. A high-resolution version of this figure can be viewed in the Supplementary Material online.
Figure 4.
C2H2-ZF domains designed to bind nearly every 3bp target. Experimentally determined (left) and computationally inferred (right) DNA-binding specificities for 64 designed C2H2-ZF domains, visualized as sequence logos (Supplemental Methods 2c). The amino acid sequence of each tested finger is given below each pair of logos. Experimental specificities were determined for C2H2-ZFs as F2 of a Zif268-based construct (Supplementary Figure S1). Specificities were computationally inferred from binding profiles using our lookup procedure.
Figure 5.
C2H2-ZF activity in yeast. C2H2-ZFs chosen from the CAA and ATG selections were expressed as F2 in yeast and challenged to activate an affinity-related GFP reporter using test binding sites placed at a critical position within the promoter (as described in (17)). (Top) Test C2H2-ZF sequences, listed to the left of each row, were chosen from the CAA selections and challenged to bind the six 3bp targets noted below each column of the chart. Alterations to the preferred target (CAA) are noted by bold, red letters. GFP expression for each protein-3bp target combination is normalized to the expression of the positive control and the data are shown as a heat map. The key denotes normalized GFP expression for protein–DNA interaction as compared to known affinity measures relative to the positive control Z3EV (Zif268) paired with its optimal target. A comparison of each protein–DNA interaction to the key provides an approximation of relative affinity. The B1H produced specificity of each zinc finger domain tested as F2 is displayed (as a sequence logo) to the right of each row. (Bottom) Test C2H2-ZF sequences, listed to the left of each row, were chosen from the ATG (top three rows) and CTG (bottom row) selections and challenged to bind the six 3bp targets noted below each column of the chart. A heat map of normalized GFP expressions for each protein-3bp target combination is shown as in the CAA chart above and B1H produced sequence logos are listed to the right of each row.
Figure 6.
Common versus specialized solutions for binding similar 3bp targets. We assigned each C2H2-ZF sequence recovered from a set of protein selections to its most preferred 3bp target (according to the lookup procedure described in the Materials and Methods section) and subsequently clustered the set of sequences assigned to each 3bp target. Sequence logo representations of the full set of sequences assigned to each of the four similar 3bp targets, CAC, AAC, GAC and TAC (based on data obtained from the F2 union protein selections) are displayed in large boxes and labeled by the target. Logos for clusters of similar sequences assigned to a given 3bp target are pointed to by arrows originating from the corresponding boxed logo. Similar clusters derived from sequences assigned to different targets (shown in the center) suggest ‘common solutions’ for specifying nAC targets. However, ‘specialized solutions’ to each of the individual targets (grouped near the corresponding boxed logo) are also apparent.
Figure 7.
Variation in amino acid-base pairings for C2H2-ZF domain–DNA interactions. (A) All C2H2-ZF domains inferred to prefer CAA (based on F2 union protein selections) represented in sequence logo format (top). Protein sequences were clustered into distinct groups of similar proteins (middle). DNA-binding specificities were experimentally determined for a representative protein from each shown cluster, protein sequence noted below (bottom). (B) A simple code of specificity has been described based on C2H2-ZF selection and structural data (Supplementary Figure S11; (6)). Selected C2H2-ZFs complement and contradict this code. Each row represents an alpha-helix sequence position and each column a predicted base preference. In each box, the residue(s) thought to give the desired base preference is noted in the upper left. The sequence of the finger tested as F2 is noted below each logo with the critical amino acid shown in red. The top sequence of each box is consistent with the simple code. The bottom sequence contradicts the code. The cartoon to the right highlights the row-specific contact with a red arrow and yellow base. (C) DNA-binding specificities were determined for fingers that offer a shared amino acid at position 6 of the alpha-helix. Despite the conserved residue, the complementary base (noted with a red arrow) differs depending upon other residues of the test finger. Asn6, Asp6 and Arg6 examples are shown.
Figure 8.
Positional context in domain–DNA interactions. (A) Weighted fractions of core sequences found in F2 or F3 high stringency selections that are also found in F2 or F3 low stringency selections (Supplemental Methods 2b). Shown left to right are: F2 high in F2 low, F2 high in F3 low, F3 high in F2 low and F3 high in F3 low. For each of these, weighted fractions are computed for each 3bp target and are depicted as boxplots. (B) The DNA-binding specificities of 26 core sequences were tested in both the F2 and F3 positions. Each set of three bars along the x-axis represents the 3bp specificity (5′ to 3′) in both positions for one core sequence. The y-axis represents the difference in the frequency with which a base is observed when comparing the F2 and F3 specificities of that same core sequence. If the base is more commonly observed in the F2 position, the bar is above the x-axis (base indicated by the color key, right). If the base is more commonly observed in the F3 position, the bar is below the axis. The closer this difference is to zero, the more similar the specificities are of the given core sequence in F2 and F3. The first group of core sequences exhibited similar DNA-binding specificities when tested in F2 and F3. The second group exhibited DNA binding when tested in F2 but either no detectable binding or extremely weak binding when tested in F3. The third group exhibited DNA binding when tested in F3 but not when tested in F2. (C) Sequence logos of fingers that function in either F2 (top) or F3 (bottom), with no colony growth or weak (as depicted by a star) DNA-binding specificity observed in the other position. (D) For each of the 166 binding site selections performed in the F2 context and 69 binding site selections performed in the F3 context, we computed the information content (IC) of each experimentally determined base position. 
Specifically, for each base position, we compute the Shannon entropy of the distribution of bases to uncover its variability and subtract this value from the maximum possible value (2 bits) to obtain its IC. For each of these base positions, we depict a side-by-side boxplot of the distribution of ICs across C2H2-ZF sequences tested in F2 and F3. Shown in each boxplot are the median and the interquartile range, with whiskers on the top and bottom representing the maximum and minimum data points within 1.5 times the interquartile range. For each position, IC is significantly lower for F3 binding site selections than for F2 binding site selections (red and blue boxes, respectively) as judged by a Mann–Whitney U-test (P < 10^-8, P < 10^-7 and P < 0.02 for base positions 1, 2 and 3, respectively).
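The IC calculation described in panel (D) can be written out as a short sketch: the Shannon entropy of the base distribution at one position, subtracted from the 2-bit maximum. The function name and the example frequencies below are hypothetical. Input counts or frequencies are normalized before use.

```python
import math

MAX_BITS = 2.0  # maximum entropy of a distribution over four bases


def information_content(base_freqs):
    """IC (bits) of one base position: 2 bits minus the Shannon entropy.

    base_freqs maps each base to a non-negative count or frequency.
    """
    total = sum(base_freqs.values())
    if total <= 0:
        raise ValueError("base frequencies must sum to a positive value")
    entropy = 0.0
    for value in base_freqs.values():
        if value > 0:  # a base that is never observed contributes nothing
            p = value / total
            entropy -= p * math.log2(p)
    return MAX_BITS - entropy


# Hypothetical base distributions at a single position.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
two_way = {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0}
fixed = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}

print(information_content(uniform))  # no base preference
print(information_content(two_way))  # split between two bases
print(information_content(fixed))    # fully specified base
```

A position with no base preference has an IC of 0 bits and a fully specified position has 2 bits, so lower IC values for F3 correspond to less specific base recognition at that position.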
Figure 9.
Performance of the nearest neighbor decomposition (NN) approach within and across positional contexts. Accuracy of predictions using nearest neighbor decomposition based upon either F2 or F3 protein selection data (training sets) when predicting the specificities of C2H2-ZFs experimentally tested in either the F2 or F3 positions (test sets). (A) Fraction of correctly predicted per-nucleotide base preferences, as judged by a Pearson correlation coefficient ≥ 0.5. (B) The fraction of predicted 3bp binding specificities that have 0, 1, 2 or 3 base preferences correctly predicted. For both (A) and (B), shown left to right are performances in predicting DNA-binding specificities tested as: F2 when nearest neighbor uses F2 union protein selection data; F2 when nearest neighbor uses F3 union protein selection data; F3 when nearest neighbor uses F3 union protein selection data; and F3 when nearest neighbor uses F2 union protein selection data.
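The scoring criterion in panels (A) and (B) can be sketched as follows, assuming each base position is represented as a vector of four base frequencies in the order A, C, G, T, and that a position counts as correctly predicted when the Pearson correlation between the predicted and experimental vectors is at least 0.5. The example matrices are hypothetical, and treating a zero-variance vector as a correlation of 0 is an assumption of this sketch, not something the caption specifies.

```python
import math

THRESHOLD = 0.5  # cutoff stated in the caption


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat profile is scored as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def count_correct_positions(predicted, observed):
    """Number of base positions whose prediction meets the threshold."""
    return sum(
        1 for p, o in zip(predicted, observed) if pearson(p, o) >= THRESHOLD
    )


# Hypothetical 3bp specificities; each row holds frequencies for A, C, G, T.
predicted = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.40, 0.10],
]
observed = [
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.20, 0.20, 0.50, 0.10],
]

n_correct = count_correct_positions(predicted, observed)
print(n_correct)
```

Panel (A) would then report the fraction of all positions that pass the threshold, and panel (B) the distribution of the per-finger count (0 to 3) across the tested fingers.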

References

    1. Vaquerizas J.M., Kummerfeld S.K., Teichmann S.A., Luscombe N.M. A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet. 2009;10:252–263.
    2. Tupler R., Perini G., Green M.R. Expressing the human genome. Nature. 2001;409:832–833.
    3. Sommer R.J., Retzlaff M., Goerlich K., Sander K., Tautz D. Evolutionary conservation pattern of zinc-finger domains of Drosophila segmentation genes. Proc. Natl. Acad. Sci. U.S.A. 1992;89:10782–10786.
    4. Myers S., Bowden R., Tumian A., Bontrop R.E., Freeman C., MacFie T.S., McVean G., Donnelly P. Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science. 2010;327:876–879.
    5. Phillips J.E., Corces V.G. CTCF: master weaver of the genome. Cell. 2009;137:1194–1211.
