PLoS Comput Biol. 2005 Jun;1(1):e1.
doi: 10.1371/journal.pcbi.0010001. Epub 2005 Jun 24.

Ab initio prediction of transcription factor targets using structural knowledge

Tommy Kaplan et al. PLoS Comput Biol. 2005 Jun.

Abstract

Current approaches for identification and detection of transcription factor binding sites rely on an extensive set of known target genes. Here we describe a novel structure-based approach applicable to transcription factors with no prior binding data. Our approach combines sequence data and structural information to infer context-specific amino acid-nucleotide recognition preferences. These are used to predict binding sites for novel transcription factors from the same structural family. We demonstrate our approach on the Cys2His2 Zinc Finger protein family, and show that the learned DNA-recognition preferences are compatible with experimental results. We use these preferences to perform a genome-wide scan for direct targets of Drosophila melanogaster Cys2His2 transcription factors. By analyzing the predicted targets along with gene annotation and expression data, we infer the function and activity of these proteins.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

Figure 1. The Canonical Cys2His2 Zinc Finger DNA Binding Model
Residues at positions 6, 3, 2, and −1 (relative to the beginning of the α-helix) at each finger interact with adjacent nucleotides in the DNA molecule (interactions shown with arrows). (Figure adapted from a figure by Prof. Aaron Klug, with permission.)
Figure 2. Estimating DNA-Recognition Preferences
The DNA-recognition preferences are estimated from unaligned pairs of transcription factors and their DNA targets [2] (above). The EM algorithm [13] is used to simultaneously assess the exact binding positions of each protein–DNA pair (bottom right), and to estimate four sets of position-specific DNA-recognition preferences (bottom left).
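
The step of assessing where each protein binds on an unaligned DNA target can be illustrated with a minimal Python sketch of a single E-step. It assumes a fixed motif model, a uniform background, exactly one site per sequence, and a uniform prior over start positions. The function name `binding_position_posterior` and the toy motif values are hypothetical and not taken from the paper; in a full EM procedure these posteriors would weight the counts used to re-estimate the recognition preferences.

```python
def binding_position_posterior(seq, motif, background=0.25):
    """Posterior probability of each motif start position in seq.

    motif: list of dicts mapping 'A', 'C', 'G', 'T' to probabilities.
    Assumes one site per sequence and a uniform prior over start positions.
    """
    width = len(motif)
    likelihood_ratios = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for offset, column in enumerate(motif):
            # Ratio of motif probability to background probability at this base.
            ratio *= column[seq[start + offset]] / background
        likelihood_ratios.append(ratio)
    total = sum(likelihood_ratios)
    return [r / total for r in likelihood_ratios]


# Toy motif strongly preferring "GGG" (illustrative values only).
toy_motif = [{"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}] * 3
posterior = binding_position_posterior("ATGGGCA", toy_motif)
best_start = max(range(len(posterior)), key=posterior.__getitem__)
```
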
Figure 3. Predicting the DNA Binding Site Motifs of Novel Transcription Factors
The protein's DNA-binding domains are identified using the Cys2His2 conserved pattern (top left). The residues at the key positions (6, 3, 2 and −1) of each finger (marked in red in the bottom left panel) are then assigned onto the canonical binding model (bottom right), and the sets of position-specific DNA-recognition preferences (top right panel) are used to construct a probabilistic model of the DNA binding site. For example, the lysine at the sixth position of the third finger faces the first position of the binding site (dotted blue arrow). We predict the nucleotide probabilities at this position using the appropriate recognition preferences (dotted black arrow).
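
As a rough illustration of how key residues might be assembled into a site model, the Python sketch below follows the ordering stated in the caption: position 6 of the last finger faces the first position of the binding site. It is a simplification under stated assumptions: only positions 6, 3, and −1 are mapped (position 2 is omitted for brevity), unknown residues fall back to a uniform distribution, and the names `predict_site_model` and `TOY_PREFERENCES` as well as the numeric values are hypothetical, not the paper's estimates.

```python
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical preferences: (helix position, amino acid) -> nucleotide distribution.
# The single entry reflects lysine's tendency to bind guanine; the numbers are made up.
TOY_PREFERENCES = {
    (6, "K"): {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
}


def predict_site_model(fingers, preferences):
    """Build a list of per-position nucleotide distributions for the binding site.

    fingers: list ordered from first to last finger; each is a dict that maps
    helix positions 6, 3, and -1 to one-letter amino acid codes.
    """
    model = []
    # The last finger faces the start of the site, so walk the fingers in reverse.
    for finger in reversed(fingers):
        for helix_pos in (6, 3, -1):
            residue = finger[helix_pos]
            model.append(preferences.get((helix_pos, residue), UNIFORM))
    return model


# Three hypothetical fingers; the third has lysine (K) at position 6.
toy_fingers = [
    {6: "A", 3: "S", -1: "Q"},
    {6: "T", 3: "N", -1: "D"},
    {6: "K", 3: "S", -1: "Q"},
]
site_model = predict_site_model(toy_fingers, TOY_PREFERENCES)
```

Here `site_model` has nine columns, and its first column is the lysine-at-position-6 distribution, mirroring the example in the caption.
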
Figure 4. Four Sets of Position-Specific DNA-Recognition Preferences in Zinc Fingers
The estimated sets of DNA-recognition preferences for the DNA-binding residues at positions 6, 3, 2, and −1 of the Zinc Finger domain are displayed as sequence logos. At each position, the associated distribution of nucleotides is displayed for each amino acid. The total height of letters represents the information content (in bits) of the position, and the relative height of each letter represents its probability. Color intensity indicates the level of confidence for a given amino acid at a certain position (where paler colors indicate lower confidence due to low occurrences of the amino acid at the specific position in the training data) (see Tables S1 and S2 for full data). Some of the DNA binding preferences are general, regardless of the residue's position within the zinc finger (e.g., lysine's tendency to bind guanine), while others are position-dependent (e.g., the tendency of phenylalanine to bind cytosine only when in position 2).
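
The logo heights described above follow from the standard definition of information content for a four-letter alphabet (2 bits minus the Shannon entropy). The Python sketch below shows that calculation; the function names are hypothetical, and any small-sample correction the authors may have applied is not included.

```python
import math


def information_content(distribution):
    """Information content in bits of a nucleotide distribution: log2(4) - entropy."""
    entropy = -sum(p * math.log2(p) for p in distribution.values() if p > 0)
    return math.log2(4) - entropy


def letter_heights(distribution):
    """Height of each letter in a logo column: its probability times the column's bits."""
    bits = information_content(distribution)
    return {base: p * bits for base, p in distribution.items()}


uniform_bits = information_content({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
certain_bits = information_content({"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0})
split_heights = letter_heights({"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0})
```

A uniform column carries 0 bits and a fully determined column carries 2 bits, so an amino acid with no nucleotide preference would appear as a nearly flat column in the logo.
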
Figure 5. Validation of DNA-Recognition Preferences
(A) The predicted binding site model of human Sp1 protein is compared to its known site (matrix V$SP1_Q6 from TRANSFAC [2], based on 108 aligned binding sites). To prevent bias by known Sp1 sites in our training data, the set of DNA-recognition preferences was estimated from the TRANSFAC data after removing all Sp1 sites. (B) Scanning the 300-bp-long promoter of human dihydrofolate reductase (DHFR) by the predicted Sp1 binding model. The p-value of each potential binding site is shown (y-axis). Four positions achieved a significant p-value (higher than the horizontal red line), out of which three are known Sp1 binding sites [41] (arrows). (C) A summary of in silico binding experiments for 21 pairs of Zinc Finger transcription factors and their target promoters. Shown is the tradeoff between false positive rate (x-axis) and true positive rate (y-axis) as the significance threshold for putative binding sites is changed. For every threshold point, our set of recognition preferences (EM) achieves better accuracy than the preferences of Mandel-Gutfreund et al. [5] (M-G) and Benos et al. [15] (SAMIE). Interestingly, when the DNA-recognition preferences were estimated from training data expanded to include TRANSFAC's artificial sequences, inferior results were obtained (dotted red line). (D) Cumulative distribution of Sp1 scores among the sequences of targets/non-targets of unbiased chromatin immunoprecipitation scans of human Chromosomes 21 and 22 [16]. The predicted Sp1 motif appears in 45% of the experimentally bound sequences but in only 5% of the control sequences.
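
The tradeoff in panel (C) is a receiver operating characteristic curve: each significance threshold yields one false positive rate and one true positive rate. The Python sketch below shows how such points could be computed from scored candidate sites. The function name `roc_points` and the toy scores and labels are hypothetical and do not come from the paper's 21 protein-promoter pairs.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores: higher means a more confident predicted site.
    labels: True for known binding sites, False otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Toy data: five candidate sites, three of them known binding sites.
toy_scores = [9.1, 7.4, 6.8, 3.2, 2.5]
toy_labels = [True, True, False, True, False]
curve = roc_points(toy_scores, toy_labels)
```
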
Figure 6. Inferring the Function and Activity of Zinc Finger Transcription Factors in D. melanogaster
(A) Similar gene annotation enrichment among the putative target sets of 29 transcription factors in D. melanogaster. Blue cells correspond to significant overabundance of a GO term (row) among the predicted targets of a protein (column), using a hyper-geometric test. The binding sites of most factors show enrichment in at least one GO term. For some of the regulators, the enriched GO terms match prior biological knowledge. For example, the putative targets of Glass (gl) were found to be enriched with terms related to photoreceptor cell development (red circle 1). Similarly, the putative targets of Buttonhead (btd) and Sp1 were enriched with developmental terms such as neurogenesis, development, and organogenesis (red circle 2). Closely related GO annotations are not shown; see Figure S4 for full results. (B) Deducing the activity of the 29 transcription factors using gene expression patterns. Expression data from early (0–12 h) embryogenesis [20] and data from the entire Drosophila life cycle [21] are used to test whether the putative direct targets of a regulator are expressed differently than the rest of the genes in a given experiment. Red cells correspond to significant enrichment of overexpressed targets using a Kolmogorov-Smirnov test, while green cells correspond to enrichment of underexpressed targets. For most of the regulators the analysis resulted in at least one significant embryogenesis experiment, suggesting an active role in early developmental stages (above). Similar results were obtained using the full life cycle gene expression data (below).
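
The GO-term overabundance test in panel (A) can be sketched as a one-sided hypergeometric tail probability. The Python example below is a minimal version using only the standard library; the function name and the toy numbers are hypothetical, and any multiple-testing correction the authors applied is not shown.

```python
from math import comb


def hypergeom_enrichment_pvalue(genome_size, annotated, targets, overlap):
    """P(X >= overlap) when `targets` genes are drawn without replacement from
    `genome_size` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, targets)
    tail = sum(
        comb(annotated, k) * comb(genome_size - annotated, targets - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(genome_size, targets)


# Toy case: 10 genes, 3 annotated, 3 predicted targets, all 3 annotated.
p_all_three = hypergeom_enrichment_pvalue(10, 3, 3, 3)
# With overlap 0 the tail covers every outcome, so the probability is 1.
p_trivial = hypergeom_enrichment_pvalue(10, 3, 3, 0)
```
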

References

    1. Stormo GD. DNA binding sites: Representation and discovery. Bioinformatics. 2000;16:16–23.
    2. Wingender E, Chen X, Fricke E, Geffers R, Hehl R, et al. The TRANSFAC system on gene expression regulation. Nucleic Acids Res. 2001;29:281–283.
    3. Luscombe NM, Laskowski RA, Thornton JM. Amino acid–base interactions: A three-dimensional analysis of protein–DNA interactions at an atomic level. Nucleic Acids Res. 2001;29:2860–2874.
    4. Mandel-Gutfreund Y, Margalit H. Quantitative parameters for amino acid–base interaction: Implications for prediction of protein–DNA binding sites. Nucleic Acids Res. 1998;26:2306–2312.
    5. Mandel-Gutfreund Y, Baron A, Margalit H. A structure-based approach for prediction of protein binding sites in gene upstream regions. Pac Symp Biocomput. 2001;2001:139–150.