Ab initio prediction of transcription factor targets using structural knowledge

Tommy Kaplan¹, Nir Friedman, Hanah Margalit

PMID: 16103898
PMCID: PMC1183507
DOI: 10.1371/journal.pcbi.0010001

Tommy Kaplan et al. PLoS Comput Biol. 2005 Jun;1(1):e1. doi: 10.1371/journal.pcbi.0010001. Epub 2005 Jun 24.

Affiliation

¹ School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel.

Abstract

Current approaches for identification and detection of transcription factor binding sites rely on an extensive set of known target genes. Here we describe a novel structure-based approach applicable to transcription factors with no prior binding data. Our approach combines sequence data and structural information to infer context-specific amino acid-nucleotide recognition preferences. These are used to predict binding sites for novel transcription factors from the same structural family. We demonstrate our approach on the Cys₂His₂ Zinc Finger protein family, and show that the learned DNA-recognition preferences are compatible with experimental results. We use these preferences to perform a genome-wide scan for direct targets of Drosophila melanogaster Cys₂His₂ transcription factors. By analyzing the predicted targets along with gene annotation and expression data we infer the function and activity of these proteins.

Conflict of interest statement

Competing interests. The authors have declared that no competing interests exist.

Figures

**Figure 1. The Canonical Cys₂His₂ Zinc Finger DNA Binding Model**
Residues at positions 6, 3, 2, and −1 (relative to the beginning of the α-helix) at each finger interact with adjacent nucleotides in the DNA molecule (interactions shown with arrows). (Figure adapted from a figure by Prof. Aaron Klug, with permission.)

**Figure 2. Estimating DNA-Recognition Preferences**
The DNA-recognition preferences are estimated from unaligned pairs of transcription factors and their DNA targets [2] (above). The EM algorithm [13] is used to simultaneously assess the exact binding positions of each protein–DNA pair (bottom right), and to estimate four sets of position-specific DNA-recognition preferences (bottom left).
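
The E-step of such a procedure, locating the likely binding position within an unaligned DNA target, can be illustrated with a small sketch. This is a simplified illustration and not the authors' implementation: it assumes a fixed site model (one nucleotide distribution per site position), a uniform prior over offsets, and a uniform background outside the site. The function name `offset_posterior` and the toy model are hypothetical.

```python
def offset_posterior(seq, model):
    """Posterior probability of each binding offset in `seq`.

    `model` is a list of dicts mapping nucleotide -> probability, one per
    site position. Assumes a uniform prior over offsets and a uniform
    background, so flanking bases contribute the same factor everywhere.
    All model probabilities are assumed to be strictly positive.
    """
    width = len(model)
    likelihoods = []
    for start in range(len(seq) - width + 1):
        p = 1.0
        for column, nucleotide in zip(model, seq[start:start + width]):
            p *= column[nucleotide]
        likelihoods.append(p)
    total = sum(likelihoods)
    return [p / total for p in likelihoods]


# Toy two-position model that strongly prefers "GG" (illustrative values only).
toy_model = [{"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01}] * 2
posterior = offset_posterior("AAGGA", toy_model)
best_offset = max(range(len(posterior)), key=posterior.__getitem__)
```

In a full EM loop, these posteriors would weight the counts used to re-estimate the recognition preferences in the M-step.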

**Figure 3. Predicting the DNA Binding Site Motifs of Novel Transcription Factors**
The protein's DNA-binding domains are identified using the Cys₂His₂ conserved pattern (top left). The residues at the key positions (6, 3, 2 and −1) of each finger (marked in red in the bottom left panel) are then assigned onto the canonical binding model (bottom right), and the sets of position-specific DNA-recognition preferences (top right panel) are used to construct a probabilistic model of the DNA binding site. For example, the lysine at the sixth position of the third finger faces the first position of the binding site (dotted blue arrow). We predict the nucleotide probabilities at this position using the appropriate recognition preferences (dotted black arrow).
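
The mapping from key residues to a probabilistic site model can be sketched as follows. This is a minimal sketch under stated assumptions: the preference values in `PREFS` are illustrative placeholders (not the estimates from the paper), the example residues are hypothetical, and the contact made by position 2 with the complementary strand is omitted for brevity. The names `peaked`, `PREFS`, and `predict_site_model` are invented for this example.

```python
NUCS = "ACGT"
UNIFORM = {n: 0.25 for n in NUCS}


def peaked(nuc, p=0.85):
    """Placeholder distribution that favors one nucleotide."""
    rest = (1.0 - p) / 3
    return {n: (p if n == nuc else rest) for n in NUCS}


# Hypothetical P(nucleotide | amino acid, helix position); placeholder values.
PREFS = {
    6: {"K": peaked("G"), "R": peaked("G")},
    3: {"H": peaked("G"), "E": peaked("C")},
    -1: {"R": peaked("G"), "Q": peaked("A")},
}


def predict_site_model(fingers):
    """Build one nucleotide distribution per site position.

    `fingers` lists the key residues of each finger, ordered from the
    N-terminal to the C-terminal finger. Following the canonical model,
    the last finger faces the first positions of the site, and within a
    finger the order of contacts is helix position 6, then 3, then -1.
    Residues without a known preference fall back to a uniform column.
    """
    model = []
    for finger in reversed(fingers):
        for helix_pos in (6, 3, -1):
            residue = finger[helix_pos]
            model.append(PREFS[helix_pos].get(residue, UNIFORM))
    return model


# Hypothetical three-finger protein.
fingers = [
    {-1: "R", 3: "E", 6: "R"},
    {-1: "Q", 3: "H", 6: "T"},
    {-1: "R", 3: "H", 6: "K"},
]
model = predict_site_model(fingers)
```

Here the lysine at position 6 of the third finger determines the first column of the model, matching the example in the caption.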

**Figure 4. Four Sets of Position-Specific DNA-Recognition Preferences in Zinc Fingers**
The estimated sets of DNA-recognition preferences for the DNA-binding residues at positions 6, 3, 2, and −1 of the Zinc Finger domain are displayed as sequence logos. At each position, the associated distribution of nucleotides is displayed for each amino acid. The total height of letters represents the information content (in bits) of the position, and the relative height of each letter represents its probability. Color intensity indicates the level of confidence for a given amino acid at a certain position (where paler colors indicate lower confidence due to low occurrences of the amino acid at the specific position in the training data) (see Tables S1 and S2 for full data). Some of the DNA binding preferences are general, regardless of the residue's position within the zinc finger (e.g., lysine's tendency to bind guanine), while others are position-dependent (e.g., the tendency of phenylalanine to bind cytosine only when in position 2).
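
The logo heights described above follow from a standard calculation, sketched here. The sketch assumes four nucleotides with a uniform background and applies no small-sample correction; the function names are hypothetical.

```python
import math


def information_content(dist):
    """Information content in bits: 2 minus the Shannon entropy of `dist`."""
    entropy = -sum(p * math.log2(p) for p in dist.values() if p > 0)
    return 2.0 - entropy


def letter_heights(dist):
    """Height of each letter: its probability times the column's total height."""
    total_height = information_content(dist)
    return {nuc: p * total_height for nuc, p in dist.items()}


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
two_way = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
heights = letter_heights(two_way)
```

A uniform column carries no information and has zero height, while a column split evenly between two nucleotides has a total height of 1 bit.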

**Figure 5. Validation of DNA-Recognition Preferences**
(A) The predicted binding site model of human Sp1 protein is compared to its known site (matrix V$SP1_Q6 from TRANSFAC [2], based on 108 aligned binding sites). To prevent bias by known Sp1 sites in our training data, the set of DNA-recognition preferences was estimated from the TRANSFAC data after removing all Sp1 sites. (B) Scanning the 300-bp-long promoter of human dihydrofolate reductase (DHFR) by the predicted Sp1 binding model. The p-value of each potential binding site is shown (y-axis). Four positions achieved a significant p-value (higher than the horizontal red line), out of which three are known Sp1 binding sites [41] (arrows). (C) A summary of in silico binding experiments for 21 pairs of Zinc Finger transcription factors and their target promoters. Shown is the tradeoff between false positive rate (x-axis) and true positive rate (y-axis) as the significance threshold for putative binding sites is changed. For every threshold point, our set of recognition preferences (EM) achieves better accuracy than the preferences of Mandel-Gutfreund et al. [5] (M-G) and Benos et al. [15] (SAMIE). Interestingly, when the DNA-recognition preferences were estimated from training data expanded to include TRANSFAC's artificial sequences, inferior results were obtained (dotted red line). (D) Cumulative distribution of Sp1 scores among the sequences of targets/non-targets of unbiased chromatin immunoprecipitation scans of human Chromosomes 21 and 22 [16]. The predicted Sp1 motif appears in 45% of the experimentally bound sequences but in only 5% of the control sequences.
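
A promoter scan of the kind shown in panel (B) can be sketched as a sliding-window log-odds score. This is an illustrative simplification, not the scoring used in the paper: it assumes a uniform background, scans the forward strand only, and does not convert scores to p-values. The names `log_odds`, `scan`, and the toy model are hypothetical.

```python
import math


def log_odds(window, model, background=0.25, floor=1e-3):
    """Log2-odds of `window` under the site model versus a uniform background.

    `floor` guards against zero probabilities in the model.
    """
    return sum(
        math.log2(max(column[nuc], floor) / background)
        for column, nuc in zip(model, window)
    )


def scan(seq, model):
    """Score every window of the model's width along the forward strand."""
    width = len(model)
    return [
        (start, log_odds(seq[start:start + width], model))
        for start in range(len(seq) - width + 1)
    ]


def favor(nuc):
    """Toy column that strongly prefers one nucleotide (illustrative values)."""
    return {n: (0.97 if n == nuc else 0.01) for n in "ACGT"}


toy_model = [favor("G"), favor("C"), favor("G")]
hits = scan("ATGCGTA", toy_model)
best_start, best_score = max(hits, key=lambda hit: hit[1])
```

Varying a threshold on such scores and counting known versus spurious sites yields the tradeoff curve shown in panel (C).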

**Figure 6. Inferring the Function and Activity of Zinc Finger Transcription Factors in *D. melanogaster***
(A) Similar gene annotation enrichment among the putative target sets of 29 transcription factors in *D. melanogaster*. Blue cells correspond to significant overabundance of a GO term (row) among the predicted targets of a protein (column), using a hypergeometric test. The binding sites of most factors show enrichment in at least one GO term. For some of the regulators, the enriched GO terms match prior biological knowledge. For example, the putative targets of Glass (gl) were found to be enriched with terms related to photoreceptor cell development (red circle 1). Similarly, the putative targets of Buttonhead (btd) and Sp1 were enriched with developmental terms such as neurogenesis, development, and organogenesis (red circle 2). Closely related GO annotations are not shown; see Figure S4 for full results. (B) Deducing the activity of the 29 transcription factors using gene expression patterns. Expression data from early (0–12 h) embryogenesis [20] and data from the entire *Drosophila* life cycle [21] are used to test whether the putative direct targets of a regulator are expressed differently than the rest of the genes in a given experiment. Red cells correspond to significant enrichment of overexpressed targets using a Kolmogorov-Smirnov test, while green cells correspond to enrichment of underexpressed targets. For most of the regulators the analysis resulted in at least one significant embryogenesis experiment, suggesting an active role in early developmental stages (above). Similar results were obtained using the full life cycle gene expression data (below).
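
The enrichment test in panel (A) can be illustrated with an upper-tail hypergeometric p-value. This is a minimal sketch using only the standard library; the function name, the parameter names, and the example counts are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_genes, n_annotated, n_targets, n_overlap):
    """P(overlap >= n_overlap) when drawing n_targets genes without replacement.

    n_genes:     total number of genes considered
    n_annotated: genes carrying the GO term
    n_targets:   predicted targets of the transcription factor
    n_overlap:   predicted targets that carry the GO term
    """
    upper = min(n_annotated, n_targets)
    favorable = sum(
        comb(n_annotated, i) * comb(n_genes - n_annotated, n_targets - i)
        for i in range(n_overlap, upper + 1)
    )
    return favorable / comb(n_genes, n_targets)


# Toy counts: 10 genes, 5 annotated, 5 targets, all 5 targets annotated.
p_value = hypergeom_enrichment_pvalue(10, 5, 5, 5)
```

A small p-value indicates that the GO term appears among the predicted targets more often than expected by chance.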

References

1. Stormo GD. DNA binding sites: Representation and discovery. Bioinformatics. 2000;16:16–23.
2. Wingender E, Chen X, Fricke E, Geffers R, Hehl R, et al. The TRANSFAC system on gene expression regulation. Nucleic Acids Res. 2001;29:281–283.
3. Luscombe NM, Laskowski RA, Thornton JM. Amino acid–base interactions: A three-dimensional analysis of protein–DNA interactions at an atomic level. Nucleic Acids Res. 2001;29:2860–2874.
4. Mandel-Gutfreund Y, Margalit H. Quantitative parameters for amino acid–base interaction: Implications for prediction of protein–DNA binding sites. Nucleic Acids Res. 1998;26:2306–2312.
5. Mandel-Gutfreund Y, Baron A, Margalit H. A structure-based approach for prediction of protein binding sites in gene upstream regions. Pac Symp Biocomput. 2001;2001:139–150.
