Genome Res. 2010 Apr;20(4):526-36.
doi: 10.1101/gr.096305.109. Epub 2010 Mar 10.

Integrating multiple evidence sources to predict transcription factor binding in the human genome

Jason Ernst et al. Genome Res. 2010 Apr.

Abstract

Information about the binding preferences of many transcription factors is known and characterized by a sequence binding motif. However, determining the regions of the genome in which a transcription factor binds based on its motif is a challenging problem, particularly in species with large genomes, since there are often many sequences that contain matches to the motif but are not bound. Several rules based on sequence conservation or location relative to a transcription start site have been proposed to help differentiate true binding sites from random ones. Other evidence sources may also be informative for this task. We developed a method for integrating multiple evidence sources using logistic regression classifiers. Our method works in two steps. First, we infer a score quantifying the general binding preference (GBP) for transcription factor binding at all locations based on a large set of evidence features, without using any motif-specific information. Then, we combine this general binding preference score with motif information for specific transcription factors to improve prediction of regions bound by the factor. Using cross-validation and new experimental data, we show that, surprisingly, the general binding preference can be highly predictive of true locations of transcription factor binding even when no binding motif is used. When combined with motif information, our method outperforms previous methods for predicting locations of true binding.
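
The two-step idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the three evidence features and the motif score are hypothetical stand-ins, and plain batch gradient descent is assumed as the fitting procedure. All function and variable names (`fit_logistic`, `predict`, `gbp`, `combined`) are made up for this illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """Predicted probability for one feature vector x (w[0] is the bias)."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.3, epochs=300):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def mean(values):
    return sum(values) / len(values)


# Synthetic regions: an assumption for illustration, not data from the paper.
rng = random.Random(0)
evidence, motif, bound_any, bound_factor = [], [], [], []
for _ in range(400):
    accessible = rng.random() < 0.5  # hidden state: region is generally bindable
    m = rng.gauss(0.0, 1.0)          # hypothetical motif match score
    # Three hypothetical evidence features, shifted upward in bindable regions.
    evidence.append([rng.gauss(1.5 if accessible else 0.0, 1.0) for _ in range(3)])
    motif.append(m)
    bound_any.append(1 if accessible else 0)
    bound_factor.append(1 if accessible and m > 0 else 0)

# Step 1: general binding preference from evidence features only (no motif).
w_gbp = fit_logistic(evidence, bound_any)
gbp = [predict(w_gbp, x) for x in evidence]

# Step 2: combine the GBP score with the motif score for one specific factor.
combined_X = [[g, m] for g, m in zip(gbp, motif)]
w_combined = fit_logistic(combined_X, bound_factor)
combined = [predict(w_combined, x) for x in combined_X]

mean_bound = mean([p for p, b in zip(combined, bound_factor) if b == 1])
mean_unbound = mean([p for p, b in zip(combined, bound_factor) if b == 0])
print("mean combined score, bound regions:  ", round(mean_bound, 3))
print("mean combined score, unbound regions:", round(mean_unbound, 3))
```

The sketch scores the same regions it was trained on, which is only acceptable for a toy demonstration; the abstract reports results from cross-validation and new experimental data.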

Figures

Figure 1.
Illustrative examples of the GBP of transcription factor binding. The GBP is viewed using a custom track of the UCSC Genome Browser (Kent et al. 2002). (Top) A 250,000-bp region of chromosome 20 shows the GBP for transcription factor binding. Gene locations are displayed below the plot of the GBP. Most of the peaks in this image correspond to a RefSeq transcription start site. The peak in the oval labeled with a 2 does not, but is a DNase I hypersensitive region (Boyle et al. 2008). (Bottom) A zoomed-in view of the peak in the oval labeled with a 1 from the top panel, which is near the transcription start site of C20orf24. The exons of C20orf24 have lower probability than the immediately surrounding bases.
Figure 2.
The ability of the GBP to differentiate between reported bound sites and random sites. ROC curves for a number of different methods for predicting bound locations. (X-axis) False-positive rate; (y-axis) true-positive rate. Results of predictions made by our method using cross-validation analysis for this factor (dashed line); expected performance of a random guess (solid line). Also plotted are the ROC curve for a feature based on histone modifications (dotted line) and a point for the 3′ UTR feature. These were selected since they achieved the highest and lowest average AUC values, respectively. An extended version of this plot with additional features can be found in Supplemental Figure 2.
Figure 3.
Comparison of average AUC values for our GBP and individual features. This graph compares the average AUC value obtained for each individual feature across all 14 data sets with the cross-validation AUC value obtained when combining the features using our method. The graph shows that the highest average AUC value is obtained when combining all features using our method. The individual values that were used to compute these averages can be found in Supplemental Table 1.
Figure 4.
Comparison of AUC values for predicting if a transcription factor binding site lies within 10,000 bases of a RefSeq transcription start site. Shown are the AUC values for the ROC curves in Supplemental Figure 3. The leftmost bars plot the average over all data sets. As can be seen, by combining the GBP with the motif scanning score, we improve the prediction of regions bound by specific transcription factors.
Figure 5.
Results for predicting targets of E2F2 and E2F4 using new ChIP-chip experiments. The chart shows a comparison of five methods for the task of predicting gene targets of the E2F family of transcription factors based on 13 different ChIP-chip experiments. For each method, an AUC value was computed for each of the 13 experiments. The experiments were ordered in descending order of the average AUC value across the five methods considered. The x-axis shows the position in this ordering, and the y-axis shows the AUC value corresponding to the position. The plot shows that the methods that jointly use the GBP and motif have a higher AUC value at each rank position than methods that use only the GBP or only the motif information. The experiments to which these positions correspond can be found in Supplemental Table 9.
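
The figure captions above compare methods by ROC curves and AUC values. As a reminder of how those quantities are computed, here is a minimal sketch that sweeps a threshold over prediction scores to obtain false-positive and true-positive rates, then integrates the curve with the trapezoidal rule. The function names `roc_curve` and `auc` and the example scores are assumptions made for illustration, not taken from the paper.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists obtained by sweeping a threshold from high to low.

    labels are 1 for bound (positive) sites and 0 for unbound (negative) sites.
    Tied scores are consumed together so they contribute a single ROC point.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
        for k in range(1, len(fpr))
    )


# Hypothetical scores for six candidate sites (three bound, three unbound).
example_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
example_labels = [1, 1, 0, 1, 0, 0]
example_fpr, example_tpr = roc_curve(example_scores, example_labels)
example_auc = auc(example_fpr, example_tpr)
print("AUC:", round(example_auc, 3))
```

An AUC of 0.5 corresponds to the random-guess diagonal mentioned in the Figure 2 caption, and 1.0 corresponds to a method that ranks every bound site above every unbound one.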

References

    1. Bar-Joseph Z, Gerber G, Lee T, Rinaldi N, Yoo J, Robert F, Gordon B, Fraenkel E, Jaakkola T, Young R, et al. Computational discovery of gene modules and regulatory networks. Nat Biotechnol. 2003;21:1337–1342.
    2. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–837.
    3. Benson G. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–580.
    4. Berger MF, Badis G, Gehrke AR, Talukder S, Philippakis AA, Peña-Castillo L, Alleyne TM, Mnaimneh S, Botvinnik OB, Chan ET, et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell. 2008;133:1266–1276.
    5. Beyer A, Workman C, Hollunder J, Radke D, Möller U, Wilhelm T, Ideker T. Integrated assessment and prediction of transcription factor binding. PLoS Comput Biol. 2006;2:e70. doi: 10.1371/journal.pcbi.0020070.
