PLoS One. 2011 Apr 14;6(4):e18430.
doi: 10.1371/journal.pone.0018430.

A ChIP-Seq benchmark shows that sequence conservation mainly improves detection of strong transcription factor binding sites

Tony Håndstad et al. PLoS One. 2011.

Abstract

Background: Transcription factors are important controllers of gene expression and mapping transcription factor binding sites (TFBS) is key to inferring transcription factor regulatory networks. Several methods for predicting TFBS exist, but there are no standard genome-wide datasets on which to assess the performance of these prediction methods. Also, it is believed that information about sequence conservation across different genomes can generally improve accuracy of motif-based predictors, but it is not clear under what circumstances use of conservation is most beneficial.

Results: Here we use published ChIP-seq data and an improved peak detection method to create comprehensive benchmark datasets for prediction methods which use known descriptors or binding motifs to detect TFBS in genomic sequences. We use this benchmark to assess the performance of five different prediction methods and find that the methods that use information about sequence conservation generally perform better than simpler motif-scanning methods. The difference is greater on high-affinity peaks and when using short and information-poor motifs. However, if the motifs are specific and information-rich, we find that simple motif-scanning methods can perform better than conservation-based methods.

Conclusions: Our benchmark provides a comprehensive test that can be used to rank the relative performance of transcription factor binding site prediction methods. Moreover, our results show that, contrary to previous reports, sequence conservation is better suited for predicting strong than weak transcription factor binding sites.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Defining positive and negative regions for the site benchmark.
The maximum score in each region is used to calculate the ROC curve. In the site benchmark, the negative regions around a peak are further divided into smaller regions of length 200 bp (not shown). The promoter benchmark is based on the same principle as the site benchmark, but the test regions are then derived from regions surrounding gene transcription start sites and from first introns, and the negative regions are not further divided into smaller regions.
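The region-level scoring described in this caption can be illustrated with a minimal sketch. It assumes a list of per-position motif scores and hypothetical positive and negative regions given as (start, end) intervals; the function names and the toy data are illustrative and are not taken from the paper. The AUC is computed as the fraction of positive/negative pairs in which the positive region scores higher, which is equivalent to the area under the ROC curve.

```python
def region_max_scores(scores, regions):
    """Return the maximum per-position score inside each (start, end) region."""
    return [max(scores[start:end]) for start, end in regions]


def roc_auc(pos, neg):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-position motif scores along a short toy sequence.
scores = [0.1, 0.9, 0.2, 0.3, 0.4, 0.2, 0.8, 0.1, 0.5, 0.3]
positive_regions = [(0, 3), (5, 8)]   # stand-ins for peak regions
negative_regions = [(3, 5), (8, 10)]  # stand-ins for flanking regions

pos = region_max_scores(scores, positive_regions)
neg = region_max_scores(scores, negative_regions)
auc = roc_auc(pos, neg)
```

Reducing each region to its maximum score means a region counts as a hit if any single position in it scores highly, regardless of region length.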
Figure 2. Cumulative ROC score on site and promoter benchmarks.
The cumulative number of TF datasets for which a method has a ROC AUC of at least a given value on the A) site and B) promoter benchmark. Each line represents a method and shows, on the y-axis, how many datasets have at least the ROC score given on the x-axis. The ROC score, or area under the ROC curve (AUC), is a measure of accuracy that summarizes the true-positive and false-positive rates and the implied trade-offs at all score thresholds.
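As a rough illustration of how such a cumulative curve could be tabulated for one method, the sketch below counts datasets whose AUC reaches each threshold. The dataset names and AUC values are made up for the example, and the function name is hypothetical.

```python
def cumulative_counts(auc_by_dataset, thresholds):
    """For each threshold, count the datasets whose AUC is at least that value."""
    return [
        sum(1 for auc in auc_by_dataset.values() if auc >= t)
        for t in thresholds
    ]


# Made-up AUC values for one prediction method on four toy datasets.
aucs = {"TF_A": 0.62, "TF_B": 0.71, "TF_C": 0.85, "TF_D": 0.55}
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]

counts = cumulative_counts(aucs, thresholds)
```

A method whose curve stays higher for longer along the x-axis reaches a given AUC on more datasets.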
Figure 3. ROC score correlates with motif length and information content.
A) ROC score for PWM scanning as a function of motif length. B) ROC score for PWM scanning as a function of motif information content. Longer, information-rich motifs achieve better scores. Note that YY1 has the second longest motif (V$YY1_01), but this motif also has the second lowest information content, which likely explains its lower score compared to the most information-rich motif (V$NRSF_Q4). C) ROC curves for all methods on the E2F4 dataset in the promoter benchmark. The V$E2F_Q2 motif is one of the least informative motifs and the performance of the prediction methods on the E2F4 dataset is relatively low. D) ROC curves for all methods on the NRSF dataset in the promoter benchmark. The V$NRSF_Q4 motif is the most informative motif and the NRSF dataset is among the highest scoring datasets.
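The two quantities compared in this figure, motif information content and PWM scanning score, can be sketched as follows. This is a generic textbook formulation, assuming a uniform background of 0.25 per base and a small pseudocount; the paper's exact scoring scheme is not given in this excerpt, and the toy motif and sequence are invented for illustration.

```python
import math


def information_content(pfm, background=0.25):
    """Total information content in bits, summed over motif columns."""
    total = 0.0
    for column in pfm:
        for p in column.values():
            if p > 0:
                total += p * math.log2(p / background)
    return total


def max_pwm_score(sequence, pfm, background=0.25, pseudo=0.01):
    """Slide the motif along the sequence; return (best log-odds score, start)."""
    width = len(pfm)
    best_score, best_pos = -math.inf, -1
    for i in range(len(sequence) - width + 1):
        score = 0.0
        for j in range(width):
            # Pseudocount-smoothed probability of the observed base (assumption).
            p = (pfm[j].get(sequence[i + j], 0.0) + pseudo) / (1 + 4 * pseudo)
            score += math.log2(p / background)
        if score > best_score:
            best_score, best_pos = score, i
    return best_score, best_pos


# One fully conserved column (2 bits) plus one uniform column (0 bits).
toy_pfm = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
ic = information_content(toy_pfm)

# Invented 3-column motif that strongly prefers "GAT".
gat_pfm = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
]
best_score, best_pos = max_pwm_score("CCGATTCC", gat_pfm)
```

A short or degenerate motif has few bits to distinguish true sites from background, so chance matches score nearly as well as real ones, which is consistent with the lower ROC scores reported for less informative motifs.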
Figure 4. Max PWM score and phyloP values correlate with center of peak regions.
The figures show a region of 500 bp surrounding each peak region. The left panels show, for each of the 500 positions, the number of times that position has the maximum PWM score in the 500 bp region. The right panels show the average phyloP score at each position. The grey lines show the average peak width. Both the maximum PWM score and higher phyloP values tend to cluster in the center of the peak regions, but the degree of clustering varies between TFs.
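The two per-position summaries described in this caption can be sketched on toy data. The sketch assumes equal-length score tracks, one per peak region; the three-position tracks below stand in for the 500 bp windows and are invented for illustration, as are the function names.

```python
from collections import Counter


def argmax_position_counts(score_tracks):
    """Count how often each position holds the maximum score in its track."""
    counts = Counter()
    for track in score_tracks:
        # On ties, the first (leftmost) maximal position is used.
        best = max(range(len(track)), key=track.__getitem__)
        counts[best] += 1
    return counts


def mean_per_position(tracks):
    """Average the tracks position by position."""
    n = len(tracks)
    return [sum(column) / n for column in zip(*tracks)]


# Invented PWM score tracks for three toy regions of length 3.
pwm_tracks = [[0.1, 0.9, 0.2], [0.3, 0.8, 0.1], [0.7, 0.2, 0.1]]
max_counts = argmax_position_counts(pwm_tracks)

# Invented conservation tracks for two toy regions of length 3.
cons_tracks = [[0.0, 1.0, 0.0], [0.0, 3.0, 2.0]]
mean_cons = mean_per_position(cons_tracks)
```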
Figure 5. ROC scores for PWM and BBLS PWM on low and high peaks.
ROC scores on each TF promoter dataset for the PWM and BBLS PWM methods on the lowest peaks (formula image percentile) and the highest peaks (formula image percentile). The difference between PWM and the conservation-based BBLS PWM method is generally greater, and in favor of BBLS PWM, on the higher peaks than on the lower peaks.
Figure 6. Distribution of phyloP scores in lowest and highest peaks.
Boxplot showing, for each TF, the averaged phyloP scores in promoter peak regions for the lowest peaks (formula image percentile) and the highest peaks (formula image percentile). The higher peaks generally show higher sequence conservation across genomes.
