Comparative Study
Nucleic Acids Res. 2008 Jul;36(12):4137-48.
doi: 10.1093/nar/gkn361. Epub 2008 Jun 13.

Extracting sequence features to predict protein-DNA interactions: a comparative study

Qing Zhou et al. Nucleic Acids Res. 2008 Jul.

Abstract

Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present a systematic study of predictive modeling approaches to the TF-DNA binding problem; these approaches have frequently been shown to be more efficient than methods based only on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework, and influential sequence features are identified by variable selection. We examine several state-of-the-art learning methods, including stepwise linear regression, multivariate adaptive regression splines, neural networks, support vector machines, boosting and Bayesian additive regression trees (BART). These methods are applied to simulated datasets and to two whole-genome ChIP-chip datasets, on the TFs Oct4 and Sox2 respectively, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve predictive power and identify more biologically interesting features, such as TF-TF interactions, than the PWM approach. In particular, BART and boosting show the best and most robust overall performance among all the methods.
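
The regression framework described in the abstract can be illustrated with a minimal sketch: each sequence is reduced to a numeric feature and a binding intensity is regressed on it. Everything concrete here is an assumption for illustration, not the authors' setup: the 4-position PWM, the uniform background, the toy sequences and intensities, and the helper names `max_pwm_score` and `fit_line`. The paper compares far richer learners with many features; this uses a single feature and ordinary least squares only to show the shape of the idea.

```python
import math

# Hypothetical 4-position PWM (base probabilities); not taken from the paper.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # uniform background frequency, an assumption


def max_pwm_score(seq):
    """Best log-odds score of the PWM over all windows of the sequence."""
    width = len(PWM)
    scores = [
        sum(math.log(PWM[j][seq[i + j]] / BACKGROUND) for j in range(width))
        for i in range(len(seq) - width + 1)
    ]
    return max(scores)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy sequences and made-up binding intensities, for illustration only.
sequences = ["CCATGCCC", "GGGGGGGG", "TTATGCAA", "ACACACAC"]
intensities = [2.1, 0.2, 1.9, 0.4]

features = [max_pwm_score(s) for s in sequences]
intercept, slope = fit_line(features, intensities)
```

In a fuller analysis the single feature would be replaced by many candidate features per sequence, with a variable selection step deciding which of them matter.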

Figures

Figure 1.
A regression tree with two interior and three terminal nodes. (A) The decision rules partition the feature space into three disjoint regions: {X1 ≤ c, X2 ≤ d}, {X1 ≤ c, X2 > d} and {X1 > c}. The mean parameters attached to these regions are [formula image]. (B) The piecewise constant function defined by the regression tree with c = 3, d = 2 and [formula image].
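
A tree of this shape can be written as a small function: a first split on X1 at c, then, on the left branch, a second split on X2 at d, with one mean per terminal node. The sketch below uses c = 3 and d = 2 from the caption. The three mean values are placeholders chosen for illustration, since the caption's own values were not recoverable, and the function name `tree_predict` is hypothetical.

```python
def tree_predict(x1, x2, c=3.0, d=2.0, means=(1.0, 2.0, 3.0)):
    """Piecewise constant prediction of a regression tree with two splits.

    means holds one value per terminal node, in the order
    (x1 <= c and x2 <= d), (x1 <= c and x2 > d), (x1 > c).
    The default means are placeholders, not values from the paper.
    """
    mean_low_low, mean_low_high, mean_high = means
    if x1 <= c:
        # Left branch of the root: split again on the second feature.
        return mean_low_low if x2 <= d else mean_low_high
    # Right branch of the root is a terminal node.
    return mean_high
```

Every point in the feature space falls into exactly one of the three regions, so the function returns a single constant per region.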
Figure 2.
The posterior inclusion probability P_in of all the features in descending order in the BART model for the Oct4 data set.
Figure 3.
The histograms of the non-motif features (dark bars) and all the features (light bars) selected in (A) Step-SO and (B) boosting with 100 trees on the Oct4 data set. In Step-SO, selected features are classified into categories by regression P-values. In boosting, they are classified by their relative influence normalized to sum up to 100%.
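
The normalization mentioned for boosting can be sketched as follows: raw per-feature influence scores are rescaled so that they sum to 100%. The raw scores and feature names below are invented, and `normalize_influence` is a hypothetical helper; how the raw scores are computed (for example, by summing error reductions over the splits that use each feature) is left outside this sketch.

```python
def normalize_influence(raw):
    """Rescale non-negative raw influence scores to percentages summing to 100."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("raw influence scores must have a positive sum")
    return {name: 100.0 * value / total for name, value in raw.items()}


# Invented raw scores for three hypothetical features.
raw_scores = {"motif_a": 6.0, "motif_b": 3.0, "gc_content": 1.0}
relative = normalize_influence(raw_scores)
```

Binning these percentages then gives a histogram comparable across models with different numbers of trees.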
Figure 4.
Sensitivity and false positive counts for the BART, boosting and Sox-Oct scan methods in discriminating Oct4-bound sequences in mouse ESCs and random upstream sequences.
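
As a rough illustration of the two quantities in this figure, the sketch below takes model scores for bound sequences and for random sequences, applies a score threshold, and reports the sensitivity on the bound set and the number of false positives in the random set. The scores, the threshold and the function name are all made up for illustration; the paper's actual scoring and cutoffs are not reproduced here.

```python
def sensitivity_and_false_positives(bound_scores, random_scores, threshold):
    """Return (sensitivity, false positive count) at a given score threshold.

    A sequence is called bound when its score is at least the threshold.
    """
    true_positives = sum(score >= threshold for score in bound_scores)
    false_positives = sum(score >= threshold for score in random_scores)
    return true_positives / len(bound_scores), false_positives


# Invented scores for four bound and four random sequences.
bound = [0.9, 0.8, 0.4, 0.7]
background = [0.1, 0.5, 0.6, 0.2]
result = sensitivity_and_false_positives(bound, background, threshold=0.5)
```

Sweeping the threshold over a range of values traces out a sensitivity versus false positive curve for each method.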
Figure 5.
A hypothesis of competitive binding between Sox2 and Gata4/Nkx2.5. In undifferentiated ES cells, Sox2 binds to a regulatory sequence (bracket region) to repress a target gene, while Gata4 and Nkx2.5 are not expressed. Later upon differentiation, Gata4 and Nkx2.5, both highly expressed, out-compete Sox2 to bind to the same region, thus terminating the repression of the downstream gene.
