Comparative Study

Nucleic Acids Res. 2008 Jul;36(12):4137-48.

doi: 10.1093/nar/gkn361. Epub 2008 Jun 13.

Extracting sequence features to predict protein-DNA interactions: a comparative study

Qing Zhou¹, Jun S Liu

PMID: 18556756
PMCID: PMC2475627
DOI: 10.1093/nar/gkn361

Qing Zhou et al. Nucleic Acids Res. 2008 Jul.

Affiliation

¹ Department of Statistics, University of California, Los Angeles, CA 90095, USA. zhou@stat.ucla.edu

Abstract

Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present here a systematic study of predictive modeling approaches to the TF-DNA binding problem, which have frequently been shown to be more efficient than methods based only on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework, and influential sequence features are identified by variable selection. We examine several state-of-the-art learning methods, including stepwise linear regression, multivariate adaptive regression splines, neural networks, support vector machines, boosting and Bayesian additive regression trees (BART). These methods are applied to both simulated datasets and two whole-genome ChIP-chip datasets on the TFs Oct4 and Sox2, respectively, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve predictive power and identify more biologically interesting features, such as TF-TF interactions, than the PWM approach. In particular, BART and boosting show the best and most robust overall performance among all the methods.
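The regression idea in the abstract can be illustrated with a minimal sketch: a sequence feature is extracted from each DNA sequence and a binding intensity is regressed on it. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 3-position PWM, the uniform background, the toy sequences and intensities, and the helper names `pwm_feature` and `fit_line`. The study itself compares far richer learners (stepwise regression, boosting, BART and others) over many features with variable selection, whereas this sketch uses a single feature and ordinary least squares.

```python
import math

# Hypothetical PWM: one dict of base probabilities per motif position.
# These numbers are invented for illustration only.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
BACKGROUND = 0.25  # uniform background frequency (an assumption)


def pwm_feature(seq):
    """Return the best log-odds PWM match over all windows of seq."""
    width = len(PWM)
    scores = [
        sum(math.log(PWM[j][seq[i + j]] / BACKGROUND) for j in range(width))
        for i in range(len(seq) - width + 1)
    ]
    return max(scores)


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy data (made up): sequences and their binding intensities.
sequences = ["CCATGCC", "ATGATGA", "CCCCCCC", "TTTTCCC"]
intensities = [2.1, 2.3, 0.2, 0.4]

features = [pwm_feature(s) for s in sequences]
intercept, slope = fit_line(features, intensities)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

With more than one candidate feature, the same response would be regressed on a feature matrix, and a selection step would decide which columns to keep.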

Figures

**Figure 1.**
A regression tree with two interior and three terminal nodes. (A) The decision rules partition the feature space into three disjoint regions: {X₁ ≤ c, X₂ ≤ d}, {X₁ ≤ c, X₂ > d} and {X₁ > c}. A mean parameter is attached to each of these regions. (B) The piece-wise constant function defined by the regression tree with c = 3 and d = 2.
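As a rough illustration of the tree in this caption, the sketch below evaluates the piece-wise constant function it defines. The split points c = 3 and d = 2 come from the caption. The three region means are not available in this record, so the default `means` values are placeholders, and the function name `tree_predict` is likewise hypothetical.

```python
def tree_predict(x1, x2, c=3.0, d=2.0, means=(1.0, 2.0, 3.0)):
    """Evaluate a regression tree with two interior and three terminal nodes.

    The default means are placeholder values, not the ones used in the figure.
    """
    mu1, mu2, mu3 = means
    if x1 <= c:        # first interior node: split on X1 at c
        if x2 <= d:    # second interior node: split on X2 at d
            return mu1  # region {X1 <= c, X2 <= d}
        return mu2      # region {X1 <= c, X2 > d}
    return mu3          # region {X1 > c}


# One point from each of the three regions.
for point in [(1.0, 1.0), (1.0, 5.0), (4.0, 0.0)]:
    print(point, tree_predict(*point))
```

Every input falls into exactly one region, so the prediction is constant within each region and jumps only at the split points.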

**Figure 2.**
The posterior inclusion probability P_in of all the features in descending order in the BART model for the Oct4 data set.

**Figure 3.**
The histograms of the non-motif features (dark bars) and all the features (light bars) selected in (A) Step-SO and (B) boosting with 100 trees on the Oct4 data set. In Step-SO, selected features are classified into categories by regression P-values. In boosting, they are classified by their relative influence normalized to sum up to 100%.

**Figure 4.**
Sensitivity and false positive counts for the BART, boosting and Sox-Oct scan methods in discriminating Oct4-bound sequences in mouse ESCs and random upstream sequences.

**Figure 5.**
A hypothesis of competitive binding between Sox2 and Gata4/Nkx2.5. In undifferentiated ES cells, Sox2 binds to a regulatory sequence (bracket region) to repress a target gene, while Gata4 and Nkx2.5 are not expressed. Later upon differentiation, Gata4 and Nkx2.5, both highly expressed, out-compete Sox2 to bind to the same region, thus terminating the repression of the downstream gene.
