Review

Nucleic Acids Res. 2006 Jul 19;34(12):3585-98.

doi: 10.1093/nar/gkl372. Print 2006.

Computational identification of transcriptional regulatory elements in DNA sequence

Debraj GuhaThakurta¹

PMID: 16855295
PMCID: PMC1524905
DOI: 10.1093/nar/gkl372

Debraj GuhaThakurta. Nucleic Acids Res. 2006.

Affiliation

¹ Research Genetics Division, Rosetta Inpharmatics LLC, Merck & Co., Inc, 401 Terry Avenue North, Seattle, WA 98109, USA. debraj_guhathakurta@merck.com

Abstract

Identification and annotation of all the functional elements in the genome, including genes and regulatory sequences, is a fundamental challenge in genomics and computational biology. Because regulatory elements are frequently short and variable, identifying them with computational algorithms is difficult. However, significant advances have been made in computational methods for modeling and detecting DNA regulatory elements. The availability of complete genome sequences from multiple organisms, together with mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, has contributed to the development of methods that use these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher-order structures of regulatory sequences, which is essential to understanding transcription regulation in metazoan genomes. This article reviews computational approaches for modeling and identifying genomic regulatory elements, with an emphasis on recent developments and current challenges.

Figures

**Figure 1**
The IUPAC (International Union of Pure and Applied Chemistry) code for representing degenerate nucleotide sequence patterns.

**Figure 2**
(A) The collection of eight known Rox1-binding sites taken from SCPD (47). The sites are scored with the PWM described in (C). (B) Alignment matrix and IUPAC representation of the eight Rox1-binding sites. Each cell gives the number of times base i is observed at position j in the alignment of sites. The frequencies, f_i,j, of base i at position j of the binding sites are obtained by dividing the values in the cells of the alignment matrix by the total number of sites, e.g. f_C,1 = f_T,1 = 4/8 = 0.5. (C) PWM for scoring sequences. Each weight is given by log₂(f_i,j/P_i) (see text), where P_i is the probability of observing base i in the data; here we have taken P_A = P_T = 0.32 and P_C = P_G = 0.18 (corresponding to the *S. cerevisiae* genome). A pseudocount of 1 was added to the alignment before deriving the weights. This matrix was used to score the sites in (A). As an example, the score of the site in red (sequence CCAATTGTTTTG, score 13.87) is the sum of the scores circled in red. Note that the scores of the two consensus sequences, CCCATTGTTCTC and TCCATTGTTCTC, differ because P_C ≠ P_T. (D) Sequence logo representation (187) of the alignments, visually showing the IC and conservation at each of the alignment positions. The IC of this matrix is 11.3 bits or 7.83 nats (Equation 1).
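The steps in the Figure 2 caption (count bases per position, convert counts to frequencies, take log₂(f_i,j/P_i) weights, sum the weights along a sequence) can be sketched in Python as follows. This is a minimal illustration, not the article's implementation. The background probabilities are the ones quoted in the caption; `TOY_SITES` is a hypothetical set of short aligned sites invented for the example and is not the Rox1 collection from SCPD. How the pseudocount of 1 is shared among the four bases is not stated in the caption, so splitting it in proportion to the background probabilities is an assumption.

```python
import math

# Background base probabilities quoted in the caption (S. cerevisiae genome).
BACKGROUND = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}

# Hypothetical aligned sites for illustration only (NOT the Rox1 sites).
TOY_SITES = ["CCATTG", "TCATTG", "CCATTC", "TCAATG"]


def alignment_matrix(sites):
    """Count how often each base occurs at each position of the aligned sites."""
    counts = [{base: 0 for base in BACKGROUND} for _ in range(len(sites[0]))]
    for site in sites:
        for j, base in enumerate(site):
            counts[j][base] += 1
    return counts


def pwm_from_counts(counts, n_sites, pseudocount=1.0):
    """Return per-position weights log2(f_ij / P_i).

    Assumption: the pseudocount is split among bases in proportion to the
    background probabilities, so f_ij = (n_ij + pseudocount * P_i) / (N + pseudocount).
    """
    pwm = []
    for column in counts:
        weights = {}
        for base, p in BACKGROUND.items():
            f = (column[base] + pseudocount * p) / (n_sites + pseudocount)
            weights[base] = math.log2(f / p)
        pwm.append(weights)
    return pwm


def score(pwm, seq):
    """Score a sequence as the sum of the weights of its bases, one per position."""
    return sum(column[base] for column, base in zip(pwm, seq))


counts = alignment_matrix(TOY_SITES)
pwm = pwm_from_counts(counts, len(TOY_SITES))
for site in TOY_SITES:
    print(site, round(score(pwm, site), 2))
```

In the toy alignment, C and T are equally frequent at the first position, yet a leading C scores higher than a leading T because P_C < P_T. This mirrors the caption's remark about the two consensus sequences.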

**Figure 3**
Predicted TF-binding sites in human–mouse conserved regions around the CKM (creatine kinase, muscle) gene. The genomic regions, along with 5 kb upstream and 2 kb downstream, of the CKM gene were extracted from the human and mouse genomes and aligned using the BLASTZ software (151). The BLASTZ alignments were then fed into the rVISTA program (153) through the website . Binding sites for several TFs that are known to regulate gene expression in the muscle tissue were then predicted on the human sequence using the PWM models available from the TRANSFAC database (12). The predicted sites can be dynamically viewed and clustered through the above website. For the purpose of this current figure, we required that at least two binding sites belonging to different TFs be present within a window of 100 nt. A cluster of sites was observed in the immediate 5′ upstream region of this gene (boxed). Percent conservation between the two sequences is shown; regions with ≥75% conservation are colored. The human gene structure is shown at the top in blue.
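The clustering rule in the Figure 3 caption (at least two binding sites from different TFs within a 100 nt window) can be sketched as a simple window scan. The code below is a hedged illustration, not the rVISTA procedure. The function `find_clusters`, the TF labels and the positions are all hypothetical, and measuring the window between site start positions while ignoring site length is an assumption.

```python
def find_clusters(sites, window=100, min_tfs=2):
    """Group predicted sites that fall within `window` nt and involve >= `min_tfs` TFs.

    `sites` is a list of (tf_name, start_position) tuples. The window is
    measured between start positions (an assumption; site length is ignored).
    """
    ordered = sorted(sites, key=lambda s: s[1])
    clusters = []
    for i, (_, start) in enumerate(ordered):
        # All sites whose start lies within `window` nt of this site's start.
        group = [s for s in ordered[i:] if s[1] - start < window]
        if len({tf for tf, _ in group}) < min_tfs:
            continue
        # Skip groups already contained in the previously reported cluster.
        if clusters and set(group) <= set(clusters[-1]):
            continue
        clusters.append(group)
    return clusters


# Hypothetical predicted sites: (TF label, start position on the sequence).
toy_sites = [
    ("TF_A", 120), ("TF_B", 150),   # two different TFs, 30 nt apart
    ("TF_A", 900), ("TF_A", 950),   # same TF twice: not a cluster
    ("TF_C", 2000), ("TF_B", 2090), # two different TFs, 90 nt apart
    ("TF_C", 2400),                 # isolated site
]

clusters = find_clusters(toy_sites)
for cluster in clusters:
    print(cluster)
```

With these toy inputs, only the pairs near positions 120 and 2000 qualify. The two TF_A sites are close together but belong to a single TF, so they are not reported.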

Cited by

CLIMP: Clustering Motifs via Maximal Cliques with Parallel Computing Design.
Zhang S, Chen Y. PLoS One. 2016 Aug 3;11(8):e0160435. doi: 10.1371/journal.pone.0160435. eCollection 2016. PMID: 27487245 Free PMC article.
Accurate recognition of cis-regulatory motifs with the correct lengths in prokaryotic genomes.
Li G, Liu B, Xu Y. Nucleic Acids Res. 2010 Jan;38(2):e12. doi: 10.1093/nar/gkp907. Epub 2009 Nov 11. PMID: 19906734 Free PMC article.
Improving MEME via a two-tiered significance analysis.
Tanaka E, Bailey TL, Keich U. Bioinformatics. 2014 Jul 15;30(14):1965-73. doi: 10.1093/bioinformatics/btu163. Epub 2014 Mar 24. PMID: 24665130 Free PMC article.
SPIC: a novel similarity metric for comparing transcription factor binding site motifs based on information contents.
Zhang S, Zhou X, Du C, Su Z. BMC Syst Biol. 2013;7 Suppl 2(Suppl 2):S14. doi: 10.1186/1752-0509-7-S2-S14. Epub 2013 Dec 17. PMID: 24564945 Free PMC article.
In Silico Prediction of Transcription Factor Collaborations Underlying Phenotypic Sexual Dimorphism in Zebrafish (Danio rerio).
Hosseini S, Schmitt AO, Tetens J, Brenig B, Simianer H, Sharifi AR, Gültas M. Genes (Basel). 2021 Jun 7;12(6):873. doi: 10.3390/genes12060873. PMID: 34200177 Free PMC article.
