Review
Nucleic Acids Res. 2006 Jul 19;34(12):3585-98.
doi: 10.1093/nar/gkl372. Print 2006.

Computational identification of transcriptional regulatory elements in DNA sequence

Debraj GuhaThakurta. Nucleic Acids Res.

Abstract

Identification and annotation of all the functional elements in the genome, including genes and regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequences from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on recent developments and current challenges.

Figures

Figure 1
The IUPAC (International Union of Pure and Applied Chemistry) code for representing degenerate nucleotide sequence patterns.
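As a rough illustration of how IUPAC degenerate patterns are used in practice, the sketch below expands each IUPAC nucleotide symbol into a regular-expression character class and scans a sequence for matches. The function names, the pattern YCCATTGTTCTC, and the test sequence are all made up for this example; they are not taken from the figure.

```python
import re

# Standard IUPAC nucleotide codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regex string, one character class per symbol."""
    return "".join("[" + IUPAC[symbol] + "]" for symbol in pattern.upper())

def find_sites(pattern, sequence):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # A lookahead with a capturing group lets finditer report overlapping matches.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]

# Hypothetical pattern and sequence, for illustration only.
hits = find_sites("YCCATTGTTCTC", "AATCCATTGTTCTCGG")
print(hits)
```

Exact pattern matching of this kind treats every allowed base at a position as equally good, which is the main limitation that weight-matrix models address.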
Figure 2
(A) The collection of eight known Rox1-binding sites taken from SCPD (47). Scores of the sites are according to the PWM described in (C). (B) Alignment matrix and IUPAC representation of the eight Rox1-binding sites. The cells represent the number of times a base $i$ is observed at position $j$ in the alignment of sites. The frequencies, $f_{i,j}$, of base $i$ at position $j$ of the binding sites can be obtained by dividing the values in the cells of the alignment matrix by the total number of sites, e.g. $f_{C,1} = f_{T,1} = 4/8 = 0.5$. (C) PWM for scoring sequences. Each weight is given by $\log_2(f_{i,j}/P_i)$ (see text), where $P_i$ is the probability of observing the base $i$ in the data; here we have taken $P_A = P_T = 0.32$ and $P_C = P_G = 0.18$ (corresponding to the S. cerevisiae genome). A pseudocount of 1 was added to the alignment before deriving the weights. This matrix was used to score the sites in (A). As an example, the score of the site in red (sequence CCAATTGTTTTG, score 13.87) is given by the summation of the scores that are circled in red. Note that the scores of the two consensus sequences, CCCATTGTTCTC and TCCATTGTTCTC, are different because $P_C \neq P_T$. (D) Sequence logo representation (187) of the alignments, visually showing the IC and conservation at each of the alignment positions. The IC of this matrix is 11.3 bits or 7.83 nats (Equation 1).
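The steps in the Figure 2 caption (count bases per position, convert to frequencies, take log2 of frequency over background, sum weights along a candidate site) can be sketched as follows. This is a minimal illustration under stated assumptions: the four toy sites are invented and are not the Rox1 sites from SCPD, the function names are hypothetical, and the pseudocount is assumed to be added to every cell of the alignment matrix, which the caption does not specify. Only the background probabilities are taken from the caption.

```python
import math

BASES = "ACGT"
# Background base probabilities quoted in the Figure 2 caption (S. cerevisiae).
BACKGROUND = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}

def alignment_matrix(sites):
    """Count how often each base i occurs at each aligned position j."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    counts = {base: [0] * length for base in BASES}
    for site in sites:
        for j, base in enumerate(site):
            counts[base][j] += 1
    return counts

def build_pwm(sites, pseudocount=1.0):
    """Weights log2(f_ij / P_i).

    ASSUMPTION: the pseudocount is added to each of the four cells in a column,
    so f_ij = (n_ij + pseudocount) / (N + 4 * pseudocount).
    """
    counts = alignment_matrix(sites)
    total = len(sites) + pseudocount * len(BASES)
    return {
        base: [math.log2(((n + pseudocount) / total) / BACKGROUND[base])
               for n in counts[base]]
        for base in BASES
    }

def score(pwm, seq):
    """Score a candidate site as the sum of the per-position weights."""
    return sum(pwm[base][j] for j, base in enumerate(seq))

# Invented toy sites, for illustration only (NOT the Rox1 sites in the figure).
TOY_SITES = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = build_pwm(TOY_SITES)
print(round(score(pwm, "ACGT"), 2), round(score(pwm, "TTTT"), 2))
```

Because each weight is divided by the background probability of its base, two sites that differ only by swapping equally frequent bases with different background probabilities receive different scores, which is the effect the caption notes for the two consensus sequences.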
Figure 3
Predicted TF-binding sites in human–mouse conserved regions around the CKM (creatine kinase, muscle) gene. The genomic regions, along with 5 kb upstream and 2 kb downstream, of the CKM gene were extracted from the human and mouse genomes and aligned using the BLASTZ software (151). The BLASTZ alignments were then fed into the rVISTA program (153) through its website. Binding sites for several TFs that are known to regulate gene expression in the muscle tissue were then predicted on the human sequence using the PWM models available from the TRANSFAC database (12). The predicted sites can be dynamically viewed and clustered through the same website. For the purpose of this figure, we required that at least two binding sites belonging to different TFs be present within a window of 100 nt. A cluster of sites was observed in the immediate 5′ upstream region of this gene (boxed). Percent conservation between the two sequences is shown; regions with ≥75% conservation are colored. The human gene structure is shown at the top in blue.
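The clustering requirement in the Figure 3 caption (at least two binding sites from different TFs within a 100 nt window) can be approximated with a simple filter. The sketch below is not the rVISTA implementation; it assumes each predicted site is reduced to a (TF label, start coordinate) pair and that "within a window" means the start coordinates differ by less than the window size. The TF labels and coordinates are placeholders.

```python
from typing import List, Tuple

Site = Tuple[str, int]  # (TF label, start coordinate on the sequence)

def clustered_sites(sites: List[Site], window: int = 100) -> List[Site]:
    """Keep sites that have a site of a DIFFERENT TF starting within `window` nt.

    ASSUMPTION: distance is measured between start coordinates only.
    Quadratic in the number of sites, which is fine for a single locus.
    """
    ordered = sorted(sites, key=lambda site: site[1])
    kept = []
    for i, (tf, pos) in enumerate(ordered):
        for j, (other_tf, other_pos) in enumerate(ordered):
            if i != j and other_tf != tf and abs(other_pos - pos) < window:
                kept.append((tf, pos))
                break
    return kept

# Placeholder predictions, for illustration only.
predicted = [("TF_A", 10), ("TF_B", 60), ("TF_A", 500), ("TF_C", 900), ("TF_C", 950)]
print(clustered_sites(predicted))
```

In this toy input the two TF_C sites lie close together but are discarded, since the criterion requires sites from different TFs.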

References

    1. Collins F.S., Green E.D., Guttmacher A.E., Guyer M.S. A vision for the future of genomics research. Nature. 2003;422:835–847.
    2. Waterston R.H., Lindblad-Toh K., Birney E., Rogers J., Abril J.F., Agarwal P., Agarwala R., Ainscough R., Alexandersson M., An P., et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562.
    3. Chiaromonte F., Weber R.J., Roskin K.M., Diekhans M., Kent W.J., Haussler D. The share of human genomic DNA under selection estimated from human–mouse genomic alignments. Cold Spring Harb. Symp. Quant. Biol. 2003;68:245–254.
    4. Adams M.D., Celniker S.E., Holt R.A., Evans C.A., Gocayne J.D., Amanatides P.G., Scherer S.E., Li P.W., Hoskins R.A., Galle R.F., et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195.
    5. Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., Devon K., Dewar K., Doyle M., FitzHugh W., et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.