Genome Res. 1998 Nov;8(11):1202-15.
doi: 10.1101/gr.8.11.1202.

Predicting gene regulatory elements in silico on a genomic scale

A Brazma et al. Genome Res. 1998 Nov.

Abstract

We performed a systematic analysis of gene upstream regions in the yeast genome for occurrences of regular expression-type patterns, with the goal of identifying potential regulatory elements. To achieve this goal, we developed a new sequence pattern discovery algorithm that searches exhaustively for a priori unknown regular expression-type patterns that are over-represented in a given set of sequences. We applied the algorithm in two cases: (1) discovery of patterns in the complete set of >6000 sequences taken upstream of the putative yeast genes, and (2) discovery of patterns in the regions upstream of genes with similar expression profiles. In the first case, we looked for patterns that occur more frequently in the gene upstream regions than in the genome overall. In the second case, we first clustered the upstream regions of all the genes by similarity of their expression profiles, on the basis of publicly available gene expression data, and then looked for sequence patterns that are over-represented in each cluster. In both cases we considered each pattern that occurred in at least some minimum number of sequences, and rated the patterns on the basis of their over-representation. Among the highest-rated patterns, most have matches to substrings in known yeast transcription factor-binding sites. Moreover, several of them are known to be relevant to the expression of the genes from the respective clusters. Experiments on simulated data show that the majority of the discovered patterns are not expected to occur by chance.
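The rating step described in the abstract can be illustrated with a minimal sketch. This is a brute-force enumeration for small inputs, not the authors' algorithm. The abstract does not give the scoring formula, so the ratio of sequence fractions with add-one smoothing is an assumption, as are the function names (`patterns_in`, `sequence_counts`, `rate_patterns`), the length limits, and the use of `.` as the wild-card symbol.

```python
from collections import Counter


def patterns_in(seq, min_len=3, max_len=6, wildcard="."):
    """Return every pattern of min_len..max_len characters that matches seq,
    allowing at most one wild-card at an interior position (assumed limits)."""
    found = set()
    for length in range(min_len, max_len + 1):
        for start in range(len(seq) - length + 1):
            word = seq[start:start + length]
            found.add(word)
            # Substitute the wild-card at each interior position in turn.
            for i in range(1, length - 1):
                found.add(word[:i] + wildcard + word[i + 1:])
    return found


def sequence_counts(seqs, **kw):
    """Count, for each pattern, the number of sequences it occurs in."""
    counts = Counter()
    for s in seqs:
        counts.update(patterns_in(s, **kw))  # a set, so each sequence counts once
    return counts


def rate_patterns(foreground, background, min_support=2, **kw):
    """Rate patterns by over-representation in foreground versus background.

    The score is an assumed stand-in: the fraction of foreground sequences
    containing the pattern divided by the add-one-smoothed background fraction.
    """
    fg = sequence_counts(foreground, **kw)
    bg = sequence_counts(background, **kw)
    rated = []
    for pattern, k in fg.items():
        if k < min_support:  # keep only patterns in enough sequences
            continue
        fg_frac = k / len(foreground)
        bg_frac = (bg[pattern] + 1) / (len(background) + 1)
        rated.append((fg_frac / bg_frac, pattern, k, bg[pattern]))
    rated.sort(key=lambda r: (-r[0], r[1]))  # best score first, ties by pattern
    return rated


if __name__ == "__main__":
    upstream = ["TTACGTGAA", "GGACGTGCC", "CACGTGTTT"]  # toy data
    random_regions = ["TTTTTTTTT", "GGGGAAAAA", "CCCCCATAT"]  # toy data
    for score, pattern, k_fg, k_bg in rate_patterns(upstream, random_regions, min_support=3)[:5]:
        print(pattern, k_fg, k_bg, round(score, 2))
```

Counting sequences rather than total occurrences follows the abstract's wording ("occurred in at least some minimum number of sequences"). An exhaustive search over thousands of upstream regions would need a more efficient index than this nested loop.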

Figures

Figure 1
The distribution of all patterns (of unrestricted length) with at most one wild-card symbol in the regions −250 to −150 (upstream from the ORFs) and in randomly chosen genomic regions of length 100 bp. Dots in the graphs on the left correspond to patterns that occur in x sequences from the random regions (horizontal axis) and y sequences from the upstream regions (vertical axis). In the graphs on the right, the upstream regions are replaced by another set of random regions; these plots therefore show the expected statistics if the regions are chosen at random. (Top row) All patterns with at least 10 occurrences. (Second row) Subset of the top row: all patterns containing at least two characters C or G and not containing any of the substrings AAAA, TTTT, ATAT, or TATA. (Bottom two rows) Same plots as in the first two rows, but only including patterns with at most 200 occurrences in upstream or random regions (i.e., zoomed to the lower left corner).
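The second-row filter in the caption is simple enough to state as code. The sketch below is one reading of it: the function name `passes_composition_filter` is hypothetical, and treating a wild-card symbol as neither C nor G is an assumption, since the caption does not say how wild-cards are counted.

```python
# Substrings named in the Figure 1 caption as grounds for exclusion.
EXCLUDED_SUBSTRINGS = ("AAAA", "TTTT", "ATAT", "TATA")


def passes_composition_filter(pattern):
    """Keep a pattern only if it has at least two C or G characters
    and contains none of the excluded A/T-rich substrings."""
    gc_count = sum(1 for ch in pattern if ch in "CG")
    has_excluded = any(sub in pattern for sub in EXCLUDED_SUBSTRINGS)
    return gc_count >= 2 and not has_excluded


if __name__ == "__main__":
    for p in ("ACGTG", "TATACG", "AATTA", "C.G"):
        print(p, passes_composition_filter(p))
```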
Figure 2
Scores of the 30 best patterns found in the clusters of upstream sequences from genes with similar expression profiles, and in random sets of upstream sequences of the same size. The dotted line is the average score of the 30 best patterns found in the random sets of the respective sizes. For sets of 30 or more sequences, the pattern scores from the random sets stabilize and are considerably lower than the scores of the 30 best patterns for the respective clusters.
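The random-set comparison in this caption can be sketched as a resampling loop. This is an illustration under stated assumptions, not the authors' procedure: `random_baseline`, `score_patterns`, the number of trials, and the fixed seed are all hypothetical, and averaging the top scores over both rank and trial is only one reading of "the average score of the 30 best patterns". The caller supplies `score_patterns`, a function that returns a list of pattern scores for a set of sequences.

```python
import random
from statistics import mean


def random_baseline(all_seqs, set_size, score_patterns, n_trials=20, top_k=30, seed=0):
    """Average score of the top_k best patterns over random sets of set_size sequences.

    score_patterns(seqs) must return a list of numeric pattern scores;
    it is supplied by the caller and is not defined by the source.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    trial_means = []
    for _ in range(n_trials):
        sample = rng.sample(all_seqs, set_size)  # random set, without replacement
        best = sorted(score_patterns(sample), reverse=True)[:top_k]
        if best:  # skip trials in which no pattern was scored
            trial_means.append(mean(best))
    return mean(trial_means) if trial_means else None


if __name__ == "__main__":
    # Toy scorer for demonstration only: one "score" per sequence, its length.
    toy_seqs = ["ACGTA", "TTGCA", "GGGCC", "ATATC", "CCGTA", "TGCAT"]
    print(random_baseline(toy_seqs, set_size=3, score_patterns=lambda s: [len(x) for x in s]))
```

A cluster's own top scores can then be compared against the value returned for a random set of the same size.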
Figure 3
Distribution of bases A and T in the neighborhood of the translation start points in yeast. (⋄) A; (+) T; (□) A or T. The sequences from the gene’s strand are aligned on the start codon ATG at the positions 1–3.
Figure 4
Discretizing the continuous measurement space. An example of a time series that belongs to cluster C(5; 4; 8)(0000120).
