Predicting gene regulatory elements in silico on a genomic scale

A Brazma¹, I Jonassen, J Vilo, E Ukkonen

Affiliation

¹ European Molecular Biology Laboratory (EMBL) Outstation-Hinxton, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

PMID: 9847082
PMCID: PMC310790
DOI: 10.1101/gr.8.11.1202

Genome Res. 1998 Nov;8(11):1202-15.

Abstract

We performed a systematic analysis of gene upstream regions in the yeast genome for occurrences of regular expression-type patterns with the goal of identifying potential regulatory elements. To achieve this goal, we have developed a new sequence pattern discovery algorithm that searches exhaustively for a priori unknown regular expression-type patterns that are over-represented in a given set of sequences. We applied the algorithm in two cases: (1) discovery of patterns in the complete set of >6000 sequences taken upstream of the putative yeast genes and (2) discovery of patterns in the regions upstream of the genes with similar expression profiles. In the first case, we looked for patterns that occur more frequently in the gene upstream regions than in the genome overall. In the second case, we first clustered the upstream regions of all the genes by similarity of their expression profiles on the basis of publicly available gene expression data and then looked for sequence patterns that are over-represented in each cluster. In both cases we considered each pattern that occurred in at least some minimum number of sequences, and rated the patterns on the basis of their over-representation. Among the highest-rated patterns, most have matches to substrings in known yeast transcription factor-binding sites. Moreover, several of them are known to be relevant to the expression of the genes from the respective clusters. Experiments on simulated data show that the majority of the discovered patterns are not expected to occur by chance.
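
The rating step described in the abstract (count the sequences in which each pattern occurs, keep patterns that reach a minimum number of sequences, and rank them by over-representation) can be illustrated with a small brute-force sketch. This is not the authors' algorithm or scoring function, neither of which is given in the abstract. The pattern class (substrings with at most one interior single-character wildcard), the ratio score with a pseudocount, and all names such as `rate_patterns`, `min_support`, and `pseudocount` are assumptions made for illustration only.

```python
from collections import defaultdict

WILDCARD = "."  # assumed notation for a single-character wildcard


def patterns_in(seq, max_len):
    """Return the set of patterns occurring in seq: every substring of
    length <= max_len, plus each variant with one interior wildcard."""
    found = set()
    for i in range(len(seq)):
        for j in range(i + 1, min(len(seq), i + max_len) + 1):
            word = seq[i:j]
            found.add(word)
            # Wildcards are only placed strictly inside the word.
            for k in range(1, len(word) - 1):
                found.add(word[:k] + WILDCARD + word[k + 1:])
    return found


def sequence_counts(seqs, max_len):
    """Map each pattern to the number of sequences that contain it."""
    counts = defaultdict(int)
    for seq in seqs:
        for pattern in patterns_in(seq, max_len):
            counts[pattern] += 1
    return counts


def rate_patterns(foreground, background, max_len=8, min_support=2,
                  pseudocount=1.0):
    """Rank patterns by (fraction of foreground sequences matched) divided
    by (smoothed fraction of background sequences matched).
    The score is an illustrative assumption, not the published one."""
    fg = sequence_counts(foreground, max_len)
    bg = sequence_counts(background, max_len)
    rated = []
    for pattern, fg_count in fg.items():
        if fg_count < min_support:
            continue  # pattern occurs in too few sequences
        fg_frac = fg_count / len(foreground)
        bg_frac = (bg.get(pattern, 0) + pseudocount) / (len(background) + pseudocount)
        rated.append((fg_frac / bg_frac, pattern, fg_count, bg.get(pattern, 0)))
    # Highest score first; ties broken alphabetically for reproducibility.
    rated.sort(key=lambda item: (-item[0], item[1]))
    return rated


if __name__ == "__main__":
    # Toy data standing in for upstream regions and random genomic regions.
    upstream = ["AACGTGAA", "TTCGTGTT", "GGCGAGCC"]
    random_regions = ["AAAAAAAA", "TTTTTTTT", "GGGGCCCC"]
    for score, pattern, fg_count, bg_count in rate_patterns(upstream, random_regions)[:5]:
        print(pattern, fg_count, bg_count, round(score, 2))
```

Enumerating every substring this way is quadratic in sequence length per sequence, which is tolerable for toy inputs but would need a more efficient index for thousands of upstream regions.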

Figures

**Figure 1**
The distribution of all patterns (of unrestricted length) with at most one wild-card symbol in the regions −250 to −150 (upstream from the ORFs) and in randomly chosen genomic regions of length 100 bp. Dots in the graphs on the *left* correspond to patterns that occur in x sequences from the random regions (horizontal axis) and y sequences from the upstream regions (vertical axis). In the graphs on the *right*, the upstream regions are replaced by another set of random regions; therefore, these plots show the expected statistics if the regions are chosen at random. (*Top row*) All patterns with at least 10 occurrences. (*Second row*) Subset of the top row with all patterns containing at least two characters C or G and not containing any of the substrings AAAA, TTTT, ATAT, or TATA. (*Bottom two rows*) Same plots as in the first two rows, but only including patterns with at most 200 occurrences in upstream or random regions (i.e., zoomed to the lower left corner).

**Figure 2**
Scores of the 30 best patterns found in the clusters of upstream sequences from genes with similar expression profiles and in random sets of upstream sequences of the same size. The dotted line is the average score of the 30 best patterns found in the random sets of the respective sizes. For sets of 30 or more sequences, the pattern scores from the random sets of upstream sequences stabilize and are considerably lower than the scores of the 30 best patterns for the respective clusters.

**Figure 3**
Distribution of bases A and T in the neighborhood of the translation start points in yeast. (⋄) A; (+) T; (□) A or T. The sequences from the gene’s strand are aligned on the start codon ATG at the positions 1–3.

**Figure 4**
Discretizing the continuous measurement space. An example of a time series that belongs to cluster C(5; 4; 8)(0000120).
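
The pattern filter used for the second row of Figure 1 (at least two C or G characters, and none of the substrings AAAA, TTTT, ATAT, or TATA) is simple enough to express directly. The sketch below is one possible reading of that caption. The function name `passes_figure1_filter` is hypothetical, and treating a wild-card symbol as an ordinary non-matching character during the substring check is an assumption, since the caption does not say how wild-cards are handled.

```python
EXCLUDED_SUBSTRINGS = ("AAAA", "TTTT", "ATAT", "TATA")


def passes_figure1_filter(pattern):
    """Return True if the pattern has at least two C or G characters and
    contains none of the excluded substrings.

    Assumption: the check is literal, so a wildcard character never counts
    as C or G and never completes an excluded substring."""
    gc_count = pattern.count("C") + pattern.count("G")
    if gc_count < 2:
        return False
    return not any(sub in pattern for sub in EXCLUDED_SUBSTRINGS)


if __name__ == "__main__":
    for candidate in ["CACGTG", "TATACG", "AAAT", "AC.GT"]:
        print(candidate, passes_figure1_filter(candidate))
```

A filter of this kind removes patterns dominated by A and T runs, which are common throughout the genome and would otherwise crowd the plots.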
