Genome Biol. 2005;6(2):R18.
doi: 10.1186/gb-2005-6-2-r18. Epub 2005 Jan 26.

Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach

Olivier Elemento et al. Genome Biol. 2005.

Abstract

We describe a powerful new approach for discovering globally conserved regulatory elements between two genomes. The method is fast, simple and comprehensive, without requiring alignments. Its application to pairs of yeasts, worms, flies and mammals yields a large number of known and novel putative regulatory elements. Many of these are validated by independent biological observations, have spatial and/or orientation biases, are co-conserved with other elements and show surprising conservation across large phylogenetic distances.

Figures

Figure 1
Overview of the FastCompare approach. (a) Determination of orthologous pairs of ORFs, and extraction of the associated upstream regions (data not shown). (b) For each k-mer (here CACGTGA), determination of the sets of ORFs that contain it in their upstream regions, in each species separately. The conservation score (hypergeometric p-values to assess the overlap between both sets) is then calculated. (c) Ranking of all k-mers on the basis of their conservation scores.
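The scoring and ranking steps in panels (b) and (c) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout (ortholog-pair ID mapped to upstream sequence), the single-strand substring match, and the use of -log10 of the hypergeometric tail p-value as the score are all assumptions made for the example.

```python
from math import comb, log10


def orfs_containing(kmer, upstream):
    """Return the IDs of ortholog pairs whose upstream sequence contains kmer."""
    return {orf for orf, seq in upstream.items() if kmer in seq}


def hypergeom_tail(overlap, n_a, n_b, total):
    """P(X >= overlap) when n_b ORFs are drawn from total, n_a of which are marked."""
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, i) * comb(total - n_a, n_b - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total, n_b)


def conservation_score(kmer, upstream_a, upstream_b):
    """Score the overlap between the two per-species ORF sets for one k-mer.

    Assumption: both dictionaries are keyed by the same ortholog-pair IDs,
    and the score is reported as -log10(p).
    """
    set_a = orfs_containing(kmer, upstream_a)
    set_b = orfs_containing(kmer, upstream_b)
    p = hypergeom_tail(len(set_a & set_b), len(set_a), len(set_b), len(upstream_a))
    return -log10(p) if p > 0 else float("inf")


def rank_kmers(kmers, upstream_a, upstream_b):
    """Return (k-mer, score) pairs sorted from most to least conserved."""
    scored = [(k, conservation_score(k, upstream_a, upstream_b)) for k in kmers]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because each k-mer is scored only from set membership, no alignment of the upstream regions is needed, which is the property the abstract emphasizes.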
Figure 2
Distributions of conservation scores for actual (red) and randomized (black) data obtained when applying FastCompare to S. cerevisiae and S. bayanus. Both distributions were constructed using bin sizes of 5. The top portion of the figure is not shown for the purpose of presentation. The distributions show that high conservation scores are unlikely to be obtained from randomized data. Also, a large number of 7-mers on the tail of the distribution correspond to experimentally verified transcription-factor-binding sites in yeast.
Figure 3
Proportions of 7-mers supported by different types of independent biological data as a function of the conservation score rank, obtained when applying FastCompare to S. cerevisiae and S. bayanus: (a) known motifs, (b) chromatin-IP, (c) functional enrichment, (d) under/overexpression, (e) TRANSFAC. Windows of size 100 were used to construct the figures (see Materials and methods). Panels (a-e) strongly indicate that the frequency of support increases with conservation score as calculated by FastCompare.
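One way to compute such a windowed proportion is sketched below. The caption only states that windows of size 100 were used; whether the windows slide or are disjoint is not given here, so the `step` parameter and the function name are assumptions for illustration.

```python
def windowed_support(supported, window=100, step=1):
    """Fraction of supported k-mers in each window along the rank order.

    `supported` is a list of booleans ordered by conservation score rank
    (best first). step=1 gives sliding windows; step=window gives
    non-overlapping ones. Both are assumptions, not the paper's stated choice.
    """
    last_start = len(supported) - window
    return [
        sum(supported[i:i + window]) / window
        for i in range(0, last_start + 1, step)
    ]
```

A curve that decreases from left to right would indicate that independent support is concentrated among the highest-ranked 7-mers.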
Figure 4
Distribution of median distances to ATG of all 7-mers, obtained when applying FastCompare to S. cerevisiae and S. bayanus. For each 7-mer, a median distance to ATG was calculated using the positions of matches upstream of S. cerevisiae genes within the conserved set for this 7-mer. The 8,170 median distances were then binned into 20-bp bins, and the resulting histogram was smoothed using a normal kernel. The median distances for several known binding sites in S. cerevisiae are also indicated (see Table 1).
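The per-k-mer median, the 20-bp binning, and the kernel smoothing described in this caption can be sketched as follows. The function names, the 1,000-bp upper limit, and the kernel width of one bin are assumptions chosen for the example; the caption does not specify them.

```python
from math import exp
from statistics import median


def median_distances(positions_by_kmer):
    """Map each k-mer to the median distance (bp) of its matches from the ATG."""
    return {k: median(v) for k, v in positions_by_kmer.items() if v}


def binned_counts(values, bin_width=20, max_value=1000):
    """Count values falling in consecutive bins of bin_width bp up to max_value."""
    n_bins = max_value // bin_width
    counts = [0] * n_bins
    for v in values:
        idx = int(v // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def gaussian_smooth(counts, sigma_bins=1.0):
    """Smooth a histogram with a normal kernel, renormalizing at the edges."""
    radius = max(1, int(3 * sigma_bins))
    offsets = range(-radius, radius + 1)
    weights = [exp(-0.5 * (d / sigma_bins) ** 2) for d in offsets]
    smoothed = []
    for i in range(len(counts)):
        num = den = 0.0
        for d, w in zip(offsets, weights):
            j = i + d
            if 0 <= j < len(counts):
                num += w * counts[j]
                den += w
        smoothed.append(num / den)
    return smoothed
```

Renormalizing by the weights that fall inside the histogram keeps the curve from dipping artificially at the first and last bins.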
Figure 5
Validation of the conservation scores obtained when applying FastCompare to C. elegans and C. briggsae. (a) Distributions of conservation scores for actual (red) and randomized (black) data, showing that high conservation scores are unlikely to be obtained by chance. Conservation scores for some known regulatory elements are also indicated. Both distributions were constructed using bin sizes of 5, and the top portion of the figure is not shown for the purpose of presentation. (b-d) Proportion of 7-mers supported by different types of independent biological data (using windows of size 100, see Materials and methods) as a function of the conservation score rank, obtained when applying FastCompare to C. elegans and C. briggsae. (b-d) indicate that the frequency of support increases with conservation score as calculated by FastCompare.
Figure 6
Distribution of median distances to ATG of all 7-mers, obtained when applying FastCompare to C. elegans and C. briggsae. For each 7-mer, a median distance to ATG was calculated using the positions of matches upstream of C. elegans genes within the conserved set for this 7-mer. The 8,170 median distances were then binned into 20-bp bins, and the resulting histogram was smoothed using a normal kernel. The median distances for several known binding sites in C. elegans are also indicated.
Figure 7
Validation of the conservation scores obtained when applying FastCompare to D. melanogaster and D. pseudoobscura. (a) Distributions of conservation scores for actual (red) and randomized (black) data, showing that high conservation scores are unlikely to be obtained from randomized data. Conservation scores for certain known regulatory elements are also indicated. Both distributions were constructed using bin sizes of 5, and the top portion of the figure is not shown for the purpose of presentation. (b, c) Proportion of 7-mers supported by different types of independent biological data (using windows of size 100, see Materials and methods) as a function of the conservation score rank, obtained when applying FastCompare to D. melanogaster and D. pseudoobscura. (b, c) strongly indicate that the frequency of support increases with conservation score as calculated by FastCompare.
Figure 8
Validation of the conservation scores obtained when applying FastCompare to H. sapiens and M. musculus. (a) Distributions of conservation scores for actual and randomized data, showing that high conservation scores are unlikely to be obtained by chance. Conservation scores for some known regulatory elements are also indicated. Both distributions were constructed using bin sizes of 5, and the top portion of the figure is not shown for the purpose of presentation. (b-d) Proportion of 7-mers supported by different types of independent biological data (using windows of size 100, see Materials and methods) as a function of the conservation score rank, obtained when applying FastCompare to H. sapiens and M. musculus. (b-d) strongly indicate that the frequency of support increases with conservation score as calculated by FastCompare.
Figure 9
Distribution of median distances to ATG of all 7-mers, obtained when applying FastCompare to H. sapiens and M. musculus. For each 7-mer, a median distance to ATG was calculated using the positions of matches upstream of H. sapiens genes within the conserved set for this 7-mer. The 8,170 median distances were then binned into 20-bp bins, and the resulting histogram was smoothed using a normal kernel. The median distances for several known binding sites in H. sapiens are also indicated.
Figure 10
Partial representation (most proximal region) of the aligned 1 kb upstream regions of the S. cerevisiae STE12 gene and its orthologs. (a) The highest scoring 7-mers found by FastCompare in a comparison between S. cerevisiae and S. bayanus are highlighted. FastCompare correctly predicts the conserved and experimentally verified binding sites for Mcm1, Matalpha2 and Ste12 (proximal) (see [8] for review). A more distal non-verified binding site for Ste12 and an RRPE site close to the distal Matalpha2 site are conserved between the four species and are also predicted by FastCompare. FastCompare also predicts several nonconserved sites in each species. For example, in S. cerevisiae, it identifies a Rox1-binding site overlapping with the second Ste12 site, and a putative Upc2-binding site. (b) Aligned 1 kb upstream region of the S. cerevisiae STE2 gene and its S. paradoxus ortholog only, with the same highlighted 7-mers as in (a). Since the two yeast species diverged very recently, the two upstream regions appear highly conserved. However, using the FastCompare output allows efficient selection of verified and putative binding sites. CER, S. cerevisiae; Bay, S. bayanus; Par, S. paradoxus; Mik, S. mikatae.

References

    1. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science. 2002;298:799–804. doi: 10.1126/science.1075090.
    2. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23. doi: 10.1093/bioinformatics/16.1.16.
    3. Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science. 2003;301:71–76. doi: 10.1126/science.1084337.
    4. Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003;423:241–254. doi: 10.1038/nature01644.
    5. Aparicio S, Morrison A, Gould A, Gilthorpe J, Chaudhuri C, Rigby P, Krumlauf R, Brenner S. Detecting conserved regulatory elements with the model genome of the Japanese puffer fish, Fugu rubripes. Proc Natl Acad Sci USA. 1995;92:1684–1688.