Genome Res. 2006 May;16(5):656-68.
doi: 10.1101/gr.4866006. Epub 2006 Apr 10.

Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression

Mathieu Blanchette et al. Genome Res. 2006 May.

Abstract

The identification of regulatory regions is one of the most important and challenging problems toward the functional annotation of the human genome. In higher eukaryotes, transcription-factor (TF) binding sites are often organized in clusters called cis-regulatory modules (CRM). While the prediction of individual TF-binding sites is a notoriously difficult problem, CRM prediction has proven to be somewhat more reliable. Starting from a set of predicted binding sites for more than 200 TF families documented in Transfac, we describe an algorithm relying on the principle that CRMs generally contain several phylogenetically conserved binding sites for a few different TFs. The method allows the prediction of more than 118,000 CRMs within the human genome. A subset of these is shown to be bound in vivo by TFs using ChIP-chip. Their analysis reveals, among other things, that CRM density varies widely across the genome, with CRM-rich regions often being located near genes encoding transcription factors involved in development. Predicted CRMs show a surprising enrichment near the 3' end of genes and in regions far from genes. We document the tendency for certain TFs to bind modules located in specific regions with respect to their target genes and identify TFs likely to be involved in tissue-specific regulation. The set of predicted CRMs, which is made available as a public database called PReMod (http://genomequebec.mcgill.ca/PReMod), will help analyze regulatory mechanisms in specific biological systems.

Figures

Figure 1.
Overview of the CRM prediction algorithm. TFBS predictions for different PWMs are shown with different geometric shapes, and their size indicates the score of the hit. Hits from individual species are combined using a weighted average method to compute the “Aligned hits.” The most significant (up to five) aligned hits are considered as “Tags” for the corresponding region. The sum of the Tag scores is used to calculate a “Module score” through a statistical significance estimation. This operation is performed for each position of the human genome, for sliding windows of size 100, 200, 500, 1000, and 2000 bp.
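The steps in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the species weights, the rule of keeping one best hit per PWM within a window, and all function names are assumptions made for illustration, and the final conversion of the tag sum into a statistically calibrated "Module score" is left out because the caption does not specify it.

```python
from typing import Dict, List, Tuple

WINDOW_SIZES = (100, 200, 500, 1000, 2000)  # bp, as listed in the caption
MAX_TAGS = 5                                # "up to five" aligned hits per region


def aligned_hit_score(species_scores: Dict[str, float],
                      weights: Dict[str, float]) -> float:
    """Weighted average of per-species hit scores (weights are hypothetical)."""
    total_weight = sum(weights[s] for s in species_scores)
    if total_weight == 0:
        return 0.0
    return sum(weights[s] * v for s, v in species_scores.items()) / total_weight


def window_tag_sum(hits: List[Tuple[int, str, float]],
                   start: int, size: int) -> Tuple[List[str], float]:
    """Pick up to MAX_TAGS PWMs in [start, start + size) and sum their scores.

    hits holds (position, pwm_name, aligned_score) tuples. Keeping only the
    best-scoring hit per PWM is an assumption of this sketch.
    """
    best: Dict[str, float] = {}
    for pos, pwm, score in hits:
        if start <= pos < start + size:
            best[pwm] = max(best.get(pwm, float("-inf")), score)
    top = sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:MAX_TAGS]
    return [pwm for pwm, _ in top], sum(score for _, score in top)


if __name__ == "__main__":
    demo_hits = [(10, "A", 2.0), (20, "B", 1.0), (30, "A", 3.0), (500, "C", 9.0)]
    for size in WINDOW_SIZES:
        print(size, window_tag_sum(demo_hits, 0, size))
```

In a full pipeline, the tag sum for each window would still need to be compared against a background distribution to obtain a significance-based module score.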
Figure 2.
Sensitivity and enrichment of pCRMs for various regions of interest. (A) Sensitivity of the module predictions at varying score thresholds, with respect to likely regulatory regions. Along the y-axis is the fraction of the bases within known regulatory regions that are predicted to belong to a pCRM. Along the x-axis is the number of predicted modules above a given threshold. Regions of interest are: 1 kb upstream: regions upstream of the TSS of Known Genes (based on the UCSC Genome Browser); Transfac sites: a set of 1209 experimentally verified binding sites from Transfac 7.2, mapped onto the human genome; TRRD modules: a set of 601 experimentally verified regulatory modules from the TRRD database; GALA modules: a set of 93 modules from the GALA database; CpG islands (based on the UCSC Genome Browser annotation); 1 kb upstream (non-CpG): regions upstream of the TSS of Known Genes that are not annotated as CpG islands; HS sites: a set of DNaseI hypersensitive sites from the ENCODE regions. (B) The fold enrichment is computed as the ratio between the size of the intersection between modules and regions of interest and the expected intersection size if modules were randomly positioned in the genome. (C,D) The analogous data, restricted to nonproximal regulatory regions, i.e., those located more than 1 kb away from the TSS of the closest gene.
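The two quantities defined in this caption can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: intervals are half-open (start, end) pairs on a single sequence, and the expected intersection under random placement is approximated as (bases covered by modules) × (bases covered by regions) / genome size, which ignores edge effects. All function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in bp


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals: List[Interval]) -> int:
    """Number of distinct bases covered by the intervals."""
    return sum(end - start for start, end in merge(intervals))


def intersection_size(a: List[Interval], b: List[Interval]) -> int:
    """Number of bases covered by both interval sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def sensitivity(modules: List[Interval], regions: List[Interval]) -> float:
    """Fraction of region bases that fall inside a predicted module."""
    return intersection_size(modules, regions) / covered_bases(regions)


def fold_enrichment(modules: List[Interval], regions: List[Interval],
                    genome_size: int) -> float:
    """Observed overlap divided by the overlap expected at random (assumed model)."""
    expected = covered_bases(modules) * covered_bases(regions) / genome_size
    return intersection_size(modules, regions) / expected
```

For example, with a single module at (0, 100), a single region at (50, 150), and a 1000 bp sequence, the overlap is 50 bases, the sensitivity is 0.5, and the fold enrichment is 5.0 under this simplified null model.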
Figure 3.
Distribution of pCRMs along a region of chromosome 11. (A) A 10-megabase region from chromosome 11 is shown (coordinates 99,308,463–109,308,463). The positions of the pCRMs (red) and the known genes (blue, from the UCSC Genome Browser) are shown. (B) A zoom into a 350-kilobase region containing the progesterone receptor gene (PGR) (coordinates 100,400,000–100,750,000). The pCRMs marked with an asterisk are those printed on our DNA microarray. (C) The composition of module M16589 is depicted as it appears in the PReMod database accompanying this study (http://genomequebec.mcgill.ca/PReMod). The positions of the hits for the five TRANSFAC matrices chosen as tags for this module are shown together with their individual scores.
Figure 4.
Distribution of pCRMs relative to specific regions of genes. The genome was divided into several types of noncoding regions: upstream of a gene (dark blue), 5′ UTR (pink), 1st intron (yellow), internal introns (light blue), last intron (brown), 3′ UTR (aqua), and downstream region (dark blue). (A) For each type of region, the fraction of bases included in a pCRM is graphed as a function of the distance to a reference point. For upstream regions, 5′ UTR, and first intron the reference point is the gene’s TSS. For middle introns the closest 5′ or 3′ intron boundary is used. For the last intron, the 3′ UTR and the region 3′ of the last exon, the 3′ end of the mRNA is used. Note that the 3′ UTR is off the scale in A. (B) Same as in A, but different scales are used for the x- and y-axes in order to better show the characteristics of all regions.
Figure 5.
Many TFs preferentially bind to specific regions relative to the TSS of their target genes. A heat map of the enrichment (represented as a Z-score) of a TF for different regions relative to TSSs is shown. Regions in red are highly enriched for binding sites for the given TF, while those in blue are depleted. The regions shown on the x-axis are as follows: >100kb upstream, pCRMs located more than 100 kb upstream from a TSS; 10–100kb upstream, pCRMs located >10 kb but <100 kb upstream from a TSS; 1–10kb upstream, pCRMs located >1 kb but <10 kb upstream from a TSS; 0–1kb upstream, pCRMs located within 1 kb upstream of a TSS; 1kb 1st intron, intronic pCRMs located within 1 kb downstream of the TSS of a gene; 10kb 1st intron, intronic pCRMs located within 10 kb downstream of a TSS; intron, intronic pCRMs located >10 kb from the TSS; 0–1kb down, pCRMs located within 1 kb from the 3′ end of a gene; 1–10kb down, pCRMs located >1 kb but <10 kb downstream from the 3′ end of a gene. See Methods for details on the computation of Z-scores.
