Genome Res. 2004 Aug;14(8):1562-74.
doi: 10.1101/gr.1953904. Epub 2004 Jul 15.

Clustering of DNA sequences in human promoters

Peter C FitzGerald et al. Genome Res. 2004 Aug.

Abstract

We have determined the distribution of each of the 65,536 DNA sequences that are eight bases long (8-mer) in a set of 13,010 human genomic promoter sequences aligned relative to the putative transcription start site (TSS). A limited number of 8-mers have peaks in their distribution (cluster), and most cluster within 100 bp of the TSS. The 156 DNA sequences exhibiting the greatest statistically significant clustering near the TSS can be placed into nine groups of related sequences. Each group is defined by a consensus sequence, and seven of these consensus sequences are known binding sites for the transcription factors (TFs) SP1, NF-Y, ETS, CREB, TBP, USF, and NRF-1. One sequence, which we named Clus1, is not a known TF binding site. The ninth sequence group is composed of the strand-specific Kozak sequence that clusters downstream of the TSS. An examination of the co-occurrence of these TF consensus sequences indicates a positive correlation for most of them except for sequences bound by TBP (the TATA box). Human mRNA expression data from 29 tissues indicate that the ETS, NRF-1, and Clus1 sequences that cluster are predominantly found in the promoters of housekeeping genes (e.g., ribosomal genes). In contrast, TATA is more abundant in the promoters of tissue-specific genes. This analysis identified eight DNA sequences in 5082 promoters that we suggest are important for regulating gene expression.
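The positional distribution described in the abstract can be illustrated with a minimal sketch. It assumes the promoters are equal-length strings aligned so that a fixed index marks the putative TSS. The function name `kmer_bin_counts`, the `tss_index` parameter, and the 20 bp default bin width are illustrative assumptions, not details taken from the paper. Only the given strand is scanned.

```python
from collections import Counter


def kmer_bin_counts(promoters, kmer, tss_index, bin_width=20):
    """Count occurrences of `kmer` per position bin, pooled over promoters.

    promoters: iterable of DNA strings aligned so that index `tss_index`
               is the putative TSS (a hypothetical layout for this sketch).
    bin_width: bin size in bp; 20 is an illustrative choice only.
    Returns a Counter mapping each bin's start (bp relative to the TSS)
    to the number of occurrences that begin inside that bin.
    """
    counts = Counter()
    for seq in promoters:
        start = seq.find(kmer)
        while start != -1:
            rel = start - tss_index                     # position relative to TSS
            bin_start = (rel // bin_width) * bin_width  # floor to bin boundary
            counts[bin_start] += 1
            start = seq.find(kmer, start + 1)           # allow overlapping hits
    return counts
```

A sequence that clusters would show one bin, typically near position 0, holding far more occurrences than the rest.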

Figures

Figure 1
Distribution of the dinucleotides (CG, GC, TT, CA) from –1000 to 500 bp in 13,010 human promoters.
Figure 2
Clustering factor of each 8-mer DNA sequence plotted at the position of the most populated bin: All 32,896 8-mers (A); 8687 8-mers with a maximum bin containing ≥20 members (B); clustering factor from –1000 bp to 500 bp for the 6838 8-mers that contain a maximum bin with ≥20 members from one of the 1000 random seventh-order Markov model data sets (C); clustering factor for the 7471 8-mers that contain a maximum bin with ≥20 members from –2500 to –1000 bp (D); and clustering factor values from –1000 to 500 bp for the 8687 8-mers from Figure 2B based on a randomized translocation of the TSS of between 0 and 500 bp (E).
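The figure of 32,896 8-mers, against the 65,536 in the abstract, is consistent with counting a sequence and its reverse complement as a single entry. The sketch below checks that arithmetic by enumeration. The helper names are hypothetical, and the strand-collapsing interpretation is an inference from the numbers rather than a statement on this page.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Strand-independent key: the smaller of seq and its reverse complement."""
    return min(seq, reverse_complement(seq))


def count_strand_collapsed_kmers(k):
    """Number of distinct k-mers when both strands are treated as one."""
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return len({canonical(kmer) for kmer in kmers})
```

For k = 8 there are 4^8 = 65,536 sequences, of which 4^4 = 256 equal their own reverse complement, giving (65,536 + 256) / 2 = 32,896 strand-collapsed sequences.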
Figure 3
The probability term P = [–log10(1 – p)] for the 8687 8-mers with a maximum bin containing ≥20 members. The 159 DNA sequences above the line at P = 7, a one in 10 million (single sampling) chance of being random, were manually annotated.
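The probability term in this caption can be checked numerically. In the sketch below, `probability_term` is a hypothetical helper name, and p is taken to be the probability that the observed clustering is not random, so that 1 – p is the chance of it being random.

```python
import math


def probability_term(p):
    """Return P = -log10(1 - p) for a probability p with 0 <= p < 1."""
    return -math.log10(1.0 - p)


# A one in 10 million chance of being random means 1 - p = 1e-7.
threshold = probability_term(1.0 - 1e-7)
```

With 1 – p = 10^-7 the term evaluates to 7 (up to floating-point rounding), which matches the cutoff line drawn at P = 7.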
Figure 4
The number of occurrences of each of the 32,896 DNA sequences in the 13,010 promoter sequences is plotted as a gray dot. The abundance of each of the 159 sequences with P ≥ 7 is plotted as a black triangle.
Figure 5
Distribution (number of occurrences per bin as a function of position relative to the TSS) of the DNA 8-mer (ACCGGAAG) that shows the greatest clustering (A) and the 159th 8-mer (CCGCCTCC; B).
Figure 6
Distribution of the 5-mer CCAAT and the 9-mer RRCCAATSR (A) and the CCAAT consensus RRCCAATSR and the 15 single base variants of the central CCAAT (B).
Figure 7
Distribution of selected sequences (8-mers and consensus patterns). (A) Three SP1 sequences (CCCGCCC, CCCCGCCC, CCCCGCCCC) and a nonpeaking single-base variant (CCCCCCCC). (B) Clus1 sequence (TCTCGCGA). (C) Two USF sequences (TCACGTGG, TCACGTGA). (D) Three CREB-like sequences (TGACGTCA, TGATGTCA, TTGCGTCA). (E) Strand-specific localization of the TATAAAD sequence. (F) Two variants of TATA (TATATAD and TATAAGD), plus strand (+) only. (G) Three NRF-1 sequences (CGCCTGCG, CGCGTGCG, CGCATGCG). (H) ETS core (CCGGAA), consensus sequence (VCCGGAARY), a peaking variant (VGCGGAARY), and a nonpeaking variant (VCCGGAAYR).
Figure 8
Distribution of the Kozak octamer AGATGGCG on the plus strand (+) and minus strand (–).
Figure 9
Distribution of selected sequences from the TRANSFAC database. (A) Sequences underrepresented near the TSS: SRY (WWAACAAWA) and LYF1 (TTTGGGAGR; Ikaros). (B) Uniformly distributed sequences: Myb (AACKGNC), HSF2 (GAANNWTCK), and TRE (TGAGTCA). (C) The core promoter element Initiator, Inr (YYANWYY). (D) The core promoter element downstream promoter element, DPE (RGWCGTG).
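The consensus patterns in these captions (for example RRCCAATSR, TATAAAD, and YYANWYY) use the standard IUPAC nucleotide ambiguity codes. As a minimal sketch of how such a pattern could be located in a sequence, the code below translates a consensus into a regular expression. The function names are hypothetical, only the given strand is searched, and this is not presented as the method used in the paper.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regex that finds overlapping hits."""
    body = "".join(IUPAC[code] for code in consensus.upper())
    # A lookahead matches at every start position, including overlaps.
    return re.compile("(?=(" + body + "))")


def find_sites(sequence, consensus):
    """Return 0-based start positions of the consensus on the given strand."""
    pattern = consensus_to_regex(consensus)
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

Searching the minus strand as well would mean running the same search on the reverse complement of the sequence, which matters for strand-specific elements such as the TATA box and the Kozak sequence.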

WEB SITE REFERENCES

    1. http://genome.nci.nih.gov/publications/promoters; Supplemental data for this paper.
    2. http://transfac.gbf.de/TRANSFAC; the Transcription Factor Database.
    3. http://expression.gnf.org; GNF Gene Expression Atlas.
    4. http://genome.ucsc.edu/; UCSC Genome Bioinformatics site.
    5. http://dbtss.hgc.jp/index.html; database of TSS (DBTSS).
