Physiol Genomics. 2011 Sep 22;43(18):1038-48.
doi: 10.1152/physiolgenomics.00098.2011. Epub 2011 Jul 19.

Identifying functional single nucleotide polymorphisms in the human CArGome

Craig C Benson et al. Physiol Genomics. 2011.

Abstract

Regulatory SNPs (rSNPs) reside primarily within the non-protein-coding genome and are thought to disturb normal patterns of gene expression by altering DNA binding of transcription factors. Yet despite the explosive rise in SNP association studies, there is little information as to the function of rSNPs in human disease. Serum response factor (SRF) is a widely expressed DNA-binding transcription factor that binds with variable affinity to at least 1,216 permutations of a 10 bp transcription factor binding site (TFBS) known as the CArG box. We developed a robust in silico bioinformatics screening method to evaluate sequences around RefSeq genes for conserved CArG boxes. Utilizing a predetermined phastCons threshold score, we identified 8,252 strand-specific CArGs within an 8 kb window around the transcription start site of 5,213 genes, including all previously defined SRF target genes. We then interrogated this CArG dataset for the presence of previously annotated common polymorphisms. We found a total of 118 unique CArG boxes harboring a SNP within the 10 bp CArG sequence and 1,130 CArG boxes with SNPs located just outside the CArG element. Gel shift and luciferase reporter assays validated SRF binding and functional activity of several new CArG boxes. Importantly, SNPs within or just outside the CArG box often resulted in altered SRF binding and activity. Collectively, these findings demonstrate a powerful approach to computationally define rSNPs in the human CArGome and provide a foundation for similar analyses of other TFBS. Such information may find utility in genetic association studies of human disease where little is known about the functionality of rSNPs.
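
The distinction drawn above between SNPs inside the 10 bp CArG sequence and SNPs just outside it can be illustrated with a minimal Python sketch. It is not the authors' code. The function name, the 0-based half-open coordinates, and the `flank` width are all assumptions made for illustration; the abstract does not say how far "just outside" extends.

```python
def classify_snp(snp_pos, carg_start, carg_len=10, flank=10):
    """Classify a SNP relative to a CArG box on the same chromosome.

    Coordinates are 0-based; the box spans [carg_start, carg_start + carg_len).
    `flank` is a hypothetical width for "just outside" (not given in the source).
    """
    carg_end = carg_start + carg_len
    if carg_start <= snp_pos < carg_end:
        return "within"
    if carg_start - flank <= snp_pos < carg_end + flank:
        return "flanking"
    return "outside"


# Hypothetical coordinates: a CArG box occupying positions 100..109.
for pos in (105, 111, 130):
    print(pos, classify_snp(pos, carg_start=100))
```

In this toy example, position 111 lies 2 bp past the end of the box, the same arrangement described for ABHD5 in the Fig. 6 legend.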

Figures

Fig. 1.
Method of identifying functional single nucleotide polymorphism (SNPs) in the human CArGome. The schematic diagram delineates the computational approach and integration of datasets to discover conserved CArG boxes mapped to the human genome (hg19) using a 46-vertebrate species phastCons threshold score derived from a subset of validated serum response factor (SRF)-CArG binding sites in the mouse. Conserved CArG boxes were subsequently analyzed for the presence of SNPs using known variants from dbSNP v131.
Fig. 2.
Proximal RefSeq CArGome. Sequence logo of 8,252 conserved CArG boxes generated using a Perl script that analyzed ∼150 Mb of sequence, an 8 kb window around each of 18,925 RefSeq genes. The height of each stack (e.g., position 1) is measured in “bits” of information with each nucleotide ordered from the most frequent (C in position 1) to least frequent (G in position 1). Based on binding affinity of SRF to CArG boxes, we estimate there are at least 1,216 permutations of CArG in the human genome (53), a number we factored into our Perl script (see materials and methods).
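
One way to arrive at the 1,216 figure is to enumerate the 10-mers directly, assuming the consensus CArG box is CC(A/T)6GG and that a CArG-like sequence deviates from it by exactly 1 bp (as described in the Fig. 6 legend). The Python sketch below is illustrative only. The original analysis used a Perl script whose details are not given here, and the function names are hypothetical.

```python
from itertools import product

CONSENSUS_FLANKS = ("CC", "GG")  # assumed consensus: CC(A/T)6GG


def carg_permutations():
    """Return (consensus, allowed): consensus 10-mers, and consensus plus 1-bp deviants."""
    left, right = CONSENSUS_FLANKS
    consensus = {left + "".join(core) + right for core in product("AT", repeat=6)}
    allowed = set(consensus)
    for seq in consensus:
        for i, base in enumerate(seq):
            for alt in "ACGT":
                if alt != base:
                    # Substituting one base yields either another consensus
                    # sequence or a CArG-like sequence 1 bp off consensus.
                    allowed.add(seq[:i] + alt + seq[i + 1:])
    return consensus, allowed


def scan_for_cargs(sequence, allowed, width=10):
    """Slide a 10 bp window along one strand and report (offset, match) pairs."""
    sequence = sequence.upper()
    return [
        (i, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] in allowed
    ]


consensus, allowed = carg_permutations()
print(len(consensus), len(allowed))  # 64 consensus 10-mers, 1216 in total
print(scan_for_cargs("AAACCTTATATGGAAA", allowed))
```

Under these assumptions the count decomposes as 64 consensus sequences, plus 4 × 3 × 64 = 768 with a substitution in a flanking C or G, plus 6 × 2 × 32 = 384 with a C or G in one of the six A/T positions, for 1,216 in total.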
Fig. 3.
Conserved CArG distance to transcription start site (TSS) and genomic region location. A: histogram depicting the frequency of conserved CArGs (y-axis) classified into bins based on the distance of the CArG box to the TSS (in kb, x-axis) of the closest RefSeq gene (represented by the dashed vertical line). B: distribution of conserved CArG elements around associated RefSeq genes.
Fig. 4.
Genomic Regions Enrichment of Annotations Tool (GREAT) enrichment analysis of theoretical SRF-target genes based on biological process Gene Ontology (GO) terms. Enrichment analysis was performed using the GREAT algorithm (http://great.stanford.edu/) for each of the 8,252 conserved CArG sequences identified with the in silico method. The GREAT algorithm associates each human gene (n = 17,578) with a regulatory domain in the human genome (hg19 assembly) and calculates the total fraction of the genome annotated with GO terms (e.g., myofibril assembly). The submitted sequences that fall in each annotated GO term region are counted as “hits.” A binomial test compares the expected number of hits in a genome region with the observed number of hits. Listed in the figure are the most significantly enriched biological process GO terms (out of 7,170) for the conserved CArG sequences. Expected and observed counts for each GO term are listed with binomial test P value.
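
The binomial comparison of expected and observed hits described in this legend can be sketched in Python as follows. This is a simplified illustration, not GREAT's implementation. It assumes that each submitted region independently falls in a term's annotated domain with probability equal to the annotated fraction of the genome, and the function name and example numbers are hypothetical.

```python
import math


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * log_p + (n - i) * log_q
        for i in range(k, n + 1)
    ]
    top = max(log_terms)
    return math.exp(top) * sum(math.exp(t - top) for t in log_terms)


# Hypothetical inputs: 8,252 submitted regions, a GO term whose regulatory
# domains cover 0.1% of the genome, and 25 observed hits.
n_regions, annotated_fraction, observed_hits = 8252, 0.001, 25
expected_hits = n_regions * annotated_fraction
p_value = binomial_upper_tail(observed_hits, n_regions, annotated_fraction)
print(f"expected={expected_hits:.2f} observed={observed_hits} P={p_value:.3g}")
```

Working in log space avoids overflow in the binomial coefficients, which become very large when the number of regions is in the thousands.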
Fig. 5.
UCSC Genome Browser visualization. The CArGome track displays the conserved CArG box in green. The Vertebrate Cons track represents a conservation histogram across 46 species. Top: ∼11,000 bp view of the sequence for gene KLF6 (a.k.a. Krüppel-like factor 6). Bottom: 23 bp, zoomed-in sequence view of a conserved CArG and rSNP (rs10795076) that changes the C nucleotide (red box) to an A nucleotide.
Fig. 6.
Altered SRF binding with SNPs in or near CArG boxes. In vitro translated (IVT) SRF was incubated with 32P-labeled wild-type (WT) or SNP mutant CArG boxes in close proximity to 8 putative SRF-dependent genes. SRF nucleoprotein complexes are indicated by the lower arrow. Verification of SRF binding was demonstrated by SRF antibody supershifting (SS) of the nucleoprotein complex (upper arrow). The known SRF target gene CNN1 was used as a positive control for SRF binding; human CNN1 does not have any known SNPs associated with its CArG box. Boldfaced gene names indicate CArGs that conform to the consensus sequence, while lightfaced gene names are CArG-like sequences that deviate from the consensus sequence by 1 bp. Seven of the putative SRF-target genes studied have the SNP located within their associated 10-bp CArG box. The final gene, ABHD5, has a SNP located 2 bp outside its CArG box (highlighted with an asterisk). Data were replicated in an independent experiment. Exposure times were 24, 48, and 17 h, for each panel from left to right.
Fig. 7.
Altered promoter activity with SNPs in or near CArG boxes. Luciferase activity measured 24 h after transfection of the indicated reporters encompassing either WT or SNP sequence constructs into COS7 cells in the absence or presence of SRF-VP16 (see materials and methods). Data were normalized to a control Renilla reporter that was cotransfected in all samples. *Statistical significance between WT and SNP in the presence of SRF-VP16. Error bars show SD, n = 4. Similar results were found in independent transfections using another cell type. Boldfaced gene names indicate CArGs that conform to the consensus sequence, while lightfaced gene names are CArG-like sequences that deviate from the consensus sequence by 1 bp. 909 represents the empty vector control.

References

    1. Ameur A, Rada-Iglesias A, Komorowski J, Wadelius C. Identification of candidate regulatory SNPs by combination of transcription-factor-binding site prediction, SNP genotyping and haploChIP. Nucleic Acids Res 37: e85, 2009.
    2. Andersen MC, Engstrom PG, Lithwick S, Arenillas D, Eriksson P, Lenhard B, Wasserman WW, Odeberg J. In silico detection of sequence variations modifying transcriptional regulation. PLoS Comput Biol 4: e5, 2008.
    3. Boeva V, Surdez D, Guillon N, Tirode F, Fejes AP, Delattre O, Barillot E. De novo motif identification improves the accuracy of predicting transcription factor binding sites in ChIP-Seq data analysis. Nucleic Acids Res 38: e126, 2010.
    4. Buchwalter G, Gross C, Wasylyk B. Ets ternary complex transcription factors. Gene 324: 1–14, 2004.
    5. Buckland PR. The importance and identification of regulatory polymorphisms and their mechanisms of action. Biochim Biophys Acta 1762: 17–28, 2006.
