Nucleic Acids Res. 2016 May 5;44(8):e72.
doi: 10.1093/nar/gkv1518. Epub 2015 Dec 31.

ConBind: motif-aware cross-species alignment for the identification of functional transcription factor binding sites

Stefan H Lelieveld et al. Nucleic Acids Res.

Abstract

Eukaryotic gene expression is regulated by transcription factors (TFs) binding to promoters as well as distal enhancers. TFs recognize short but specific binding sites (TFBSs) located within the promoter and enhancer regions. Functionally relevant TFBSs are often highly conserved during evolution, leaving a strong phylogenetic signal. While multiple sequence alignment (MSA) is a potent tool for detecting this phylogenetic signal, current MSA implementations are optimized to align the maximum number of identical nucleotides. This approach may lead to the omission of conserved motifs that contain interchangeable nucleotides, such as the ETS motif (IUPAC code: GGAW). Here, we introduce ConBind, a novel method to enhance the alignment of short motifs, even if their mutual sequence similarity is only partial. ConBind improves the identification of conserved TFBSs by increasing the alignment accuracy of TFBS families within orthologous DNA sequences. Functional validation of the Gfi1b + 13 enhancer reveals that ConBind identifies additional functionally important ETS binding sites that were missed by all other tested alignment tools. In addition to the analysis of known regulatory regions, our web tool is useful for the analysis of TFBSs in previously uncharacterized DNA regions identified through ChIP-sequencing.
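To illustrate why a degenerate motif such as GGAW can be conserved even when the underlying nucleotides differ, the following is a minimal Python sketch of IUPAC motif scanning. It is not ConBind's code; the function names `iupac_to_regex` and `find_motif` and the two example sequences are hypothetical, chosen only to show that GGAA and GGAT are both instances of the same motif.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'GGAW' into a regex string."""
    return "".join(IUPAC[code] for code in motif.upper())

def find_motif(seq, motif):
    """Return 0-based start positions of all forward-strand matches.

    A lookahead is used so that overlapping matches are reported too.
    """
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [m.start() for m in pattern.finditer(seq.upper())]

# Hypothetical orthologous fragments differing by one base inside the motif.
seq_a = "CCAGGAAGTC"  # contains GGAA
seq_b = "CCAGGATGTC"  # contains GGAT

hits_a = find_motif(seq_a, "GGAW")
hits_b = find_motif(seq_b, "GGAW")
```

Both fragments carry an ETS site at the same position although the fourth motif base differs, which is the kind of partial similarity that an identity-only alignment score does not reward. The sketch scans the forward strand only; a real scan would also consider the reverse complement.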


Figures

Figure 1.
Schematic diagram of the ConBind pipeline. (A) Identification of suitable orthologous DNA sequences (species B, C and D) for the user-supplied sequence (species A). Core orthologous subsequences are found using BLAST. (B) The orthologous DNA sequences are extended to match the length of the region of interest (black), and the TFBSs of interest (supplied by the user) are identified. (C) Motif-aware alignment of the orthologous DNA sequences, using the candidate TFBS information to optimize the MSA. Dashed lines represent gaps in the alignment.
Figure 2.
Generation of motif-aware alignment. (A) TFBSs of the same motif family that are composed of different nucleotides are often not aligned with each other by current MSA algorithms such as ClustalW2. Both Sequence1 and Sequence2 contain the RUNT DNA binding motif (highlighted in yellow), but the sequences differ by one nucleotide (bold). On the right, the alignment produced by ClustalW2 is shown. Gaps are introduced inside the motif in order to maximize the overall alignment score. (B) Step-wise substitution of nucleotides: (1) The locations of the RUNT motif (yellow) are marked in four different sequences; (2) The letter for each nucleotide embodied in a motif is replaced by a new symbol carrying information about the original base type and the motif family; (3) The MSA is computed using the extended weight matrix (see panel C); (4) The symbols are replaced with the original base letters in the aligned sequences. (C) Left: the default identity matrix rewards the alignment of equal nucleotides per column, irrespective of their biological context. Right: the original identity matrix (red) is extended to take into account information about TF binding motifs. Nucleotides embodied in a motif are rewarded in a similar way to the default identity matrix when they match the original nucleotides (blue). Alignment of bases of the same motif family that differ between two or more sequences is rewarded by an MMW (Motif Match Weight) or MSW (Motif Mismatch Weight), depending on the original nucleotide.
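The substitution steps in panel B and the extended matrix in panel C can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not ConBind's implementation: the tagged symbols (lowercase letters), the numeric weights, and the names `encode`, `decode` and `score` are all hypothetical, and a single placeholder weight stands in for MMW/MSW because the caption does not specify how the two are assigned.

```python
# Hypothetical symbol table: each (motif family, base) pair gets its own symbol.
# Lowercase letters stand in for RUNT-tagged bases; the symbols ConBind
# actually uses are not given in the caption.
TAGGED = {("RUNT", "A"): "a", ("RUNT", "C"): "c",
          ("RUNT", "G"): "g", ("RUNT", "T"): "t"}
UNTAG = {symbol: base for (_, base), symbol in TAGGED.items()}

def encode(seq, sites, family="RUNT"):
    """Step 2: replace bases inside motif sites (start, end) with tagged symbols."""
    out = list(seq)
    for start, end in sites:
        for i in range(start, end):
            out[i] = TAGGED[(family, seq[i])]
    return "".join(out)

def decode(aligned):
    """Step 4: restore the original base letters; gaps pass through unchanged."""
    return "".join(UNTAG.get(ch, ch) for ch in aligned)

IDENTITY = 1.0      # assumed reward for identical symbols
MOTIF_WEIGHT = 0.8  # assumed placeholder standing in for MMW/MSW
MISMATCH = 0.0      # assumed score for everything else

def score(x, y):
    """Extended pairwise score over plain and motif-tagged symbols."""
    if x == y:
        return IDENTITY
    if x in UNTAG and y in UNTAG:
        # Different bases, but both belong to the (single) tagged motif family.
        return MOTIF_WEIGHT
    return MISMATCH

# Two hypothetical sequences whose motif instances differ by one base.
seq1 = "AATGTGGTCC"
seq2 = "AATGCGGTCC"
enc1 = encode(seq1, [(2, 8)])
enc2 = encode(seq2, [(2, 8)])
```

With these assumed weights, the differing motif bases (`t` versus `c`) score 0.8 instead of 0.0, so an aligner using `score` in step 3 has less incentive to open a gap inside the motif. Supporting several motif families would require distinct symbols per family and a family check in `score`.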
Figure 3.
Cost-effectiveness plane showing the performance of ConBind (CB) versus five other popular MSA methods: ClustalW2 (CW), T-Coffee (TC), ClustalOmega (CO), MUSCLE (MU) and MAFFT (MA). Effectiveness and Cost were computed using a set of 50 experimentally validated TFBSs. ConBind performs best for the identification of functional TFBSs, with the highest Effectiveness and a higher Effectiveness/Cost ratio than the other MSA algorithms.
Figure 4.
Identification of functional TFBSs in the Gfi1b + 13 enhancer. (A) Manual identification of functional TFBSs in the Gfi1b + 13 enhancer was performed as follows: (i) LiftOver of the mouse DNA sequence (mm9) to human (hg19), dog (canFam2), opossum (monDom5) and platypus (ornAna1) using UCSC (40). (ii) Alignment using ClustalW2 (12). (iii) Scoring conservation using GeneDoc (41). (iv) Manual search for TFBSs using Microsoft Word. ETS binding sites are shown in purple and pink, GFI binding sites in yellow, GATA motifs in green and EBOX motifs in blue. (B) Alignment of the Gfi1b + 13 enhancer using ConBind. Input data: Gfi1b+13 co-ordinates; (−)-strand; assembly: mm9; motifs: EBOX, ETS, GATA, GFI; species: human, dog, opossum, platypus. The output file was saved in MSF format in order to display conservation similarly to the ClustalW2 (12) alignment. The same genomic region (chr2:28,602,086–28,602,736, mm10) was used for the manual alignment (A) and for the alignment using ConBind (B), but only the most conserved part of the enhancer is shown. Color scheme as in (A). (C) Comparison of different MSA methods for identification of the three ETS binding sites at the 3′ end of the Gfi1b + 13 enhancer. The positions of the TFBSs (in bp) relative to the start of the enhancer are given in parentheses. Each bar represents an MSA method: ConBind (CB), ClustalW2 (CW), T-Coffee (TC), ClustalOmega (CO), MUSCLE (MU) and MAFFT (MA). The height of the bar shows the number of species aligned by each MSA method for each binding site (maximum of seven species). (D) Luciferase reporter assay in stably transfected 416b cells. All TFBSs of one motif family, e.g. all GATA motifs, were mutated at the same time by single nucleotide changes within each motif. The results are shown relative to the luciferase activity of the wild-type (WT) enhancer. Color scheme as in (A). t-test P-values: * ≤0.05, ** ≤0.01, *** ≤0.001.
The exact P-values are as follows: SV40/luc = 6.69E-18; SV/luc/Gfi1b + 13_Gata = 3.25E-06; SV/luc/Gfi1b + 13_Gfi1 = 0.0027; SV/luc/Gfi1b + 13_Ebox = 0.812; SV/luc/Gfi1b + 13_1–2 = 0.014; SV/luc/Gfi1b + 13_Ets3–5 = 0.013.
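As a quick check of how the exact P-values map onto the star notation in panel D, here is a small Python sketch. The thresholds and P-values are taken from the caption; the function name `stars` and the "ns" label for non-significant results are assumptions made for illustration, and the construct labels are copied as they appear in the caption.

```python
def stars(p):
    """Map a P-value to the caption's notation: * <=0.05, ** <=0.01, *** <=0.001."""
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return "ns"  # assumed label; the caption does not name this case

# Exact P-values as listed in the caption.
P_VALUES = {
    "SV40/luc": 6.69e-18,
    "SV/luc/Gfi1b + 13_Gata": 3.25e-06,
    "SV/luc/Gfi1b + 13_Gfi1": 0.0027,
    "SV/luc/Gfi1b + 13_Ebox": 0.812,
    "SV/luc/Gfi1b + 13_1–2": 0.014,
    "SV/luc/Gfi1b + 13_Ets3–5": 0.013,
}

significance = {name: stars(p) for name, p in P_VALUES.items()}
```

Under these thresholds, only the Ebox mutant falls outside the significance levels listed in the caption.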

