Increasing coverage of transcription factor position weight matrices through domain-level homology

Brady Bernard¹, Vesteinn Thorsson, Hector Rovira, Ilya Shmulevich

PMID: 22952610
PMCID: PMC3428306
DOI: 10.1371/journal.pone.0042779

Brady Bernard et al. PLoS One. 2012;7(8):e42779. doi: 10.1371/journal.pone.0042779. Epub 2012 Aug 27.

Affiliation

¹ Institute for Systems Biology, Seattle, Washington, United States of America.

Abstract

Transcription factor-DNA interactions, central to cellular regulation and control, are commonly described by position weight matrices (PWMs). These matrices are frequently used to predict transcription factor binding sites in regulatory regions of DNA to complement and guide further experimental investigation. The DNA sequence preferences of transcription factors, encoded in PWMs, are dictated primarily by select residues within the DNA binding domain(s) that interact directly with DNA. Therefore, the DNA binding properties of homologous transcription factors with identical DNA binding domains may be characterized by PWMs derived from different species. Accordingly, we have implemented a fully automated domain-level homology searching method for identical DNA binding sequences.

By applying the domain-level homology search to transcription factors with existing PWMs in the JASPAR and TRANSFAC databases, we were able to significantly increase coverage in terms of the total number of PWMs associated with a given species, assign PWMs to transcription factors that did not previously have any associations, and increase the number of species represented by PWMs by more than an order of magnitude. Additionally, using protein binding microarray (PBM) data, we have validated the domain-level method by demonstrating that transcription factor pairs with matching DNA binding domains exhibit comparable DNA binding specificity predictions to transcription factor pairs with completely identical sequences.

The increased coverage achieved herein demonstrates the potential for more thorough species-associated investigation of protein-DNA interactions using existing resources. The PWM scanning results highlight the challenging nature of transcription factors that contain multiple DNA binding domains, as well as the impact of motif discovery on the ability to predict DNA binding properties. The method is additionally suitable for identifying domain-level homology mappings to enable utilization of additional information sources in the study of transcription factors. The domain-level homology search method, resulting PWM mappings, web-based user interface, and web API are publicly available at http://dodoma.systemsbiology.net.
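The core idea of the domain-level mapping can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `transfer_pwms`, the input dictionaries, the identifiers, and the example sequences are all hypothetical, and the sketch assumes that two transcription factors match only when every DNA binding domain sequence is identical and in the same order.

```python
from collections import defaultdict


def transfer_pwms(dbd_by_tf, pwm_by_tf):
    """Share PWM ids among TFs whose DNA binding domain sequences are identical.

    dbd_by_tf: dict mapping a TF id to a tuple of its DNA binding domain
               amino-acid sequences (hypothetical input format).
    pwm_by_tf: dict mapping a TF id to the set of PWM ids already annotated.
    Returns a dict mapping each TF id to the pooled PWM ids of its group.
    """
    # Group TFs by their exact tuple of domain sequences.
    groups = defaultdict(set)
    for tf, domains in dbd_by_tf.items():
        if domains:  # TFs without an annotated domain cannot be matched
            groups[tuple(domains)].add(tf)

    mapped = {}
    for members in groups.values():
        # Pool every PWM known for any member of the group.
        pooled = set()
        for tf in members:
            pooled |= pwm_by_tf.get(tf, set())
        # Assign the pooled PWMs to every member, including those that
        # previously had no PWM at all.
        if pooled:
            for tf in members:
                mapped[tf] = set(pooled)
    return mapped


# Made-up example: two TFs from different species share one domain sequence.
dbd = {
    "mouse_TF_A": ("RRKRTAYTRYQLLELEKEF",),
    "rat_TF_A": ("RRKRTAYTRYQLLELEKEF",),
    "fly_TF_B": ("KRARTAFTSEQLLELEREF",),
}
pwms = {"mouse_TF_A": {"PWM_0001"}}
result = transfer_pwms(dbd, pwms)
```

In this toy input, `rat_TF_A` inherits the PWM annotated for `mouse_TF_A`, while `fly_TF_B` receives nothing because its domain sequence differs and no PWM is known for it.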

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Data set used for validation of domain-level TF-DNA specificities.**
The top portion contains gene names, UniPROBE identifiers, and truncated position weight matrices for domain-identical transcription factor pairs (test set). The bottom portion contains completely-identical transcription factor pairs with replicate PBM data (control set). PID is the percent identity between the insert sequences of the transcription factor pairs used in the PBM experiments. Sequence logos were created using WebLogo.

**Figure 2. The number of position weight matrices for select organisms before and after homology mapping.**
The number of matrices that are initially associated with each organism is compared to the number following mapping of transcription factors with completely-identical sequences, as well as the increase following identical DNA binding domain-level mapping for the (A) JASPAR, (B) TRANSFAC, and (C) JASPAR & TRANSFAC databases. The JASPAR and TRANSFAC databases initially contained PWMs from 124 different species, compared to 1578 species following domain-level homology mapping. In particular, significantly increased PWM coverage is possible through domain-level mappings for the open-access JASPAR database.

**Figure 3. The number of unique transcription factors with position weight matrices (PWMs) resulting from domain-level homology mappings that did not previously have any associated PWMs.**
The number of unique factors resulting from mapping between completely-identical sequences is compared to the number of factors resulting from identical DNA binding domain-level mapping for the (A) JASPAR, (B) TRANSFAC, and (C) JASPAR & TRANSFAC databases. The number in parenthesis above each bar is the percentage increase above the initial annotated total number of unique transcription factors with PWMs. Significantly increased species-associated transcription factor coverage is enabled by domain-level mappings rather than the typical restriction to complete sequence matches.

**Figure 4. Spearman correlation coefficients (ρ) for position weight matrix (PWM) scanning of transcription factor pairs and their accompanying experimental protein binding microarray (PBM) fluorescence intensities.**
Transcription factor pair groupings, as in Figure 1, were cross scans of completely-identical pairs (CCI), cross scans of domain-identical pairs (CDI), self scans of completely-identical pairs (SCI), and self scans of domain-identical pairs (SDI). Each point represents a PWM:PBM pairing as described in the Methods. The transcription factor Elf3 (UniPROBE identifiers UP00090 and UP00407) was an outlier with the lowest correlation coefficients. The lower correlation coefficients for these identifiers are likely due to Elf3 having two different DNA binding domains.

**Figure 5. Self and cross Spearman correlation coefficients (ρ) between position weight matrix-based scores and experimental PBM fluorescence intensities.**
The blue points are the completely-identical and domain-identical transcription factor pairs of Figure 1. The alignment of blue points along the gray diagonal line demonstrates the comparable performance of PWMs derived from completely-identical and domain-identical transcription factor pairs, whereas the magnitude of ρ is an indication of how well the PWM captures the DNA binding properties of the transcription factor. As a point of comparison, the correlation coefficients for all other pairwise sets of transcription factors were calculated. The green points below the gray diagonal are indicative of PWMs from other transcription factors that failed to capture the DNA binding properties in the PBM data. Green points near the diagonal resulted from other transcription factors within the same domain family (*e.g.*, homeodomain) that have similar PWMs and, therefore, similar DNA binding properties. UniPROBE identifiers UP00017 and UP00389 were significantly outperformed by other PWMs in the data set (see text for details).

**Figure 6. The distribution of Spearman correlation coefficients for the domain-identical PWM and all other PWMs from the same homeodomain family for each TF from the test set in Figure 1.**
In each case, the correlation coefficient for the domain-identical PWM either clearly outperforms or is in the cluster of top performing PWMs, demonstrating that domain-identical PWMs capture the DNA sequence affinity and specificity of transcription factors better than considering the TF family alone.

**Figure 7. Average precision curves, calculated as the number of top n position weight matrix-based scores and experimental PBM fluorescence intensities in common.**
Precision curves were generated for cross scoring of completely-identical pairs (CCI), cross scoring of domain-identical pairs (CDI), self scoring of completely-identical pairs (SCI), and self scoring of domain-identical pairs (SDI) listed in Figure 1. The average precision curves nearly overlap for CCI and SCI, as well as for CDI and SDI, owing to the ability of self and cross PWM scans to equivalently capture the DNA binding properties in the PBM data. As with the Spearman correlation coefficients in Figure 4, the average precision for the domain-identical data set actually outperformed the completely-identical transcription factor pair scoring, reflecting the more challenging nature of the completely-identical data set (see text for details).
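The two evaluation measures named in the captions above, the Spearman correlation between PWM-based scores and PBM intensities and the top-n overlap precision, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure from the Methods: the PWM format (one base-to-probability dictionary per position), the uniform 0.25 background, the choice of the best-scoring window as the probe score, the division of the overlap count by n, and all probe sequences and intensities are hypothetical.

```python
import math


def score_sequence(pwm, seq):
    """Best log-odds score over all windows of seq (assumed scoring scheme).

    pwm: list of dicts, one per motif position, mapping base -> probability.
    Returns -inf when seq is shorter than the motif.
    """
    width = len(pwm)
    best = -math.inf
    for start in range(len(seq) - width + 1):
        # Log-odds against a uniform background of 0.25 per base.
        s = sum(math.log2(pwm[i][seq[start + i]] / 0.25) for i in range(width))
        best = max(best, s)
    return best


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def top_n_precision(scores, intensities, n):
    """Fraction of the n top-scored probes that are also among the n brightest."""
    def top(values):
        idx = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        return set(idx[:n])
    return len(top(scores) & top(intensities)) / n


# Made-up two-position motif favoring "AC", with hypothetical probes.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
probes = ["ACGT", "GGAC", "TTTT", "GATG"]
intensities = [9.0, 8.5, 1.0, 3.0]  # hypothetical PBM fluorescence values
scores = [score_sequence(pwm, p) for p in probes]
rho = spearman(scores, intensities)
precision_at_2 = top_n_precision(scores, intensities, 2)
```

A self scan would score a transcription factor's own PBM probes with its own PWM, and a cross scan would use the PWM of its paired factor; in this sketch the only difference between the two is which `pwm` is passed to `score_sequence`.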
