PLoS One. 2010 Feb 3;5(2):e8938.

doi: 10.1371/journal.pone.0008938.

The effect of orthology and coregulation on detecting regulatory motifs

Valerie Storms¹, Marleen Claeys, Aminael Sanchez, Bart De Moor, Annemieke Verstuyf, Kathleen Marchal

PMID: 20140085
PMCID: PMC2815771
DOI: 10.1371/journal.pone.0008938

Valerie Storms et al. PLoS One. 2010.

Affiliation

¹ CMPG, Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Leuven, Belgium.

Abstract

Background: Computational de novo discovery of transcription factor binding sites is still a challenging problem. The growing number of sequenced genomes allows integrating orthology evidence with coregulation information when searching for motifs. Moreover, the more advanced motif detection algorithms explicitly model the phylogenetic relatedness between the orthologous input sequences and thus should be well adapted towards using orthologous information. In this study, we evaluated the conditions under which complementing coregulation with orthologous information improves motif detection for the class of probabilistic motif detection algorithms with an explicit evolutionary model.

Methodology: We designed datasets (real and synthetic) covering different degrees of coregulation and orthologous information to test how well Phylogibbs and Phylogenetic sampler, as representatives of motif detection algorithms with an evolutionary model, performed compared with MEME, a more classical motif detection algorithm that treats orthologs independently.

Results and conclusions: Under certain conditions detecting motifs in the combined coregulation-orthology space is indeed more efficient than using each space separately, but this is not always the case. Moreover, the difference in success rate between the advanced algorithms and MEME is still marginal. The success rate of motif detection depends on the complex interplay between the added information and the specificities of the applied algorithms. Insight into this relationship provides information useful to both developers and users. All benchmark datasets are available at http://homes.esat.kuleuven.be/~kmarchal/Supplementary_Storms_Valerie_PlosONE.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Overview of the test setup.**
Panel A presents the three different information spaces in which motif detection was assessed: the coregulation, the combined coregulation-orthology and the orthologous space. The coregulation space consists of a set of non-coding sequences from a reference species (Spec1 = REF) that each contain at least one motif site for a common TF (indicated by Gene 1 to Gene N). For the combined space, we extend the coregulation space with orthologous sequences selected from different species (indicated by Spec 2 to Spec M). One reference gene together with its orthologs is referred to as an *orthologous set* (indicated by a blue frame). The combined space thus consists of multiple orthologous sets while the orthologous space consists of a single orthologous set. We assessed the specific contribution of each space to the success rate of motif detection by performing the tests summarized in panels B and C. First, we tested the effect of adding different types of orthologous information, as shown in Panel B. These tests involve changing the topology by which the orthologs are related (equal star, unequal star and non-star-like topologies), changing the mutual distance between the orthologs (represented by elongating the branches of the tree) and using datasets with a different number of orthologs. Second, the effect of altering the signal-to-noise ratio of the datasets on the accuracy of the results was tested (Panel C) 1) by changing the degree of degeneracy of the motifs and 2) by omitting motif sites. We differentiate between leaving out motif sites in the coregulation direction and omitting them in the orthologous direction, as illustrated for a dataset in the combined space.
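The three information spaces in Panel A can be pictured as three views of the same collection of orthologous sets. The sketch below is only an illustration of that layout: the class name `OrthologousSet`, the helper functions and the toy sequences are hypothetical and do not come from the study or its benchmark files.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OrthologousSet:
    """One reference gene together with its orthologs (hypothetical container)."""
    gene: str                   # e.g. "Gene 1"
    sequences: Dict[str, str]   # species label -> non-coding sequence


def coregulation_space(sets: List[OrthologousSet], ref: str = "REF") -> Dict[str, str]:
    """Keep only the reference-species sequence of every coregulated gene."""
    return {s.gene: s.sequences[ref] for s in sets}


def combined_space(sets: List[OrthologousSet]) -> List[OrthologousSet]:
    """All orthologous sets of all coregulated genes."""
    return list(sets)


def orthologous_space(sets: List[OrthologousSet], gene: str) -> OrthologousSet:
    """The single orthologous set that belongs to one reference gene."""
    return next(s for s in sets if s.gene == gene)
```

With N genes and M species, the coregulation view holds N sequences, the combined view up to N × M, and the orthologous view up to M.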

**Figure 2. Results for motif detection in the coregulation space.**
Each dataset consists of ten coregulated genes from the reference species (proximity 0.80). Panel A displays the results for a synthetic dataset in which all sequences contain a site sampled from a high IC motif (A). Panel B shows the results for a dataset in which all sequences contain a site sampled from a low IC motif (B) and panel C shows the results of a dataset where the motif site is missing in two out of ten sequences. The remaining sequences each contain a motif site sampled from the high IC motif. Results were assessed by the performance measures D1 (the number of datasets, out of 100, with an output) and D1*RR (the number of datasets with a correct output), and by the quality measures PPV (the percentage of true sites among the predicted motif sites, averaged over all correct outputs) and Sens (the percentage of the true sites recovered by the algorithm, averaged over all correct outputs).
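The four measures defined in this caption can be computed along the following lines. This is a minimal sketch under stated assumptions: a site is represented as a hypothetical `(sequence_id, start)` pair, a predicted site counts as true only on an exact match, and whether an output is "correct" is passed in as a flag because the caption does not state the criterion. It returns fractions rather than percentages.

```python
def ppv_and_sens(predicted, true):
    """PPV and sensitivity for one output; sites are (sequence_id, start) pairs."""
    predicted, true = set(predicted), set(true)
    hits = len(predicted & true)          # exact-match true positives (assumption)
    ppv = hits / len(predicted) if predicted else 0.0
    sens = hits / len(true) if true else 0.0
    return ppv, sens


def summarize(runs):
    """runs: list of (predicted_sites, true_sites, is_correct), one per dataset."""
    with_output = [r for r in runs if r[0]]        # datasets that produced an output
    correct = [r for r in with_output if r[2]]     # datasets with a correct output
    scores = [ppv_and_sens(p, t) for p, t, _ in correct]
    n = len(scores)
    return {
        "D1": len(with_output),
        "D1*RR": n,
        "PPV": sum(s[0] for s in scores) / n if n else 0.0,    # averaged over correct outputs
        "Sens": sum(s[1] for s in scores) / n if n else 0.0,
    }
```

Averaging PPV and Sens over correct outputs only, as the caption specifies, means these two quality measures say nothing about runs that failed; D1 and D1*RR carry that information.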

**Figure 3. Effect of adding orthologs with distinct phylogenetic distances on motif detection in the combined space.**
Results are displayed on the retrieval of a low IC motif in a synthetic dataset. Panel (A) shows the results for the coregulation space that consists of ten coregulated reference genes. The remaining panels represent the results for the combined space that consists of the ten coregulated reference genes together with their orthologs, also referred to as ten orthologous sets. Each orthologous set consists of five prealigned sequences related through an equal star topology: the reference sequence with proximity 0.80 and four equally distant sequences with proximities of respectively 0.90 (B), 0.50 (C) and 0.20 (D). For the measures D1, D1*RR, PPV and Sens see Figure 2.
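To make the equal star topology concrete, the sketch below derives five sequences independently from one common ancestor. It assumes that a proximity q is the probability that a position is inherited unchanged from the ancestor and that a changed position is redrawn from a uniform background; that reading, the function names and the uniform background are assumptions made for illustration, not the generator used to build the benchmark. The sketch also ignores that positions inside a motif site would evolve under the motif model rather than the background.

```python
import random


def evolve(ancestor, q, rng, alphabet="ACGT"):
    """Copy each ancestral base with probability q, otherwise redraw it uniformly."""
    return "".join(
        base if rng.random() < q else rng.choice(alphabet)
        for base in ancestor
    )


def star_orthologous_set(ancestor, proximities, seed=0):
    """One descendant per proximity, all hanging directly off the same ancestor."""
    rng = random.Random(seed)   # fixed seed keeps the toy example reproducible
    return [evolve(ancestor, q, rng) for q in proximities]


# Reference sequence at proximity 0.80 plus four equally distant orthologs at 0.50,
# loosely mirroring the layout of panel (C).
example = star_orthologous_set("ACGT" * 10, [0.80, 0.50, 0.50, 0.50, 0.50])
```

Lowering the four ortholog proximities from 0.90 to 0.20 makes the descendants progressively less similar to the ancestor and to each other, which is the axis varied across panels (B) to (D).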

**Figure 4. Effect of the number of added orthologs on motif detection in the combined space.**
Results on the retrieval of both a high and a low IC motif are displayed for the real datasets: 1) results from the Gamma-proteobacterial datasets are indicated as black curves and 2) those of the *Saccharomyces* dataset are indicated as gray curves. Results for the high IC motif are indicated by circles and correspond to those obtained for LexA (bacterial dataset) or URS1H (yeast dataset), results for the low IC motif are indicated by stars and correspond to those obtained for TyrR (bacterial dataset) or RAP1 (yeast dataset). The panels represent the results of a dataset containing for each coregulated reference gene two (A), four (B) and six (for the bacterial datasets) or five (for the yeast datasets) prealigned orthologs (the reference gene included) (C). Panel (D) represents the results of a dataset containing for each coregulated reference gene six or five unaligned orthologs (the reference gene included). Results were assessed by the F-value defined as the harmonic mean of the spPPV (the percentage of true sites amongst the predicted motif sites for the reference species, averaged over all correct outputs) and the spSens (the percentage of the true sites found by the algorithm for the reference species, averaged over all correct outputs). The reference species are respectively *E. coli* (bacterial data) or *S. cerevisiae* (yeast data). The Y-axis represents the difference between the F-value obtained from searching motifs in the combined coregulation-orthology space and the F-value obtained from searching in the coregulation space only.
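The quantity on the Y-axis follows directly from the definitions in this caption. A minimal sketch, with hypothetical function names and with spPPV and spSens given as fractions between 0 and 1:

```python
def f_value(sp_ppv, sp_sens):
    """Harmonic mean of spPPV and spSens; defined as 0 when both are 0."""
    if sp_ppv + sp_sens == 0:
        return 0.0
    return 2 * sp_ppv * sp_sens / (sp_ppv + sp_sens)


def f_gain(combined, coregulation):
    """F-value in the combined space minus F-value in the coregulation space.

    Each argument is an (spPPV, spSens) pair for the reference species.
    """
    return f_value(*combined) - f_value(*coregulation)
```

A positive `f_gain` means that adding orthologs improved recovery of the reference-species sites, and a negative value means it hurt.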

**Figure 5. Effect of motif loss on motif detection in the combined space.**
The results are displayed for a synthetic dataset containing sites sampled from a high IC motif. Each dataset consists of ten coregulated reference genes complemented with their orthologs, also referred to as ten orthologous sets. Each orthologous set consists of five prealigned sequences related through an unequal star topology: four closely related orthologs with proximities of respectively 0.80 (reference ortholog), 0.90, 0.85 and 0.75 and one distantly related ortholog with a proximity of 0.20. Panel (A) represents the results when a motif site is present in all sequences of the orthologous sets. Panels (B) and (C) display the results when motif loss occurs in all sequences derived from respectively a closely (q = 0.75) or a distantly (q = 0.20) related species. Panel (D) shows the results when motif loss occurs in two out of ten coregulated reference genes and in all their corresponding orthologs. For the measures D1, D1*RR, PPV and Sens see Figure 2.

**Figure 6. Results for motif detection in the orthologous space.**
Results are displayed for a synthetic dataset with motif sites sampled from a high IC (on top) and a low IC motif (below). Each dataset consists of only one reference gene and its orthologs, referred to as one orthologous set. Panels (A) and (B) represent the results when the orthologous set contains respectively five and ten prealigned orthologs related through an equal star topology with a proximity of 0.50. Panel (C) represents the results when the orthologous set contains five prealigned orthologs related through an equal star topology with a proximity of 0.90 and panel (D) represents the results when the orthologous set contains five prealigned orthologs related through an unequal star topology. Note that for most tests the PPV equaled the Sens, resulting in overlapping dots. For the measures D1, D1*RR, PPV and Sens see Figure 2.
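The captions distinguish high IC from low IC motifs without restating how information content (IC) is measured. One common definition, assumed here and not confirmed by this page, sums over motif positions the difference between the maximum entropy (2 bits for DNA) and the observed entropy of each column, using a uniform background. The function name and the input layout are hypothetical.

```python
import math


def information_content(pwm):
    """Total IC in bits of a motif given as a list of {base: probability} columns.

    Assumes a uniform background over A, C, G, T, so each column scores
    between 0 bits (uniform) and 2 bits (fully conserved).
    """
    total = 0.0
    for column in pwm:
        # sum of p * log2(p) is the negative entropy; skip zero probabilities
        total += 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)
    return total
```

Under this definition a degenerate motif, in which several bases are tolerated at most positions, scores low, which matches the way the captions pair low IC with a weaker signal.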

Cited by

Zhang C, Wang J, Hua X, Fang J, Zhu H, Gao X. A mutation degree model for the identification of transcriptional regulatory elements. BMC Bioinformatics. 2011 Jun 27;12:262. doi: 10.1186/1471-2105-12-262. PMID: 21708002.
Vaughn JN, Ellingson SR, Mignone F, Arnim Av. Known and novel post-transcriptional regulatory sequences are conserved across plant families. RNA. 2012 Mar;18(3):368-84. doi: 10.1261/rna.031179.111. Epub 2012 Jan 11. PMID: 22237150.
