Comparative Study
Genome Res. 2002 Oct;12(10):1523-32.
doi: 10.1101/gr.323602.

Factors influencing the identification of transcription factor binding sites by cross-species comparison

Lee Ann McCue et al. Genome Res. 2002 Oct.

Abstract

As the number of sequenced genomes has grown, the questions of which species are most useful and how many genomes are sufficient for comparison have become increasingly important for comparative genomics studies. We have systematically addressed these questions with respect to phylogenetic footprinting of transcription factor (TF) binding sites in the gamma-proteobacteria, and have evaluated the statistical significance of our motif predictions. We used a study set of 166 Escherichia coli genes that have experimentally identified TF binding sites upstream of the gene, with orthologous data from nine additional gamma-proteobacteria for phylogenetic footprinting. Just three species were sufficient for approximately 74.0% of the motif predictions to correspond to the experimentally reported E. coli sites, and important characteristics to consider when choosing species were phylogenetic distance, genome size, and natural habitat. We also performed simulations using randomized data to determine the critical maximum a posteriori probability (MAP) values for statistical significance of our motif predictions (P = 0.05). Approximately 60% of motif predictions containing sites from just three species had average MAP values above these critical MAP values. The inclusion of a species very closely related to E. coli increased the number of statistically significant motif predictions, despite substantially increasing the critical MAP value.

Figures

Figure 1
Phylogeny of the 10 γ-proteobacterial species inferred from 16S rRNA sequences (see Methods). Branch lengths are the calculated distance from Escherichia coli to each of the other species, measured as the number of expected nucleotide substitutions per site.
Figure 2
Boxplots representing the phylogenetic footprinting results of the study set for several species combinations: six combinations of two species, 15 combinations of three species, and 20 combinations of four species (see supplemental data for details). (A) The number of orthologous data sets. For combinations of two species, the upper boundary was the Escherichia coli-Salmonella enterica serovar typhi (S. typhi) combination, with 161 data sets, and the lower boundary was the E. coli-Haemophilus influenzae combination, with 72 data sets. (B) The percentage of motif predictions that included sites from all of the species in the data for each combination of species. For combinations of two species, the upper boundary was the E. coli-S. typhi combination, at 98.8%, and the lower boundary was the E. coli-Pseudomonas aeruginosa combination, at 48.9%. (C) The percent correspondence with known transcription factor binding sites for each combination of species. For combinations of two species, the upper boundary was the E. coli-Yersinia pestis combination, at 66.2%, and the lower boundary was the E. coli-P. aeruginosa combination, at 35.5%. The whiskers represent the species combinations with the highest and lowest numbers (A) or the highest and lowest percentages (B,C); the black boxes encompass the regions between the upper and lower quartiles, and the white lines indicate the medians.
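The combination counts in this caption (6, 15, and 20) match what one would get by always including E. coli and drawing the remaining one, two, or three species from a pool of six. That pool size is an inference from the counts, not something the caption states, and the species labels below are hypothetical placeholders. A minimal Python sketch of the enumeration under that assumption:

```python
from itertools import combinations

# Assumption: E. coli is always included, and the other members of each
# combination come from a pool of six species. The pool size is inferred
# from the counts in the caption; the labels are placeholders.
REFERENCE = "E. coli"
OTHER_SPECIES = [f"species_{i}" for i in range(1, 7)]


def species_combinations(reference, others, total):
    """List every combination of `total` species that contains `reference`."""
    return [(reference,) + combo for combo in combinations(others, total - 1)]


# Number of combinations for two, three, and four species in total.
counts = {
    total: len(species_combinations(REFERENCE, OTHER_SPECIES, total))
    for total in (2, 3, 4)
}
print(counts)
```

Under this assumption the counts are the binomial coefficients C(6,1), C(6,2), and C(6,3), which is why they agree with the caption.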
Figure 3
The critical maximum a posteriori probability (MAP) values for the 95% quantile (P = 0.05) calculated from the simulations for randomized Escherichia coli data plus k additional sequences (1 ≤ k ≤ 9): All additional sequences were randomized (crosses); one sequence was added at 48% identity (on average) to the randomized E. coli sequence, and additional sequences were randomized (triangles); one sequence was added at 70% identity (on average) to the randomized E. coli sequence, and additional sequences were randomized (circles); one sequence was added at 48% identity (on average), another sequence was added at 70% identity (on average) to the randomized E. coli sequence, and additional sequences were randomized (plus symbols).
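The idea of a critical value taken at the 95% quantile of simulated null scores can be sketched generically. The Python example below is an illustration only: it assumes the scores from randomized data are already available as a list of numbers, and it does not reproduce the MAP computation or the simulation procedure used in the study. The function names `critical_value` and `is_significant` are hypothetical.

```python
import math


def critical_value(null_scores, alpha=0.05):
    """Return the empirical (1 - alpha) quantile of scores from randomized data.

    The result is the smallest observed null score such that at least a
    fraction (1 - alpha) of the null scores are less than or equal to it.
    """
    if not null_scores:
        raise ValueError("null_scores must not be empty")
    ordered = sorted(null_scores)
    # The small tolerance guards against floating-point error in the product.
    rank = math.ceil((1 - alpha) * len(ordered) - 1e-9)
    rank = max(rank, 1)
    return ordered[rank - 1]


def is_significant(score, critical):
    """A prediction counts as significant if its score exceeds the critical value."""
    return score > critical


# Toy null distribution: placeholder numbers, not values from the study.
toy_null = list(range(1, 101))
threshold = critical_value(toy_null, alpha=0.05)
print(threshold, is_significant(96, threshold), is_significant(90, threshold))
```

In this framing, a separate critical value would be computed for each simulation setting (for example, each number of additional sequences k), since the null distribution of scores changes with the composition of the data set.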
