Comparative Study
Mol Biol Evol. 2008 Aug;25(8):1659-67.
doi: 10.1093/molbev/msn115. Epub 2008 May 19.

Signature genes as a phylogenomic tool

Bas E Dutilh et al. Mol Biol Evol. 2008 Aug.

Abstract

Gene content has been shown to contain a strong phylogenetic signal, yet its usage for phylogenetic questions is hampered by horizontal gene transfer and parallel gene loss and until now required completely sequenced genomes. Here, we introduce an approach that allows the phylogenetic signal in gene content to be applied to any set of sequences, using signature genes for phylogenetic classification. The hundreds of publicly available genomes allow us to identify signature genes at various taxonomic depths, and we show how the presence of signature genes in an unspecified sample can be used to characterize its taxonomic composition. We identify 8,362 signature genes specific for 112 prokaryotic taxa. We show that these signature genes can be used to address phylogenetic questions on the basis of gene content in cases where classic gene content or sequence analyses provide an ambiguous answer, such as for Nanoarchaeum equitans, and even in cases where complete genomes are not available, such as for metagenomics data. Cross-validation experiments leaving out up to 30% of the species show that approximately 92% of the signature genes correctly place the species in a related clade. Analyses of metagenomics data sets with the signature gene approach are in good agreement with the previously reported species distributions based on phylogenetic analysis of marker genes. Summarizing, signature genes can complement traditional sequence-based methods in addressing taxonomic questions.

Figures

FIG. 1.—
Definition of signature genes based on a partially unresolved phylogeny. For every species, presence (1) or absence (0) of 3 genes (OGs) is indicated. In this example, only OG1 is a signature for clade A, as it is present in clade A1, clade A2 and clade A3, but not in clade B. Although OG2 and OG3 are present in more species within clade A, they are not a signature for clade A because OG2 is not present in clade A1, and OG3 is present outside of clade A.
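The definition in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes an OG counts as present in a subclade when at least one species of that subclade carries it, and the function name, species names, and presence/absence values are all hypothetical, chosen only to reproduce the OG1/OG2/OG3 example.

```python
from typing import Dict, List, Set


def signature_genes(
    subclades: Dict[str, List[str]],
    outgroup: List[str],
    presence: Dict[str, Set[str]],
) -> Set[str]:
    """Return OGs found in every direct subclade and in no outgroup species.

    Assumption (not stated in the source): an OG is "present" in a subclade
    if at least one species in that subclade has it.
    """
    all_ogs = set().union(*presence.values())
    result = set()
    for og in all_ogs:
        in_every_subclade = all(
            any(og in presence[sp] for sp in species)
            for species in subclades.values()
        )
        absent_outside = not any(og in presence[sp] for sp in outgroup)
        if in_every_subclade and absent_outside:
            result.add(og)
    return result


# Hypothetical toy data mirroring the caption: clade A has three subclades,
# clade B is everything outside clade A.
SUBCLADES = {"A1": ["a1"], "A2": ["a2a", "a2b"], "A3": ["a3a", "a3b"]}
OUTGROUP = ["b1", "b2"]
PRESENCE = {
    "a1": {"OG1", "OG3"},
    "a2a": {"OG1", "OG2", "OG3"},
    "a2b": {"OG2", "OG3"},
    "a3a": {"OG1", "OG2", "OG3"},
    "a3b": {"OG2", "OG3"},
    "b1": {"OG3"},  # OG3 occurs outside clade A
    "b2": set(),
}

print(signature_genes(SUBCLADES, OUTGROUP, PRESENCE))
```

In this toy data OG2 and OG3 occur in more clade A species than OG1 does, yet only OG1 passes both tests, which is the point the caption makes.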
FIG. 2.—
Amounts of signature genes identified in prokaryotic taxa. The unresolved phylogeny is based on a superalignment tree (Ciccarelli et al. 2006) where we collapsed nodes with a bootstrap value lower than 80% and removed the Eukaryota. Several node names used in this paper are indicated with gray boxes. Branch widths and colors indicate the number of signature genes found for each node (see legend).
FIG. 3.—
The number of signature genes, perfect signature genes (coverage score 1), and signature genes with a coverage score cutoff of 0.75 found with increasing numbers of completely sequenced genomes. The genomes are added one by one, in order of appearance (according to www.ncbi.nlm.nih.gov/genomes). Initially, the number of signature genes increases almost linearly with the appearance of more genomes. The 60th genome, that of Streptomyces avermitilis, completes the signature-rich Streptomyces clade (Streptomyces coelicolor was the fourth genome), and causes a great jump in the number of both perfect and normal signature genes.
FIG. 4.—
Phylogenetic distribution of 3 metagenomics data sets (Venter et al. 2004; Tringe et al. 2005). Pies (a–c) show the total numbers of signature genes found for each clade (including subclades); pies (d–f) show the percentages of the total number of signature genes that exist for each clade; pies (g–i) show the percentages of sequences found with several phylogenetic markers in the original publications (averages of all measurements; taxa that were not in the reference tree are not shown). According to the phylogenetic marker-based analyses, all 3 metagenomics data sets were highly dominated by bacterial signature genes (farm soil: 72%; sea: 78%; and whale fall: 70%); archaeal signature genes were present in much lower percentages (farm soil: 0.05%; sea: 0.6%; and whale fall: 0.1%). These phylogenetically less informative clades are not shown in the charts. This analysis is based on STRING 6.3 OGs as the mapping of the metagenomics data sets was only available for that version (kindly provided by C. von Mering).
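The two signature-gene quantities plotted in this figure, the number of signature genes found per clade (including subclades) and that number as a percentage of all signature genes that exist for the clade, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the published pipeline: each distinct OG in a sample is counted once, every hit is credited to its clade and all ancestor clades, and the function names, clade tree, and OG identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Optional, Tuple


def lineage(clade: Optional[str], parent: Dict[str, Optional[str]]) -> Iterator[str]:
    """Yield a clade followed by all of its ancestors."""
    while clade is not None:
        yield clade
        clade = parent.get(clade)


def clade_profile(
    sample_ogs: Iterable[str],
    signature_of: Dict[str, str],
    parent: Dict[str, Optional[str]],
) -> Dict[str, Tuple[int, float]]:
    """Map each clade to (signature genes found, percent of existing ones)."""
    total: Counter = Counter()
    for clade in signature_of.values():
        for c in lineage(clade, parent):
            total[c] += 1  # signature genes that exist, including subclades

    found: Counter = Counter()
    for og in set(sample_ogs):  # assumption: count each distinct OG once
        clade = signature_of.get(og)
        if clade is None:
            continue  # OG is not a signature gene for any clade
        for c in lineage(clade, parent):
            found[c] += 1

    return {c: (found[c], 100.0 * found[c] / total[c]) for c in total}


# Hypothetical reference data and sample.
PARENT = {"Bacteria": None, "Proteobacteria": "Bacteria", "Archaea": None}
SIGNATURE_OF = {
    "OG10": "Bacteria",
    "OG11": "Proteobacteria",
    "OG12": "Proteobacteria",
    "OG20": "Archaea",
}
SAMPLE = ["OG11", "OG10", "OG99", "OG11"]

profile = clade_profile(SAMPLE, SIGNATURE_OF, PARENT)
for clade, (n_found, pct) in sorted(profile.items()):
    print(f"{clade}: {n_found} found, {pct:.1f}% of existing signature genes")
```

Normalizing by the number of signature genes that exist for each clade keeps signature-rich clades from dominating the profile simply because they have more genes to detect.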

References

    1. Andersson JO. Lateral gene transfer in eukaryotes. Cell Mol Life Sci. 2005;62:1182–1197.
    2. Brinkmann H, Philippe H. Archaea sister group of Bacteria? Indications from tree reconstruction artifacts in ancient phylogenies. Mol Biol Evol. 1999;16:817–825.
    3. Brochier C, Gribaldo S, Zivanovic Y, Confalonieri F, Forterre P. Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales? Genome Biol. 2005;6:R42.
    4. Charlebois RL, Doolittle WF. Computing prokaryotic gene ubiquity: rescuing the core from extinction. Genome Res. 2004;14:2469–2477.
    5. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. Toward automatic reconstruction of a highly resolved tree of life. Science. 2006;311:1283–1287.

Publication types