PLoS One. 2011 Mar 18;6(3):e18011.

doi: 10.1371/journal.pone.0018011.

Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in marker gene phylogenetic trees

Dongying Wu¹, Martin Wu, Aaron Halpern, Douglas B Rusch, Shibu Yooseph, Marvin Frazier, J Craig Venter, Jonathan A Eisen

Affiliation

¹ Department of Evolution and Ecology, University of California Davis Genome Center, University of California Davis, Davis, California, United States of America.

PMID: 21437252
PMCID: PMC3060911
DOI: 10.1371/journal.pone.0018011

Dongying Wu et al. PLoS One. 2011.

Abstract

Background: Most of our knowledge about the ancient evolutionary history of organisms has been derived from data associated with specific known organisms (i.e., organisms that we can study directly such as plants, metazoans, and culturable microbes). Recently, however, a new source of data for such studies has arrived: DNA sequence data generated directly from environmental samples. Such metagenomic data has enormous potential in a variety of areas including, as we argue here, in studies of very early events in the evolution of gene families and of species.

Methodology/principal findings: We designed and implemented new methods for analyzing metagenomic data and used them to search the Global Ocean Sampling (GOS) expedition data set for novel lineages in three gene families commonly used in phylogenetic studies of known and unknown organisms: small subunit rRNA and the recA and rpoB superfamilies. Though the methods available could not accurately identify very deeply branched ss-rRNAs (largely due to difficulties in making robust sequence alignments for novel rRNA fragments), our analysis revealed the existence of multiple novel branches in the recA and rpoB gene families. Analysis of available sequence data likely from the same genomes as these novel recA and rpoB homologs was then used to further characterize the possible organismal source of the novel sequences.

Conclusions/significance: Of the novel recA and rpoB homologs identified in the metagenomic data, some likely come from uncharacterized viruses while others may represent ancient paralogs not yet seen in any cultured organism. A third possibility is that some come from novel cellular lineages that are only distantly related to any organisms for which sequence data is currently available. If there exist any major, but so-far-undiscovered, deeply branching lineages in the tree of life, we suggest that methods such as those described herein currently offer the best way to search for them.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Phylogenetic tree of the RecA superfamily.**
All RecA sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RecA superfamily are shaded and given a name on the right. Five of the proposed subfamilies contained only GOS sequences at the time of our initial analysis (RecA-like SAR, Phage SAR1, Phage SAR2, Unknown 1 and Unknown 2) and are highlighted by colored shading. As noted on the tree and in the text, sequences from two Archaea that were released after our initial analysis group in the Unknown 2 subfamily.
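
The representative-selection step described in this caption can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes the cluster assignments (here a hypothetical `cluster_of` mapping) have already been produced by a clustering tool, and it assumes that the longest member of each cluster serves as the representative; the caption does not say how representatives were chosen. The function name `pick_representatives` and the `min_members` parameter are likewise illustrative. Labels are written with the cluster ID preceding the accession, as in the tree.

```python
from collections import defaultdict


def pick_representatives(seqs, cluster_of, min_members=3):
    """Pick one representative per cluster with more than 2 members.

    seqs:       dict mapping accession -> sequence string
    cluster_of: dict mapping accession -> integer cluster ID
                (assumed to come from a prior clustering step)
    Returns a dict mapping "<clusterID>_<accession>" -> sequence.
    """
    # Group accessions by their cluster ID.
    clusters = defaultdict(list)
    for accession, cluster_id in cluster_of.items():
        clusters[cluster_id].append(accession)

    representatives = {}
    for cluster_id, members in sorted(clusters.items()):
        # Skip clusters that do not contain >2 members.
        if len(members) < min_members:
            continue
        # ASSUMPTION: the longest sequence represents the cluster;
        # ties are broken by accession so the result is deterministic.
        rep = max(members, key=lambda acc: (len(seqs[acc]), acc))
        # The cluster ID precedes the accession in the label.
        representatives[f"{cluster_id}_{rep}"] = seqs[rep]
    return representatives
```

The resulting sequences would then be passed to an aligner and a tree-building program; those steps are left out here because their invocation details are not given in the caption.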

**Figure 2. The largest assembly from the GOS data that encodes a novel RecA subfamily member (a representative of subfamily Unknown 2).**
This GOS assembly (ID 1096627390330) encodes 33 annotated genes plus 16 hypothetical proteins, including several with similarity to known archaeal genes (e.g., DNA primase, translation initiation factor 2, Table 2). The arrow indicates a novel *recA* homolog from the Unknown 2 subfamily (cluster ID 9).

**Figure 3. Phylogenetic tree of the RpoB superfamily.**
All RpoB sequences were grouped into clusters using the Lek algorithm. Representatives of each cluster that contained >2 members were then selected and aligned using MUSCLE. A phylogenetic tree was built from this alignment using PHYML; bootstrap values are based on 100 replicates. The Lek cluster ID precedes each sequence accession ID. Proposed subfamilies in the RpoB superfamily are shaded and given a name on the right. The two novel RpoB clades that contain only GOS sequences are highlighted by the colored panels.
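
The bootstrap values mentioned in this caption come from the tree-building program itself. As a rough illustration of the idea behind them, the sketch below generates nonparametric bootstrap replicates by resampling alignment columns with replacement. It is a generic illustration rather than the authors' pipeline; the function names, the `seed` parameter, and the toy input format (a dict of equal-length aligned strings) are assumptions made for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping label -> aligned sequence (equal lengths)
    rng:       a random.Random instance
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    n_cols = lengths.pop()
    # Sample column indices with replacement, as many as the original width.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # Apply the same sampled columns to every sequence.
    return {
        label: "".join(seq[i] for i in cols)
        for label, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n=100, seed=0):
    """Generate n replicates (100 matches the number in the caption)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]
```

A tree would be inferred from each replicate, and the support for a branch is the share of replicate trees in which that branch appears.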
