PLoS One. 2008;3(10):e3375.

doi: 10.1371/journal.pone.0003375. Epub 2008 Oct 10.

Probing metagenomics by rapid cluster analysis of very large datasets

Weizhong Li¹, John C Wooley, Adam Godzik

PMID: 18846219
PMCID: PMC2557142
DOI: 10.1371/journal.pone.0003375

Weizhong Li et al. PLoS One. 2008.

Affiliation

¹ California Institute for Telecommunications and Information Technology, University of California San Diego, La Jolla, California, USA.

Abstract

Background: The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size, but also by many other features, defy conventional analysis and annotation methods.

Methodology/principal findings: In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations.

Conclusion/significance: Our clustering took about two orders of magnitude less computational effort than the comparable protein family analysis in the original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.
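
The step-wise clustering described in the abstract can be illustrated with a minimal sketch. It assumes the general greedy incremental scheme that CD-HIT is known for (process sequences longest first; each sequence joins the first representative it matches above a threshold, otherwise it becomes a new representative) and then re-clusters the representatives at looser thresholds. The function names, the toy `identity` measure based on `difflib`, the example sequences, and the thresholds 0.9 and 0.6 are all assumptions made for illustration; they are not the authors' implementation or parameters, and the short-word filtering that makes CD-HIT fast is omitted.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Toy similarity: matched characters divided by the shorter length.

    A stand-in for a real alignment-based sequence identity (assumption).
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Greedy incremental clustering, longest sequence first.

    Returns a mapping from representative id to the ids in its cluster.
    """
    clusters: dict[str, list[str]] = {}
    # Longest first; ties are broken by id so the result is deterministic.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in clusters:
            if identity(seqs[rep], seqs[sid]) >= threshold:
                clusters[rep].append(sid)  # join the first matching representative
                break
        else:
            clusters[sid] = [sid]  # no match: sid becomes a new representative
    return clusters


def hierarchical_cluster(seqs: dict[str, str], thresholds: list[float]) -> dict[str, list[str]]:
    """Cluster at the first threshold, then re-cluster only the
    representatives at each following threshold, merging member lists."""
    members = {sid: [sid] for sid in seqs}
    for threshold in thresholds:
        reps = {sid: seqs[sid] for sid in members}
        merged = greedy_cluster(reps, threshold)
        members = {
            rep: [m for child in children for m in members[child]]
            for rep, children in merged.items()
        }
    return members


# Hypothetical toy sequences, not taken from the GOS data.
toy_seqs = {
    "a": "MKTAYIAKQRQISFVKSHFSRQ",
    "b": "MKTAYIAKQRQISFVKSHFSR",   # prefix of "a"
    "c": "MKTAYIAKQRQISFVDDDDDD",   # shares only the first 15 residues with "a"
    "d": "GGGGGPPPPPLLLLLWWWWW",    # unrelated
}
print(hierarchical_cluster(toy_seqs, [0.9, 0.6]))
# → {'a': ['a', 'b', 'c'], 'd': ['d']}
```

Because each later round compares only representatives, the number of pairwise comparisons shrinks at every level, which is the intuition behind applying this kind of strategy to very large datasets.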

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Step-wise clustering of GOS ORFs.**

**Figure 2. Distribution of clusters of GOS ORFs and NCBI NR proteins.**
The x-axis is the size of a cluster, defined as the number of non-redundant sequences at 90% identity. Blue bars with numbers, plotted against the left y-axis in log scale, show the numbers of clusters. The red line, plotted against the right y-axis, shows the number of corresponding ORFs or sequences. The left panel is for GOS, and the right panel is for NCBI NR.

**Figure 3. Distribution of ORFs by length.**
The x-axis is the length bin of ORFs. The y-axis is the number of ORFs in two groups: ORFs in predicted protein clusters and other ORFs.

**Figure 4. Pie chart of the predicted GOS protein clusters.**
The predicted GOS protein clusters are in three classes by similarities to existing protein sequences in NR: with homolog (BLASTP hits), with remote homolog (PDB-BLAST or FFAS hits), and novel (no hit). All matches to proteins in NR are considered in (a). Only matches to at least 20 non-redundant sequences in NR are included in (b).

**Figure 5. Distribution of clusters by their associated organisms and functional classes.**
The left figure shows the number of clusters by organisms at the level of main domains of life (Archaea, Eukaryota, Bacteria, and Viral). For example, “A,B” means a cluster has only archaeal and bacterial homologs. The right figure shows distributions by COG functional classes. Blue bars plotted against the left y-axis show numbers of clusters. Red and green lines plotted against the right y-axis are numbers of GOS ORFs and the underlying COG sequences multiplied by 40 for scaling. COG functional classes are: C, energy; D, cell division, chromosome partitioning; E, amino acid; F, nucleotide; G, carbohydrate; H, coenzyme; I, lipid; J, translation, ribosomal structure, and biogenesis; K, transcription; L, DNA replication, recombination, and repair; M, cell wall/membrane/envelope; N, cell motility and secretion; O, posttranslational modification, protein turnover, chaperones; P, inorganic ion; Q, secondary metabolites; R, general function prediction only; S, function unknown; and T, signal transduction.

**Figure 6. Distribution of predicted GOS protein clusters by their associated samples.**
The x-axis is the cluster size; the y-axis is the number of a cluster's associated samples. The pie chart inset shows distribution of clusters by the percentage of samples to which a cluster is associated.

**Figure 7. Distribution of predicted GOS protein clusters within each sample.**
The y-axis is the number of clusters. In the upper figure, clusters are grouped and colored by the percentage of samples to which a cluster is associated. In the bottom figure, clusters are colored by novelty, that is, whether they have homologs, remote homologs, or no homologs in known protein databases.

References

1. Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, et al. The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific. PLoS Biol. 2007;5:e77.
2. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, et al. The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families. PLoS Biol. 2007;5:e16.
3. Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, et al. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312:1355–1359.
4. DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, et al. Community genomics among stratified microbial assemblages in the ocean's interior. Science. 2006;311:496–503.
5. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, et al. The marine viromes of four oceanic regions. PLoS Biol. 2006;4:e368.
