Probing metagenomics by rapid cluster analysis of very large datasets
- PMID: 18846219
- PMCID: PMC2557142
- DOI: 10.1371/journal.pone.0003375
Abstract
Background: The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches to gene and genome annotation. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets defy conventional analysis and annotation methods, not only because of their sheer size but also because of many other features.
Methodology/principal findings: In this study, we describe an approach for rapid analysis of the sequence diversity and internal structure of such very large datasets, based on advanced clustering strategies that use a newly modified version of the CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million open reading frames (ORFs) identified in the GOS study and found more than 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families in a sequence similarity search and might represent novel protein families. We characterized the distribution of the large clusters by organism composition, functional class, and sample location.
Conclusion/significance: Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of the original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.
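The hierarchical clustering described in the methodology can be illustrated with a minimal sketch of greedy incremental clustering, the general strategy CD-HIT is known for: sequences are processed from longest to shortest, and each one either joins the first existing representative it resembles or founds a new cluster. This is a toy illustration, not the authors' implementation. The function names, the k-mer based `similarity` measure (a stand-in for a real sequence identity computation), and the threshold values in the usage note are all assumptions made for the example; the abstract does not state which identity levels were used.

```python
from typing import Dict, List


def kmer_set(seq: str, k: int = 3) -> set:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Toy similarity (assumption): fraction of the shorter sequence's
    k-mers that also occur in the longer sequence. A real tool would
    compute alignment-based sequence identity instead."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if len(a) > len(b):
        ka, kb = kb, ka  # make ka the k-mers of the shorter sequence
    if not ka:
        return 0.0
    return len(ka & kb) / len(ka)


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering: longest sequence first; each sequence
    joins the first representative it matches, else becomes a representative."""
    clusters: Dict[str, List[str]] = {}
    # Sort by decreasing length, breaking ties by id for determinism.
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in clusters:
            if similarity(seqs[rep], seqs[sid]) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]
    return clusters


def hierarchical_cluster(seqs: Dict[str, str], thresholds: List[float]) -> Dict[str, List[str]]:
    """Cluster at each threshold in turn, re-clustering only the
    representatives of the previous level and merging their members."""
    members = {sid: [sid] for sid in seqs}
    for t in thresholds:
        reps = {sid: seqs[sid] for sid in members}
        level = greedy_cluster(reps, t)
        members = {
            rep: [m for child in children for m in members[child]]
            for rep, children in level.items()
        }
    return members
```

Calling `hierarchical_cluster(seqs, [0.9, 0.6])` on a dictionary of sequence ids to protein strings first groups near-identical sequences and then merges the resulting representatives at a looser threshold. Because each later level compares only representatives, the number of pairwise comparisons shrinks at every step, which is the kind of saving that makes multi-level clustering attractive for very large datasets.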