Binning sequences using very sparse labels within a metagenome
- PMID: 18442374
- PMCID: PMC2383919
- DOI: 10.1186/1471-2105-9-215
Abstract
Background: In metagenomic studies, a process called binning is necessary to assign contigs that belong to multiple species to their respective phylogenetic groups. Most of the current methods of binning, such as BLAST, k-mer and PhyloPythia, involve assigning sequence fragments by comparing sequence similarity or sequence composition with already-sequenced genomes that are still far from comprehensive. We propose a semi-supervised seeding method for binning that does not depend on knowledge of completed genomes. Instead, it extracts the flanking sequences of highly conserved 16S rRNA from the metagenome and uses them as seeds (labels) to assign other reads based on their compositional similarity.
Results: The proposed seeding method is implemented on an unsupervised Growing Self-Organising Map (GSOM), and called Seeded GSOM (S-GSOM). We compared it with four well-known semi-supervised learning methods in a preliminary test, separating random-length prokaryotic sequence fragments sampled from the NCBI genome database. We identified the flanking sequences of the highly conserved 16S rRNA as suitable seeds that could be used to group the sequence fragments according to their species. S-GSOM showed superior performance compared to the semi-supervised methods tested. Additionally, S-GSOM may also be used to visually identify some species that do not have seeds. The proposed method was then applied to simulated metagenomic datasets using two different confidence threshold settings and compared with PhyloPythia, k-mer and BLAST. At the reference taxonomic level Order, S-GSOM outperformed all k-mer and BLAST results and showed comparable results with PhyloPythia for each of the corresponding confidence settings, where S-GSOM performed better than PhyloPythia in the ≥ 10 reads datasets and comparable in the ≥ 8 kb benchmark tests.
Conclusion: In the task of binning using semi-supervised learning methods, results indicate S-GSOM to be the best of the methods tested. Most importantly, the proposed method does not require knowledge from known genomes and uses only very few labels (one per species is sufficient in most cases), which are extracted from the metagenome itself. These advantages make it a very attractive binning method. S-GSOM outperformed the binning methods that depend on already-sequenced genomes, and compares well to the current most advanced binning method, PhyloPythia.
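The seeding idea described in the abstract, assigning fragments to labelled seeds by compositional similarity, can be illustrated with a much simpler stand-in than S-GSOM. The sketch below is not the authors' method: it replaces the growing self-organising map with a plain nearest-seed rule on tetranucleotide frequency vectors. The choice of k = 4, the Euclidean distance, the `max_distance` cut-off (a loose analogue of a confidence threshold), and all function names are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
import math

K = 4  # assumption: tetranucleotide composition; the abstract does not state k
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def composition(seq, k=K):
    """Return the normalised k-mer frequency vector of a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made of A, C, G, T are counted; ambiguous bases are ignored.
    total = sum(counts[m] for m in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[m] / total for m in KMERS]


def distance(u, v):
    """Euclidean distance between two composition vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def assign_to_seeds(fragments, seeds, max_distance):
    """Assign each fragment to its nearest seed, or None if no seed is close.

    fragments and seeds map names/labels to sequences. max_distance is a
    hypothetical cut-off standing in for a confidence threshold.
    """
    seed_vectors = {label: composition(s) for label, s in seeds.items()}
    bins = {}
    for name, seq in fragments.items():
        vec = composition(seq)
        label, d = min(
            ((lab, distance(vec, sv)) for lab, sv in seed_vectors.items()),
            key=lambda pair: pair[1],
        )
        bins[name] = label if d <= max_distance else None
    return bins


if __name__ == "__main__":
    # Toy sequences, not real 16S rRNA flanking regions.
    seeds = {"AT": "AT" * 20, "GC": "GC" * 20}
    fragments = {"f1": "ATATATATAT", "f2": "GCGCGCGCGC", "f3": "AAAAAAAAAA"}
    print(assign_to_seeds(fragments, seeds, max_distance=0.5))
```

In this toy setting, fragments whose composition is far from every seed are left unassigned rather than forced into a bin, which is the role a stricter or looser threshold would play. A map-based method such as S-GSOM additionally clusters unlabelled fragments with one another, which this nearest-seed sketch does not attempt.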