Comparative Study
BMC Bioinformatics. 2008 Apr 28;9:215.
doi: 10.1186/1471-2105-9-215.

Binning sequences using very sparse labels within a metagenome

Chon-Kit Kenneth Chan et al. BMC Bioinformatics. 2008.

Abstract

Background: In metagenomic studies, a process called binning is necessary to assign contigs that belong to multiple species to their respective phylogenetic groups. Most current binning methods, such as BLAST, k-mer and PhyloPythia, assign sequence fragments by comparing their sequence similarity or sequence composition with already-sequenced genomes, a collection that is still far from comprehensive. We propose a semi-supervised seeding method for binning that does not depend on knowledge of completed genomes. Instead, it extracts the flanking sequences of the highly conserved 16S rRNA from the metagenome and uses them as seeds (labels) to assign other reads based on their compositional similarity.
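As a rough illustration of the two ingredients named above (flanking sequences as seeds, and sequence composition as the basis for comparison), the following minimal sketch computes a normalised k-mer profile and cuts the flanks on either side of an annotated rRNA interval. The function names, the choice of k = 4, the flank length and the 0-based half-open coordinates are all assumptions made for this example; the abstract does not specify them.

```python
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalised k-mer frequencies of seq over the alphabet ACGT.

    k=4 is an illustrative choice, not a value taken from the abstract.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing N or other ambiguity codes
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def flanks(contig, rrna_start, rrna_end, flank_len):
    """Return (upstream, downstream) sequence around the rRNA interval.

    The interval [rrna_start, rrna_end) is assumed to be 0-based and
    half-open; the rRNA itself is excluded from both flanks.
    """
    upstream = contig[max(0, rrna_start - flank_len):rrna_start]
    downstream = contig[rrna_end:rrna_end + flank_len]
    return upstream, downstream
```

A seed vector would then be `kmer_profile` applied to a flank, and every other fragment would be represented by its own profile in the same feature space.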

Results: The proposed seeding method is implemented on an unsupervised Growing Self-Organising Map (GSOM), and called Seeded GSOM (S-GSOM). We compared it with four well-known semi-supervised learning methods in a preliminary test, separating random-length prokaryotic sequence fragments sampled from the NCBI genome database. We identified the flanking sequences of the highly conserved 16S rRNA as suitable seeds that could be used to group the sequence fragments according to their species. S-GSOM showed superior performance compared to the semi-supervised methods tested. S-GSOM may also be used to visually identify some species that do not have seeds. The proposed method was then applied to simulated metagenomic datasets using two different confidence threshold settings and compared with PhyloPythia, k-mer and BLAST. At the reference taxonomic level Order, S-GSOM outperformed all k-mer and BLAST results and showed comparable results with PhyloPythia for each of the corresponding confidence settings: S-GSOM performed better than PhyloPythia in the ≥ 10 reads datasets and comparably in the ≥ 8 kb benchmark tests.
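To make the idea of grouping fragments by seed concrete, here is a deliberately simplified sketch that labels each composition vector with its nearest seed and abstains when no seed is close enough. This is a plain nearest-seed rule, not S-GSOM, which first clusters the fragments on a Growing Self-Organising Map; the function name and the `max_distance` cut-off are hypothetical and do not correspond to the confidence settings reported above.

```python
import math


def assign_to_seeds(vectors, seeds, max_distance=None):
    """Label each vector with the nearest seed (Euclidean distance).

    seeds maps a label to a seed vector. If max_distance is given
    (a hypothetical cut-off), vectors farther than that from every
    seed are left unassigned (None).
    """
    labels = []
    for v in vectors:
        best_label, best_dist = None, math.inf
        for label, s in seeds.items():
            d = math.dist(v, s)
            if d < best_dist:
                best_label, best_dist = label, d
        if max_distance is not None and best_dist > max_distance:
            best_label = None  # too far from all seeds: abstain
        labels.append(best_label)
    return labels
```

Abstaining on distant vectors mirrors, in spirit only, the trade-off between how many fragments are assigned and how reliably they are assigned.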

Conclusion: In the task of binning using semi-supervised learning methods, results indicate S-GSOM to be the best of the methods tested. Most importantly, the proposed method does not require knowledge from known genomes and uses only very few labels (one per species is sufficient in most cases), which are extracted from the metagenome itself. These advantages make it a very attractive binning method. S-GSOM outperformed the binning methods that depend on already-sequenced genomes, and compares well with PhyloPythia, currently the most advanced binning method.

Figures

Figure 1
An example for preparing unlabelled input vectors and seeds. Unlabelled input vectors and seeds are prepared by avoiding the RNA sequences.
Figure 2
The S-GSOM algorithm. (a) Schematic diagram of the clustering process of S-GSOM; (b) Pseudocode for the node-assignment process in S-GSOM.
Figure 3
An overview of the binning process using S-GSOM.
Figure 4
Identification of an appropriate Clustering Percentage (CP). Five datasets for each of 5, 10 and 20 species are randomly sampled. The average clustering performance of S-GSOM on these datasets is plotted against CP values. Clustering performance tends to decrease as CP increases. A compromise value of CP = 55% is marked, where both the number of assigned nodes and clustering performance are high.
Figure 5
Resulting GSOM maps of randomly sampled species. The figure illustrates the GSOM results of clustering sequence fragments according to species: (a) 10Sp_Set1, (b) 20Sp_Set1 and (c) 40Sp_Set1. Each hexagon represents a single node. If a node contains only a single species, it is displayed in a colour that uniquely identifies that species. A node without a letter contains no samples. A grey node contains two or more species, and the number of species is displayed on the node.
Figure 6
Illustration of exploring an unseeded cluster. (a) The 5-species S-GSOM map. The seeded nodes are shown with unique colours and labels. Charcoal nodes are those assigned at CP = 27%, dark grey nodes at CP = 55%, light grey nodes at CP = 77%, and white nodes at CP = 100%. (b) Inter-node distance map with nodes assigned at CP = 55%.
