PLoS One. 2012;7(8):e41224.
doi: 10.1371/journal.pone.0041224. Epub 2012 Aug 21.

Genometa--a fast and accurate classifier for short metagenomic shotgun reads

Colin F Davenport et al. PLoS One. 2012.

Abstract

Metagenomic studies use high-throughput sequence data to investigate microbial communities in situ. However, considerable challenges remain in the analysis of these data, particularly with regard to speed and reliable analysis of microbial species as opposed to higher level taxa such as phyla. We here present Genometa, a computationally undemanding graphical user interface program that enables identification of bacterial species and gene content from datasets generated by inexpensive high-throughput short read sequencing technologies. Our approach was first verified on two simulated metagenomic short read datasets, detecting 100% and 94% of the bacterial species included with few false positives or false negatives. Subsequent comparative benchmarking analysis against three popular metagenomic algorithms on an Illumina human gut dataset revealed Genometa to attribute the most reads to bacteria at species level (i.e. including all strains of that species) and demonstrate similar or better accuracy than the other programs. Lastly, speed was demonstrated to be many times that of BLAST due to the use of modern short read aligners. Our method is highly accurate if bacteria in the sample are represented by genomes in the reference sequence but cannot find species absent from the reference. This method is one of the most user-friendly and resource efficient approaches and is thus feasible for rapidly analysing millions of short reads on a personal computer.

Availability: The Genometa program, a step-by-step tutorial and Java source code are freely available from http://genomics1.mh-hannover.de/genometa/ and from http://code.google.com/p/genometa/. The program has been tested on Ubuntu Linux and Windows XP/7.
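
The core idea described in the abstract, assigning each aligned read to the species of the reference it maps to and pooling all strains of that species, can be illustrated with a minimal sketch. This is not Genometa's Java implementation. It assumes plain-text SAM records rather than BAM, and it assumes reference names of the hypothetical form `Genus_species_strain`; the helper names `species_of` and `count_reads_per_species` and the example records are invented for illustration.

```python
from collections import Counter


def species_of(reference_name):
    """Collapse a reference name to 'Genus species'.

    Assumption: names look like 'Genus_species_strain'; real reference
    headers may need a different parsing rule.
    """
    tokens = reference_name.replace("_", " ").split()
    return " ".join(tokens[:2])


def count_reads_per_species(sam_lines):
    """Tally mapped reads per species from SAM-formatted text lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        rname = fields[2]       # column 3: reference sequence name
        # FLAG bit 0x4 marks an unmapped read; '*' means no reference.
        if flag & 0x4 or rname == "*":
            continue
        counts[species_of(rname)] += 1
    return counts


# Invented example records: two E. coli strains, one unmapped read,
# and one Bacteroides read.
example = [
    "@HD\tVN:1.0",
    "r1\t0\tEscherichia_coli_K12\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tEscherichia_coli_O157\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tBacteroides_vulgatus_ATCC8482\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
counts = count_reads_per_species(example)
for species, n in counts.most_common():
    print(species, n)
```

Pooling strains under a species name mirrors the abstract's "species level (i.e. including all strains of that species)"; a species absent from the reference can never appear in such a tally, which is the limitation the abstract notes.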

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. A screenshot displaying key new features with a glacier ice metagenome dataset loaded.
An aligner can be run from the graphical dialogue (top right) against a reference sequence. The resulting file is then converted to the standard BAM format and read in, revealing the number of reads mapped to each species in a sortable list that can be exported for further analysis (left). A bar graph displays the number of reads attributed to each taxon. Clicking on a blue bar takes the user to a genome-level view of the distribution of reads mapped against that taxon. Large datasets can thus be easily aligned, analysed and tested for plausibility from a graphical user interface.
Figure 2. Number of reads per species present in an in-house simulated ocean metagenome compared with the number of reads assigned by Genometa against a reference containing all known strains.
All bacterial species present were detected. Reads were retrieved in the same stoichiometric proportions in which they were inserted. Halobacterium sp. NRC-1 was also detected, but this strain is colinear with and practically identical to the included strain Halobacterium salinarum R1.
Figure 3. Number of reads from an artificial metagenome of known composition (SimLC dataset; [19]) which were included in the metagenome (black bars) and assigned to the correct bacterial species by Genometa (blue bars).
Only the top 21 species of the 113 bacteria included in the dataset are shown. Genometa achieves high accuracy on this dataset. Asterisks indicate strains which are included in the SimLC dataset but not in the Genometa reference sequence. Inter-strain differences generally mean fewer reads are attributed to these taxa. The cross denotes a species which is not present in the Genometa reference sequence.
Figure 4. The number of reads, out of 100,000 Illumina human gut 100 bp reads (SRR042027, Human Microbiome Project, [17]), assigned to bacterial species by four metagenomic programs.
Note the general agreement between the different programs, but the higher number of read assignments achieved by Genometa and MG-RAST. All programs found bacterial species typical of a human gut metagenome.
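
The validation on simulated metagenomes (Figures 2 and 3) amounts to comparing the species known to be in a dataset with the species to which reads were assigned. A minimal sketch of that comparison follows. The function name `evaluate_detection`, the `min_reads` threshold and all counts are hypothetical, chosen only to illustrate the bookkeeping; they are not values or code from the paper.

```python
def evaluate_detection(expected, observed, min_reads=1):
    """Compare species known to be present with species that received reads.

    expected: dict of species -> number of reads inserted into the dataset
    observed: dict of species -> number of reads assigned by the classifier
    min_reads: hypothetical threshold below which a species is not called
    """
    if not expected:
        raise ValueError("expected must contain at least one species")
    truth = set(expected)
    detected = {s for s, n in observed.items() if n >= min_reads}
    true_positives = truth & detected
    return {
        "detected_fraction": len(true_positives) / len(truth),
        "false_negatives": sorted(truth - detected),
        "false_positives": sorted(detected - truth),
    }


# Invented counts for illustration only.
expected = {"Species A": 500, "Species B": 300, "Species C": 200}
observed = {"Species A": 480, "Species B": 290, "Species D": 5}
result = evaluate_detection(expected, observed)
print(result)
```

Raising `min_reads` trades sensitivity for fewer false positives; a near-identical strain in the reference, such as the Halobacterium case in Figure 2, would show up here as a false positive unless strains are pooled by species first.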

References

    1. Metzker ML (2010) Sequencing technologies - the next generation. Nat Rev Genet 11: 31–46.
    2. Coetzee B, Freeborough M-J, Maree HJ, Celton JM, Rees DJG, et al. (2010) Deep sequencing analysis of viruses infecting grapevines: Virome of a vineyard. Virology 400: 157–163.
    3. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464: 59–65.
    4. Hess M, Sczyrba A, Egan R, Kim TW, Chokhawala H, et al. (2011) Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 331: 463–467.
    5. Shah N, Tang H, Doak TG, Ye Y (2011) Comparing bacterial communities inferred from 16S rRNA gene sequencing and shotgun metagenomics. Pac Symp Biocomput 165–176.
