Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2009 Feb 11:10:56.
doi: 10.1186/1471-2105-10-56.

TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach

Affiliations

TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach

Naryttza N Diaz et al. BMC Bioinformatics. .

Abstract

Background: Metagenomics, or the sequencing and analysis of collective genomes (metagenomes) of microorganisms isolated from an environment, promises direct access to the "unculturable majority". This emerging field offers the potential to lay solid basis on our understanding of the entire living world. However, the taxonomic classification is an essential task in the analysis of metagenomics data sets that it is still far from being solved. We present a novel strategy to predict the taxonomic origin of environmental genomic fragments. The proposed classifier combines the idea of the k-nearest neighbor with strategies from kernel-based learning.

Results: Our novel strategy was extensively evaluated using the leave-one-out cross validation strategy on fragments of variable length (800 bp - 50 Kbp) from 373 completely sequenced genomes. TACOA is able to classify genomic fragments of length 800 bp and 1 Kbp with high accuracy until rank class. For longer fragments > or = 3 Kbp accurate predictions are made at even deeper taxonomic ranks (order and genus). Remarkably, TACOA also produces reliable results when the taxonomic origin of a fragment is not represented in the reference set, thus classifying such fragments to its known broader taxonomic class or simply as "unknown". We compared the classification accuracy of TACOA with the latest intrinsic classifier PhyloPythia using 63 recently published complete genomes. For fragments of length 800 bp and 1 Kbp the overall accuracy of TACOA is higher than that obtained by PhyloPythia at all taxonomic ranks. For all fragment lengths, both methods achieved comparable high specificity results up to rank class and low false negative rates are also obtained.

Conclusion: An accurate multi-class taxonomic classifier was developed for environmental genomic fragments. TACOA can predict with high reliability the taxonomic origin of genomic fragments as short as 800 bp. The proposed method is transparent, fast, accurate and the reference set can be easily updated as newly sequenced genomes become available. Moreover, the method demonstrated to be competitive when compared to the most current classifier PhyloPythia and has the advantage that it can be locally installed and the reference set can be kept up-to-date.

PubMed Disclaimer

Figures

Figure 1
Figure 1
A sketch of the leave-one-out cross validation strategy adopted in this study. A genome is selected from the data set comprising 373 genomes and fragmented subsequently. The collection of genomic fragments is regarded as the test set from which each fragment is drawn and subsequently classified. Classification of each test fragment is carried out using the remaining 372 organisms as a reference.
Figure 2
Figure 2
Classification accuracy achieved for genomic fragments of different lengths. Bars depict detailed specificity and average values for specificity (Sp.), sensitivity (Sn.) and false negative rate (FNr.) for each fragment length on different taxonomic ranks. Each color represents a genomic fragment length.
Figure 3
Figure 3
Overall performance for reads and contigs for each taxonomic rank. Average sensitivity (Sn.), specificity (Sp.), and false negative rate (FNr.) achieved for reads and contigs at each taxonomic rank.
Figure 4
Figure 4
Classification accuracy obtained for TACOA and PhyloPythia. Sensitivity (top), specificity (middle) and false negative rate (bottom) achieved by TACOA and PhyloPythia for three different genomic fragment lengths and taxonomic ranks evaluated. Single read lengths are represented by fragments of length 800 bp and 1 Kbp and contigs by 10 Kbp long fragments. The accuracy achieved is depicted using green bars for TACOA and blue bars for PhyloPythia. The sensitivity and specificity charts are scaled between 0–100% and the false negative rate is scaled between 0–30%.
Figure 5
Figure 5
Distribution of taxonomic assignments for Thermoplasma acidophilum. Proportions of genomic fragments originating from the T. acidophilum genome that are misclassified into other taxonomic groups.

References

    1. Venter JC, Adams MD, Sutton GG, Kerlavage AR, Smith HO, Hunkapiller M. Shotgun sequencing of the human genome. Science. 1998;280:1540–1542. doi: 10.1126/science.280.5369.1540. - DOI - PubMed
    1. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci. 1997;74:5463–5467. doi: 10.1073/pnas.74.12.5463. - DOI - PMC - PubMed
    1. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269:496–512. doi: 10.1126/science.7542800. - DOI - PubMed
    1. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43. doi: 10.1038/nature02340. - DOI - PubMed
    1. Stein JL, Marsh TL, Wu KY, Shizuya H, DeLong EF. Characterization of uncultivated prokaryotes: isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon. J Bacteriol. 1996;178:591–599. - PMC - PubMed

Publication types

LinkOut - more resources