BMC Genomics. 2015 Mar 25;16(1):236.
doi: 10.1186/s12864-015-1419-2.

CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers

Rachid Ounit et al. BMC Genomics. 2015.

Abstract

Background: The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it remains computationally challenging because of the size of the datasets that modern sequencing technologies can produce.

Results: We introduce CLARK, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to that of the best state-of-the-art tools, and that it is significantly faster than any of its competitors. In its fastest single-threaded mode, CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions.

Conclusions: CLARK is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at http://clark.cs.ucr.edu/ .
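
To make the idea of discriminative k-mers concrete, the following is a minimal Python sketch, not CLARK's actual implementation. It assumes that a k-mer is "discriminative" when it occurs in exactly one target, that a read is assigned to the target with the most discriminative k-mer hits, and that the confidence score is the top hit count divided by the sum of the top two hit counts. The function names (`build_discriminative_index`, `classify`) and the toy sequences are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every k-length substring of seq, uppercased."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_discriminative_index(targets, k):
    """Map each k-mer found in exactly one target to that target's label.

    targets: dict of label -> list of reference sequences (assumed layout).
    """
    owners = defaultdict(set)
    for label, sequences in targets.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                owners[kmer].add(label)
    # Keep only k-mers owned by a single target; shared k-mers are dropped.
    return {
        kmer: next(iter(labels))
        for kmer, labels in owners.items()
        if len(labels) == 1
    }


def classify(read, index, k):
    """Return (label, confidence), or (None, 0.0) if no k-mer of the read hits."""
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    if not hits:
        return None, 0.0
    ranked = hits.most_common(2)
    best_label, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0
    # Assumed confidence: share of the top target among the top two targets.
    return best_label, best / (best + second)


# Toy usage with made-up sequences.
targets = {"A": ["ACGTACGTAA"], "B": ["TTGGCCTTGG"]}
index = build_discriminative_index(targets, k=4)
label, confidence = classify("ACGTAC", index, k=4)
```

Because shared k-mers are discarded when the index is built, each lookup during classification is a single dictionary access, which is one plausible reason a k-mer approach of this kind can be fast.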

Figures

Figure 1
Classification performance of CLARK for several k-mer lengths and for various datasets. CLARK’s precision, sensitivity, assignment rate, average confidence scores, and precision of high confidence assignments (HC) for several choices of the k-mer length on the “HiSeq” metagenomic dataset (a), the “MiSeq” metagenomic dataset (b), the “simBA-5” metagenomic dataset (c), the “simHC.20.500” metagenomic dataset (d), and barley unigenes (e). (a)-(d) are results of the classification against the 695 genus-level targets.
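
The caption names precision, sensitivity, and assignment rate without defining them here. As a hedged illustration, the sketch below uses one common convention in read classification benchmarks: precision is correct assignments over all assignments made, sensitivity is correct assignments over all reads, and assignment rate is assigned reads over all reads. These definitions are an assumption rather than something stated in this text, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(truth, predicted):
    """Compute precision, sensitivity, and assignment rate.

    truth: list of true labels, one per read.
    predicted: list of predicted labels, with None for unassigned reads.
    """
    total = len(truth)
    assigned = sum(1 for p in predicted if p is not None)
    correct = sum(
        1 for t, p in zip(truth, predicted) if p is not None and p == t
    )
    return {
        "precision": correct / assigned if assigned else 0.0,
        "sensitivity": correct / total if total else 0.0,
        "assignment_rate": assigned / total if total else 0.0,
    }


# Toy usage: four reads, one left unassigned, one misassigned.
metrics = classification_metrics(["A", "A", "B", "B"], ["A", None, "B", "A"])
```

Under these assumed definitions, sensitivity can never exceed either precision or the assignment rate, which is why the three quantities are reported together.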
