ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition
- PMID: 26496191
- PMCID: PMC4619776
- DOI: 10.1371/journal.pone.0140644
Abstract
Motivation: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.
Results: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically summarize the sequence data by frequencies of low-order k-mers and match this information statistically against a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation is a pre-processing step in which a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, yielding several vectors of first-order statistics (one k-mer frequency vector per cluster) instead of a single statistical summary of the whole sample. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is justified by a statistical argument based on a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.
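The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' Julia or Matlab implementation. Several pieces are assumptions made for illustration: the function names, the plain Lloyd-style K-means, the multiplicative-update non-negative least-squares solver standing in for a real k-mer-based estimator such as SEK, and the final step that combines per-cluster estimates weighted by cluster size (one natural reading of the mixture formulation, since the abstract does not spell out the post-processing). The matrix `A` is a hypothetical reference matrix whose columns hold the k-mer frequencies of known taxa.

```python
import itertools
import numpy as np


def kmer_frequencies(read, k=2, alphabet="ACGT"):
    """Return the normalized k-mer frequency vector of one read."""
    index = {"".join(p): i
             for i, p in enumerate(itertools.product(alphabet, repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def kmeans(X, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every read to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels


def nonneg_least_squares(A, x, n_iter=500, eps=1e-12):
    """Stand-in estimator (assumption): multiplicative updates for
    min ||A p - x||^2 subject to p >= 0, valid for non-negative A and x."""
    p = np.full(A.shape[1], 1.0 / A.shape[1])
    AtA, Atx = A.T @ A, A.T @ x
    for _ in range(n_iter):
        p *= Atx / (AtA @ p + eps)
    return p


def ark_estimate(read_features, A, n_clusters, estimator=nonneg_least_squares):
    """Cluster the reads, estimate a composition per cluster from its mean
    k-mer vector, then combine the estimates weighted by cluster size."""
    labels = kmeans(read_features, n_clusters)
    estimate = np.zeros(A.shape[1])
    for c in range(n_clusters):
        members = read_features[labels == c]
        if len(members) == 0:
            continue
        weight = len(members) / len(read_features)
        estimate += weight * estimator(A, members.mean(axis=0))
    return estimate / estimate.sum()  # normalize to a composition
```

With a single cluster, this sketch reduces to the usual approach of estimating from one sample-wide k-mer vector. Passing the estimator as an argument reflects the idea that the aggregation step is a pre-processing layer that can sit in front of different estimation methods.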
Availability: An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.