PLoS One. 2015 Oct 23;10(10):e0140644.

doi: 10.1371/journal.pone.0140644. eCollection 2015.

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

David Koslicki¹, Saikat Chatterjee², Damon Shahrivar², Alan W Walker³, Suzanna C Francis⁴, Louise J Fraser⁵, Mikko Vehkaperä⁶, Yueheng Lan⁷, Jukka Corander⁸

Affiliations

¹ Dept of Mathematics, Oregon State University, Corvallis, United States of America.
² Dept of Communication Theory, KTH Royal Institute of Technology, Stockholm, Sweden.
³ Microbiology Group, Rowett Institute of Nutrition and Health, University of Aberdeen, Aberdeen, United Kingdom.
⁴ MRC Tropical Epidemiology Group, London School of Hygiene and Tropical Medicine, London, United Kingdom.
⁵ Illumina Cambridge Ltd., Chesterford Research Park, Essex, United Kingdom.
⁶ Dept of Electronic and Electrical Engineering, University of Sheffield, Sheffield, United Kingdom.
⁷ Dept of Physics, Tsinghua University, Beijing, China.
⁸ Dept of Mathematics and Statistics, University of Helsinki, Helsinki, Finland.

PMID: 26496191
PMCID: PMC4619776
DOI: 10.1371/journal.pone.0140644

David Koslicki et al. PLoS One. 2015.

Abstract

Motivation: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging.

Results: There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to estimate bacterial community composition. These methods typically summarize the sequence data by the frequencies of low-order k-mers and match this information statistically against a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step: a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, yielding several vectors of first-order statistics instead of a single statistical summary in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity.

Availability: An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
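
The pipeline outlined in the Results paragraph can be illustrated with a minimal Python sketch. It is not the authors' Julia or Matlab implementation. It assumes that the per-cluster estimates are combined with weights equal to each cluster's share of the reads, which is one natural reading of the mixture density formulation but is not spelled out in the abstract. The names `kmer_freqs`, `kmeans`, `ark_estimate`, `base_estimator` and `toy_estimator` are hypothetical; `base_estimator` stands in for an underlying k-mer-based method and does not reproduce the interface of any real tool.

```python
import random
from collections import Counter
from itertools import product


def kmer_freqs(read, k=2):
    """Normalized k-mer frequency vector of one read over the A/C/G/T alphabet."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, n_clusters)  # random initial centers
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [
            min(range(n_clusters), key=lambda c: sq_dist(v, centers[c]))
            for v in vectors
        ]
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean_vector(members)
    return labels


def ark_estimate(reads, base_estimator, n_clusters=2, k=2):
    """Cluster reads, estimate composition per cluster, then combine.

    Assumption: clusters are weighted by their fraction of the reads.
    """
    vectors = [kmer_freqs(r, k) for r in reads]
    labels = kmeans(vectors, n_clusters)
    combined = None
    for c in range(n_clusters):
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        if not members:
            continue
        weight = len(members) / len(vectors)
        estimate = base_estimator(mean_vector(members))
        if combined is None:
            combined = [0.0] * len(estimate)
        combined = [acc + weight * e for acc, e in zip(combined, estimate)]
    return combined


def toy_estimator(freqs):
    """Hypothetical stand-in estimator for k=1: returns [A+T mass, C+G mass]."""
    a, c, g, t = freqs
    return [a + t, c + g]


reads = ["AAAA", "AATT", "GGCC", "CCCC"]
composition = ark_estimate(reads, toy_estimator, n_clusters=2, k=1)
print(composition)  # each component is 0.5 up to floating-point rounding
```

The toy estimator is linear, so clustering cannot change its output here; it only exercises the code path. Any benefit from aggregation would have to come from a nonlinear underlying estimator applied to the per-cluster k-mer summaries.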

Conflict of interest statement

Competing Interests: L.J.F. received funding in the form of salary from Illumina Cambridge Ltd. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials.

Figures

**Fig 1. A flow-chart of the ARK method.**

**Fig 2. Results for the random K-means clustering on the simulated data.**
Mean VD error at the genus level as a function of the number of clusters. Note the improvement that ARK contributes to each method.

**Fig 3. Results for the random K-means clustering on the simulated data.**
Mean execution time increase (factor given in comparison to running SEK or Quikr in the absence of ARK) as a function of number of clusters. The dashed line represents a line with slope 1.

**Fig 4. Comparison of the underlying algorithms with and without ARK.**
Results are for the random K-means clustering on the simulated data when fixing the number of clusters to 75. Mean VD error at the genus level. Included for comparison are results for RDP’s NBC (compare to Fig 2(b) of [3]).

**Fig 5. Comparison of the underlying algorithms with and without ARK.**
Results are for the random K-means clustering on the simulated data when fixing the number of clusters to 75. Boxplot of the individual simulated sample execution times. Mean execution times for Quikr and ARK Quikr were 1.75 seconds and 4.71 minutes, while for SEK and ARK SEK they were 21.26 seconds and 19.21 minutes respectively. Mean execution time for RDP’s NBC was 38.19 minutes.

**Fig 6. Total execution time for each method on the 28 samples of real biological data.**

**Fig 7. PCoA plots using the Jensen-Shannon divergence for RDP’s NBC.**

**Fig 8. PCoA plots using the Jensen-Shannon divergence for ARK SEK.**

**Fig 9. ARK Quikr PCoA plots (using the Jensen-Shannon divergence) on the real biological data.**
In this case, we have labeling by body site. Note the clustering.

**Fig 10. ARK Quikr PCoA plots (using the Jensen-Shannon divergence) on the real biological data.**
In this case, we have labeling by variable region. Note the clustering.

References

1. Wang Q, Garrity GM, Tiedje JM, Cole JR. Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 2007;73(16):5261–5267. doi: 10.1128/AEM.00062-07
2. Meinicke P, Aßhauer KP, Lingner T. Mixture models for analysis of the taxonomic composition of metagenomes. Bioinformatics. 2011;27(12):1618–1624. doi: 10.1093/bioinformatics/btr266
3. Koslicki D, Foucart S, Rosen G. Quikr: a method for rapid reconstruction of bacterial communities via compressive sensing. Bioinformatics. 2013;29(17):2096–2102. doi: 10.1093/bioinformatics/btt336
4. Ong SH, Kukkillaya VU, Wilm A, Lay C, Ho EXP, Low L, et al. Species Identification and Profiling of Complex Microbial Communities Using Shotgun Illumina Sequencing of 16S rRNA Amplicon Sequences. PLoS One. 2013;8(4):e60811. doi: 10.1371/journal.pone.0060811
5. Dröge J, Gregor I, McHardy A. Taxator-tk: Precise Taxonomic Assignment of Metagenomes by Fast Approximation of Evolutionary Neighborhoods. Bioinformatics. 2014;31(6):817–824. doi: 10.1093/bioinformatics/btu745
