. 2019 Jan 15;35(2):219-226.

doi: 10.1093/bioinformatics/bty611.

Metagenomic binning through low-density hashing

Yunan Luo¹, Yun William Yu^{2

3}, Jianyang Zeng⁴, Bonnie Berger³, Jian Peng¹

Affiliations

¹ Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
² Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
³ Department of Mathematics and Computer Science and AI Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
⁴ Machine Learning and Computational Biology Group, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China and.

PMID: 30010790
PMCID: PMC6330020
DOI: 10.1093/bioinformatics/bty611

Metagenomic binning through low-density hashing

Yunan Luo et al. Bioinformatics. 2019.

. 2019 Jan 15;35(2):219-226.

doi: 10.1093/bioinformatics/bty611.

Authors

Yunan Luo¹, Yun William Yu^{2

3}, Jianyang Zeng⁴, Bonnie Berger³, Jian Peng¹

Affiliations

¹ Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
² Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
³ Department of Mathematics and Computer Science and AI Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
⁴ Machine Learning and Computational Biology Group, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China and.

PMID: 30010790
PMCID: PMC6330020
DOI: 10.1093/bioinformatics/bty611

Abstract

Motivation: Vastly greater quantities of microbial genome data are being generated where environmental samples mix together the DNA from many different species. Here, we present Opal for metagenomic binning, the task of identifying the origin species of DNA sequencing reads. We introduce 'low-density' locality sensitive hashing to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning.

Results: On public benchmarks, Opal halves the error on precision/recall (F1-score) as compared with both alignment-based and alignment-free methods for species classification. We demonstrate even more marked improvement at higher taxonomic levels, allowing for the discovery of novel lineages. Furthermore, the innovation of low-density, even-coverage hashing should itself prove an essential methodological advance as it enables the application of machine learning to other bioinformatic challenges.

Availability and implementation: Full source code and datasets are available at http://opal.csail.mit.edu and https://github.com/yunwilliamyu/opal.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

**Fig. 1.**
Low-density hashing with even coverage. **(a)** Random projections onto subspaces (left) cover all positions evenly only in expectation, and for small numbers of hash functions, will give uneven coverage. Using Gallager-inspired LDPC codes allows us to guarantee even coverage of all positions in the $k$ -mer (right) with a small number of hash functions. **(b)** Intuitively, one can think of a ( $k$ , $t$ )-hash function as a 0/1 vector of length $k$ with $t$ 1’s specifying the locations in the $k$ -mer that are selected. Given any ( $k$ , $t$ )-hash function h (e.g. the vector with $t$ 1’s followed by $k$ - $t$ 0’s), one can uniformly randomly construct another ( $k$ , $t$ )-hash function by permuting the entries of $h$ . The key to the Opal’s Gallager-inspired LSH design is that instead of starting with a single hash function and permuting it repeatedly, we start with a hash function *matrix* $H$ which is a LDPC matrix. $H$ is designed such that in the first row $h_{1}$ , the first $t$ entries are 1, in the second row $h_{2}$ , the second $t$ entries are 1 and so on, until each column of $H$ has exactly one 1. Permuting the columns of $H$ repeatedly generates random LSH functions that together cover all positions evenly, ensuring that we do not waste coding capacity on any particular position in the $k$ -mer. Additionally, for very long $k$ -mers, we can construct the Gallager LSH functions in a hierarchical way to further capture compositional dependencies from both local and global contexts (see Section 2). **(c)** The rows of $H$ are then used as hash functions

**Fig. 2.**
Comparison of Opal against compositional SVM-based approaches. On a synthetic dataset of fragments of length 200 drawn from a dataset of 50 bacterial species (Vervier *et al.*, 2016), using Opal LDPC hash functions (Opal-Gallagher) as features outperforms using the same method with uniformly random LSH functions (Opal-LSH), as well as using contiguous 16- and 12-mers, with **(a)** substitution errors and **(b)** indels. We note particularly good robustness against substitution errors

**Fig. 3.**
Comparison of Opal against Kraken, CLARK, CLARK-S, Kaiju and Metakallisto in terms of accuracy (top row), speed (middle row), and memory usage (last row): **(a)** Opal achieves generally higher classification accuracies on three public benchmark data sets than five other state-of-the-art compositional classifiers. Only CLARK-S is comparable in terms of accuracy, but CLARK-S uses an order of magnitude more memory while running significantly slower (processes fewer fragments/sec). **(b)** Opal has greater sensitivity to novel lineages in benchmarks on a large 193-species dataset [19]. We simulate the effect of novel species by removing a species from the dataset, and training at the genus level on the remaining data. Then, we predict the genus of simulated reads from the removed species. Similarly, we repeated the experiment removing all data from a genus, training at the phylum level, and attempting to predict the phylum of the removed genus. The improvement over Kaiju is particularly impressive as Kaiju bins using protein sequences, giving it an inherent advantage at higher phylogenetic levels, which we overcome using low-density hashing

See this image and copyright information in PMC

References

1. 1000 Genomes Project Consortium. (2012) An integrated map of genetic variation from 1, 092 human genomes. Nature, 491, 56–65. - PMC - PubMed
1. Altschul S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410. - PubMed
1. Alneberg J., et al. (2014) Binning metagenomic contigs by coverage and composition. Nat. Methods, 11, 1144. - PubMed
1. Ames S.K., et al. (2013) Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics, 29, 2253–2260. - PMC - PubMed
1. Andoni A., Indyk P. (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimension. In Foundations of Computer Science, FOCS'06. 47th Annual IEEE Symposium on, pp. 459–468. IEEE.

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

R01 GM108348/GM/NIGMS NIH HHS/United States

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Metagenomic binning through low-density hashing

Affiliations

Metagenomic binning through low-density hashing

Authors

Affiliations

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources