Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Jan 15;35(2):219-226.
doi: 10.1093/bioinformatics/bty611.

Metagenomic binning through low-density hashing

Affiliations

Metagenomic binning through low-density hashing

Yunan Luo et al. Bioinformatics. .

Abstract

Motivation: Vastly greater quantities of microbial genome data are being generated where environmental samples mix together the DNA from many different species. Here, we present Opal for metagenomic binning, the task of identifying the origin species of DNA sequencing reads. We introduce 'low-density' locality sensitive hashing to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning.

Results: On public benchmarks, Opal halves the error on precision/recall (F1-score) as compared with both alignment-based and alignment-free methods for species classification. We demonstrate even more marked improvement at higher taxonomic levels, allowing for the discovery of novel lineages. Furthermore, the innovation of low-density, even-coverage hashing should itself prove an essential methodological advance as it enables the application of machine learning to other bioinformatic challenges.

Availability and implementation: Full source code and datasets are available at http://opal.csail.mit.edu and https://github.com/yunwilliamyu/opal.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
Low-density hashing with even coverage. (a) Random projections onto subspaces (left) cover all positions evenly only in expectation, and for small numbers of hash functions, will give uneven coverage. Using Gallager-inspired LDPC codes allows us to guarantee even coverage of all positions in the k-mer (right) with a small number of hash functions. (b) Intuitively, one can think of a (k, t)-hash function as a 0/1 vector of length k with t 1’s specifying the locations in the k-mer that are selected. Given any (k, t)-hash function h (e.g. the vector with t 1’s followed by k -t 0’s), one can uniformly randomly construct another (k, t)-hash function by permuting the entries of h. The key to the Opal’s Gallager-inspired LSH design is that instead of starting with a single hash function and permuting it repeatedly, we start with a hash function matrixH which is a LDPC matrix. H is designed such that in the first row h1, the first t entries are 1, in the second row h2, the second t entries are 1 and so on, until each column of H has exactly one 1. Permuting the columns of H repeatedly generates random LSH functions that together cover all positions evenly, ensuring that we do not waste coding capacity on any particular position in the k-mer. Additionally, for very long k-mers, we can construct the Gallager LSH functions in a hierarchical way to further capture compositional dependencies from both local and global contexts (see Section 2). (c) The rows of H are then used as hash functions
Fig. 2.
Fig. 2.
Comparison of Opal against compositional SVM-based approaches. On a synthetic dataset of fragments of length 200 drawn from a dataset of 50 bacterial species (Vervier et al., 2016), using Opal LDPC hash functions (Opal-Gallagher) as features outperforms using the same method with uniformly random LSH functions (Opal-LSH), as well as using contiguous 16- and 12-mers, with (a) substitution errors and (b) indels. We note particularly good robustness against substitution errors
Fig. 3.
Fig. 3.
Comparison of Opal against Kraken, CLARK, CLARK-S, Kaiju and Metakallisto in terms of accuracy (top row), speed (middle row), and memory usage (last row): (a) Opal achieves generally higher classification accuracies on three public benchmark data sets than five other state-of-the-art compositional classifiers. Only CLARK-S is comparable in terms of accuracy, but CLARK-S uses an order of magnitude more memory while running significantly slower (processes fewer fragments/sec). (b) Opal has greater sensitivity to novel lineages in benchmarks on a large 193-species dataset [19]. We simulate the effect of novel species by removing a species from the dataset, and training at the genus level on the remaining data. Then, we predict the genus of simulated reads from the removed species. Similarly, we repeated the experiment removing all data from a genus, training at the phylum level, and attempting to predict the phylum of the removed genus. The improvement over Kaiju is particularly impressive as Kaiju bins using protein sequences, giving it an inherent advantage at higher phylogenetic levels, which we overcome using low-density hashing

References

    1. 1000 Genomes Project Consortium. (2012) An integrated map of genetic variation from 1, 092 human genomes. Nature, 491, 56–65. - PMC - PubMed
    1. Altschul S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410. - PubMed
    1. Alneberg J., et al. (2014) Binning metagenomic contigs by coverage and composition. Nat. Methods, 11, 1144. - PubMed
    1. Ames S.K., et al. (2013) Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics, 29, 2253–2260. - PMC - PubMed
    1. Andoni A., Indyk P. (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimension. In Foundations of Computer Science, FOCS'06. 47th Annual IEEE Symposium on, pp. 459–468. IEEE.

Publication types