Bioinformatics. 2017 Jul 15;33(14):2082-2088.
doi: 10.1093/bioinformatics/btx106.

Pseudoalignment for metagenomic read assignment

L Schaeffer et al. Bioinformatics.

Abstract

Motivation: Read assignment is an important first step in many metagenomic analysis workflows, providing the basis for identification and quantification of species. However, ambiguity among the sequences of many strains makes it difficult to assign reads at the lowest level of taxonomy, so reads are typically assigned to the taxonomic levels at which they are unambiguous. We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data in order to develop novel methods for rapid and accurate quantification of metagenomic strains.

Results: We find that the recent idea of pseudoalignment, introduced in the RNA-Seq context, is highly applicable in the metagenomics setting. When pseudoalignment is coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state-of-the-art software, making it possible and practical for the first time to analyze abundances of individual genomes in metagenomics projects.
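
To illustrate how pseudoalignment and EM can fit together, here is a minimal sketch in Python. It is not the kallisto implementation. It assumes that pseudoalignment has already grouped reads into equivalence classes (sets of genomes each read is compatible with), and it ignores genome lengths, which a real quantifier would need to account for. The function name `em_abundances` and the toy counts are hypothetical.

```python
def em_abundances(eq_class_counts, n_genomes, n_iter=200, tol=1e-10):
    """Estimate relative genome abundances with a simplified EM.

    eq_class_counts maps a tuple of compatible genome indices to the
    number of reads in that equivalence class (hypothetical input format).
    Assumption: all genomes have equal length, so no length correction.
    """
    total = sum(eq_class_counts.values())
    # Start from a uniform abundance estimate.
    alpha = [1.0 / n_genomes] * n_genomes
    for _ in range(n_iter):
        expected = [0.0] * n_genomes
        # E-step: split each class's reads among its compatible genomes
        # in proportion to the current abundance estimates.
        for genomes, count in eq_class_counts.items():
            denom = sum(alpha[g] for g in genomes)
            if denom == 0.0:
                continue  # defensive guard; no mass to distribute
            for g in genomes:
                expected[g] += count * alpha[g] / denom
        # M-step: renormalize expected read counts into abundances.
        updated = [x / total for x in expected]
        delta = max(abs(a - b) for a, b in zip(alpha, updated))
        alpha = updated
        if delta < tol:
            break
    return alpha


if __name__ == "__main__":
    # Toy data: 30 reads unique to strain 0, 10 unique to strain 1,
    # 60 reads ambiguous between strains 0 and 1, none for strain 2.
    counts = {(0,): 30, (1,): 10, (0, 1): 60}
    print(em_abundances(counts, n_genomes=3))
```

In this toy case the ambiguous reads are shared out according to the evidence from the uniquely assigned reads, so the estimates approach 0.75 for strain 0, 0.25 for strain 1, and 0 for strain 2. Reads that a taxonomy-based assigner would push up to the species level are instead apportioned among strains.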

Availability and implementation: Pipeline and analysis code can be downloaded from http://github.com/pachterlab/metakallisto.

Contact: lpachter@math.berkeley.edu.


Figures

Fig. 1. Results of kallisto on simulated reads pseudoaligned to the ensembl dataset at the exact genome level. The solid line indicates the actual counts simulated from each strain, while circle and triangle markers indicate the counts estimated by kallisto. Triangles are read counts assigned to strains that aren’t actually present in the dataset.

Fig. 2. Results of kallisto (top), Bracken (middle) and CLARK (bottom) on simulated reads pseudoaligned to the ensembl dataset at the species level.

Fig. 3. Results of kallisto on bacterial reads in human saliva samples at all taxonomic levels.
