PLoS Comput Biol. 2022 Sep 30;18(9):e1010493.
doi: 10.1371/journal.pcbi.1010493. eCollection 2022 Sep.

HAYSTAC: A Bayesian framework for robust and rapid species identification in high-throughput sequencing data

Evangelos A Dimopoulos et al. PLoS Comput Biol.

Abstract

Identification of specific species in metagenomic samples is critical for several key applications, yet many available tools require substantial computational power and are often prone to false positive identifications. Here we describe High-AccuracY and Scalable Taxonomic Assignment of MetagenomiC data (HAYSTAC), which can estimate the probability that a specific taxon is present in a metagenome. HAYSTAC provides a user-friendly tool to construct databases, based on publicly available genomes, that are used for competitive read mapping. It then uses a novel Bayesian framework to infer the abundance and statistical support for each species identification and provide per-read species classification. Unlike other methods, HAYSTAC is specifically designed to efficiently handle both ancient and modern DNA data, as well as incomplete reference databases, making it possible to run highly accurate hypothesis-driven analyses (i.e., assessing the presence of a specific species) on variably sized reference databases while dramatically improving processing speeds. We tested the performance and accuracy of HAYSTAC using simulated Illumina libraries, both with and without ancient DNA damage, and compared the results to other currently available methods (i.e., Kraken2/Bracken, KrakenUniq, MALT/HOPS, and Sigma). HAYSTAC identified fewer false positives than Kraken2/Bracken, KrakenUniq and MALT in all simulations, and fewer than Sigma in simulations of ancient data. It also uses less memory than Kraken2/Bracken, KrakenUniq and MALT during both database construction and sample analysis. Lastly, we used HAYSTAC to search for specific pathogens in two published ancient metagenomic datasets, demonstrating how it can be applied to empirical datasets. HAYSTAC is available from https://github.com/antonisdim/HAYSTAC.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. HAYSTAC’s workflow.
HAYSTAC consists of three main modules: (i) DATABASE, which builds a database of reference genomes from various user input sources; (ii) SAMPLE, which handles downloading sequencing files from the SRA and pre-processing of samples prior to analysis; and (iii) ANALYSE, which performs an analysis of a sample against a database by applying the mathematical model (see methods) for taxonomic abundance estimation.
Fig 2
Fig 2. Computational performance of HAYSTAC and other methods.
Linear regression of the elapsed time (wall clock) and peak memory usage (maximum resident set size) for a sample of size 1 million reads and reference databases containing either 10, 100 or 500 genomes, each with 5 replicates. When constructing the database, HAYSTAC uses substantially less memory and runs faster than Kraken2/Bracken, KrakenUniq or MALT for restricted database sizes. When performing analyses, HAYSTAC uses less memory than Kraken2/Bracken, KrakenUniq or MALT, while its runtime is only marginally slower.
Fig 3
Fig 3. Accuracy of HAYSTAC and other methods for a simple simulation.
Bar plot showing the mean count of false positives (red), false negatives (orange), and true detected species (blue) for two versions of the simple simulations dataset, each with two replicates: (A) without human DNA contamination (n = 2); and (B) with human DNA contamination (n = 2). The dotted line shows the number of simulated species in each set of samples (i.e., the maximum true positive; n = 10), and numbers above the error bars indicate the mean species count in each category. For the simulation without human contamination, HAYSTAC outperforms Kraken2/Bracken, KrakenUniq and MALT, and performs equally with Sigma (i.e., no false positives or false negatives). For the simulation with human contamination, HAYSTAC outperforms all four other methods. All of the 5 false positive species identified by HAYSTAC are known to contain human sequences in their reference genomes [37], confounding any analyses which do not explicitly filter human contamination.
Fig 4
Fig 4. Accuracy of HAYSTAC and other methods for an oral microbiome simulation.
Bar plot showing the mean count of false positives (red), false negatives (orange), and true detected species (blue) for two versions of the oral microbiome dataset, each with six replicates: (A) modern simulation, with fixed read lengths (n = 6); and (B) ancient simulation, with variable read lengths and post-mortem damage (n = 6). The dotted line shows the average number of simulated species in each set of samples (i.e., the maximum true positive; n = 178), and numbers above the error bars indicate the mean species count in each category. For the modern simulation, HAYSTAC substantially outperforms Kraken2/Bracken, KrakenUniq and MALT with respect to false positives, and performs equivalently with Sigma. For the ancient simulation, HAYSTAC outperforms all four other methods with respect to false positives. The overall high rates of false negative identifications are due to the absence of many simulated species from the reference database for all four methods. HAYSTAC also outperforms all four other methods in both the modern and ancient oral microbiome datasets by identifying the highest number of true positive species.
Fig 5
Fig 5. Accuracy of HAYSTAC and other methods using a reference database restricted to a single genus.
Bar plot, with a pseudo-log10 transformed y-axis, showing the mean count of false positives (red), false negatives (orange), and true detected species (blue) for nine different genera (Bacteroides, Burkholderia, Campylobacter, Clostridium, Corynebacterium, Desulfitobacterium, Mycobacterium, Solimonas and Streptococcus), each with 20 samples from the general and oral microbiome datasets. The dotted line shows the average number of simulated species in each set of samples (i.e., the maximum true positive; n = 2.0), and numbers above the error bars indicate the mean species count in each category. For the genus specific analysis, HAYSTAC substantially outperforms both Kraken2/Bracken and MALT with respect to false positives and performs better than Sigma.
Fig 6
Fig 6. HAYSTAC inferred posterior abundance levels.
Scatter plot showing the mean posterior abundances across all taxa (n = 362) and samples (n = 20) for either a genus specific database or the entire RefSeq representative database of prokaryotic species. Using a genus specific database introduces a small positive bias in mean posterior abundance for taxa within that genus (paired t-test p-value < 2.2·10⁻¹⁶, mean of the differences = 3.9·10⁻⁶); nevertheless, the overall abundance levels are highly correlated (R² = 0.999). Genus specific analyses run faster and use less memory, making them suitable for rapid initial screening (e.g. 1 million reads against a Corynebacterium specific database runs approximately 6.15 times faster than against a database containing 500 species and uses approximately 3.9 times less memory).
Fig 7
Fig 7. Histogram of the number of assigned pathobiont simulated reads.
Histogram showing the read count frequency of true positive (blue) and false positive (red) pathobiont reads as identified by HAYSTAC and HOPS, after screening 200 spiked iterations of an ancient Oral Microbiome dataset sample (anc200e2repgn). HAYSTAC robustly identifies more pathobiont reads than HOPS, while producing fewer false positive identifications.
Fig 8
Fig 8. Posterior abundances of Yersinia species in Case Study 1.
Heatmap showing the mean posterior abundances for the seven RISE samples, based on a genus specific analysis of 18 Yersinia species. Yersinia pestis is the species with the highest posterior abundance, followed by Y. pseudotuberculosis, in agreement with the results of Rasmussen et al. (2015).
Fig 9
Fig 9. Posterior abundances of oral microbiome species in Case Study 2.
Heatmap showing the mean posterior abundances for the 44 dental calculus samples, based on a custom database that combined the prokaryotic representative RefSeq and pathobionts with complete genomes from the following 5 genera: Corynebacterium, Haemophilus, Klebsiella, Streptococcus, and Bordetella. Species from these five genera of interest that naturally colonise the oral cavity can be found in more samples and at higher abundance (e.g. C. matruchotii) compared to pathobionts of the upper respiratory system (e.g. S. pneumoniae).
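
The accuracy comparisons in Figs 3 to 5 report mean counts of true positives, false positives, and false negatives across replicates. The sketch below shows one way such counts could be tallied from a set of simulated species and a set of detected species. It is only an illustration of the bookkeeping, not HAYSTAC's code or the authors' evaluation script; the names `Counts`, `score_detections`, and `mean_counts`, and the species labels, are hypothetical.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int


def score_detections(simulated: set[str], detected: set[str]) -> Counts:
    """Compare detected species against the simulated (ground-truth) set."""
    return Counts(
        true_positives=len(simulated & detected),   # simulated and detected
        false_positives=len(detected - simulated),  # detected but not simulated
        false_negatives=len(simulated - detected),  # simulated but missed
    )


def mean_counts(per_replicate: list[Counts]) -> tuple[float, ...]:
    """Average each category over replicates, as plotted in the bar charts."""
    n = len(per_replicate)
    return tuple(sum(c[i] for c in per_replicate) / n for i in range(3))


# Hypothetical species labels, not taken from the paper.
simulated = {"species_A", "species_B", "species_C"}
replicates = [
    score_detections(simulated, {"species_A", "species_B", "species_X"}),
    score_detections(simulated, {"species_A", "species_B", "species_C"}),
]
means = mean_counts(replicates)
```

Here the number of simulated species (3) plays the role of the dotted line in the figures: the true positive count can never exceed it, and true positives plus false negatives always sum to it.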

References

    1. Gonzalez A, Vázquez-Baeza Y, Pettengill JB, Ottesen A, McDonald D, Knight R. Avoiding Pandemic Fears in the Subway and Conquering the Platypus. mSystems. 2016;1(3). doi: 10.1128/mSystems.00050-16
    2. Tett A, Huang KD, Asnicar F, Fehlner-Peach H, Pasolli E, Karcher N, et al. The Prevotella copri Complex Comprises Four Distinct Clades Underrepresented in Westernized Populations. Cell Host Microbe. 2019;26(5):666–679.e7. doi: 10.1016/j.chom.2019.08.018
    3. Ahn TH, Chai J, Pan C. Sigma: strain-level inference of genomes from metagenomic analysis for biosurveillance. Bioinformatics. 2015;31(2):170–177. doi: 10.1093/bioinformatics/btu641
    4. Wilson MR, Sample HA, Zorn KC, Arevalo S, Yu G, Neuhaus J, et al. Clinical Metagenomic Sequencing for Diagnosis of Meningitis and Encephalitis. N Engl J Med. 2019;380(24):2327–2340. doi: 10.1056/NEJMoa1803396
    5. Spyrou MA, Bos KI, Herbig A, Krause J. Ancient pathogen genomics as an emerging tool for infectious disease research. Nat Rev Genet. 2019;20(6):323–340. doi: 10.1038/s41576-019-0119-1
