mSystems. 2022 Oct 26;7(5):e0075822.
doi: 10.1128/msystems.00758-22. Epub 2022 Sep 8.

Zebra: Static and Dynamic Genome Cover Thresholds with Overlapping References

Daniel Hakim et al. mSystems.

Abstract

Assigning taxonomy remains a challenging topic in microbiome studies, due largely to ambiguity of reads which overlap multiple reference genomes. With the Web of Life (WoL) reference database hosting 10,575 reference genomes and growing, the percentage of ambiguous reads will only increase. The resulting artifacts create both the illusion of co-occurrence and a long tail end of extraneous reference hits that confound interpretation. We introduce genome cover, the fraction of reference genome overlapped by reads, to distinguish these artifacts. We show how to dynamically predict genome cover by read count and examine our model in Staphylococcus aureus monoculture. Our modeling cleanly separates both S. aureus and true contaminants from the false artifacts of reference overlap. We next introduce saturated genome cover, the true fraction of a reference genome overlapped by sample contents. Genome cover may not saturate for low abundance or low prevalence bacteria. We assuage this worry with examination of a large human fecal data set. By compositing the metric across like samples, genome cover saturates even for rare species. We note that it is a threshold on saturated genome cover, not genome cover itself, which indicates a spurious reference hit or distant relative. We present Zebra, a method to compute and threshold the genome cover metric across like samples, a recurrence to estimate genome cover and confirm saturation, and provide guidance for choosing cover thresholds in real world scenarios. Standalone genome cover and integration into Woltka are available: https://github.com/biocore/zebra_filter, https://github.com/qiyunzhu/woltka.

IMPORTANCE Taxonomic assignment, assigning sequences to specific taxonomic units, is a crucial processing step in microbiome analyses. Issues in taxonomic assignment affect interpretation of what microbes are present in each sample and may be associated with specific environmental or clinical conditions. Assigning importance to a particular taxon relies strongly on independence of assigned counts. The false inclusion of thousands of correlated taxa makes interpretation ambiguous, leading to underconstrained results which cannot be reproduced. The importance sometimes attached to implausible artifacts such as anthrax or bubonic plague is especially problematic. We show that the Zebra filter retrieves only the nearest relatives of sample contents enabling more reproducible and biologically plausible interpretation of metagenomic data.
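
The genome cover metric defined in the abstract (the fraction of a reference genome overlapped by reads) can be illustrated with a minimal sketch. This is not the zebra_filter or Woltka implementation. The function names, the half-open interval convention, and the example coordinates are assumptions made for illustration only.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def genome_cover(intervals, genome_length):
    """Fraction of the reference overlapped by at least one read."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / genome_length


# Hypothetical alignments of three 150 bp reads to a 1,000 bp reference.
reads = [(0, 150), (100, 250), (600, 750)]
cover = genome_cover(reads, 1000)
print(cover)
```

Because overlapping reads are merged before counting, read depth does not inflate the metric: many reads piled onto one shared region still yield a low cover.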

Keywords: metagenomics; microbiome; read filtering.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

FIG 1
Modeling genome cover by read count differentiates low abundance contaminants from overlapping references in S. aureus monocultures. Clusters (A) to (D) determined by thresholding of Staphylococcus warneri SG1. Red indicates the line of best fit. A reported slope of 1 with no residual would indicate a perfect model fit. Mean predicted cover calculated using assigned read count and genome length assuming fixed 150 bp read length. As the number of reads increases, measured cover asymptotically approaches the overlap between true sample content and the assigned reference genome.
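
One way to picture a cover prediction from read count, genome length, and a fixed 150 bp read length is the sketch below. It assumes reads land uniformly at random on a circular genome that is fully present in the sample. This is a standard simplification and is not confirmed to be the recurrence the authors use. The function names and the 2,800,000 bp genome length are illustrative assumptions.

```python
def expected_cover_recurrence(n_reads, genome_length, read_length=150):
    """Expected cover after n_reads, built up one read at a time.

    Assumption: each read covers a fraction p = read_length / genome_length
    of the genome, and on average a fraction (1 - cover) of that is new.
    """
    p = read_length / genome_length
    cover = 0.0
    for _ in range(n_reads):
        cover += (1.0 - cover) * p
    return cover


def expected_cover_closed_form(n_reads, genome_length, read_length=150):
    """Closed form of the same recurrence: 1 - (1 - p) ** n_reads."""
    p = read_length / genome_length
    return 1.0 - (1.0 - p) ** n_reads


# Illustrative genome length in bp (an assumption, not a value from the paper).
GENOME_LENGTH = 2_800_000
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(expected_cover_closed_form(n, GENOME_LENGTH), 4))
```

Under these assumptions the predicted cover rises almost linearly at low read counts and flattens as it approaches 1. A reference that only partly overlaps the true sample content would instead be expected to plateau below 1, which is the asymptote the caption describes.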
FIG 2
Cumulative genome cover in iMSMS. Genome cover for each reference taxon is accumulated across shuffled iMSMS samples. Sample depth varies from 10^5 to 10^7, median 10^6 reads. (a) Number of taxa passing cover threshold as samples are accumulated. (b) Tracing of reads assigned to Yersinia pestis. (c) Accumulated cover of Yersinia pestis and related microbes by sample. (d) Accumulated cover of Bacillus anthracis and related microbes by sample.
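
The accumulation described in this caption can be sketched as follows: covered intervals for each reference are pooled sample by sample, and the number of references passing a cover threshold is recorded after each sample. The data layout, function names, taxon labels, and the 0.5 threshold are hypothetical and chosen only for illustration. They are not taken from the paper or from zebra_filter.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def cumulative_cover(samples, genome_lengths):
    """Return, after each sample, the pooled cover of every taxon seen so far.

    samples: list of dicts mapping taxon -> list of (start, end) intervals.
    genome_lengths: dict mapping taxon -> reference length in bp.
    """
    pooled = defaultdict(list)
    history = []
    for sample in samples:
        for taxon, intervals in sample.items():
            pooled[taxon] = merge_intervals(pooled[taxon] + list(intervals))
        history.append({
            taxon: sum(e - s for s, e in ivs) / genome_lengths[taxon]
            for taxon, ivs in pooled.items()
        })
    return history


def taxa_passing(history, threshold):
    """Count taxa whose accumulated cover meets the threshold at each step."""
    return [sum(c >= threshold for c in covers.values()) for covers in history]


# Hypothetical toy data: two 1,000 bp references across three samples.
genome_lengths = {"taxon_A": 1000, "taxon_B": 1000}
samples = [
    {"taxon_A": [(0, 300)], "taxon_B": [(0, 100)]},
    {"taxon_A": [(200, 600)], "taxon_B": [(50, 120)]},
    {"taxon_A": [(600, 900)]},
]
history = cumulative_cover(samples, genome_lengths)
passing = taxa_passing(history, threshold=0.5)
print(passing)
```

In this toy data, taxon_A keeps gaining new territory as samples are added, while taxon_B keeps hitting the same short region and stays at low cover. That second pattern is the saturation behavior the abstract associates with a spurious hit or distant relative.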
