BMC Bioinformatics. 2020 Oct 15;21(1):459.

doi: 10.1186/s12859-020-03802-0.

In silico benchmarking of metagenomic tools for coding sequence detection reveals the limits of sensitivity and precision

Jonathan Louis Golob¹, Samuel Schwartz Minot²

Affiliations

¹ Infectious Diseases, Internal Medicine, Michigan Medicine, University of Michigan, Ann Arbor, MI, USA.
² Microbiome Research Initiative, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, E4-100, Seattle, WA, 98109-1024, USA. sminot@fredhutch.org.

PMID: 33059593
PMCID: PMC7559173
DOI: 10.1186/s12859-020-03802-0

Jonathan Louis Golob et al. BMC Bioinformatics. 2020.

Abstract

Background: High-throughput sequencing can establish the functional capacity of a microbial community by cataloging the protein-coding sequences (CDS) present in the metagenome of the community. The relative performance of different computational methods for identifying CDS from whole-genome shotgun sequencing is not fully established.

Results: Here we present an automated benchmarking workflow that uses synthetic shotgun sequencing reads, for which the true CDS content of the underlying communities is known, to determine the relative performance (sensitivity, positive predictive value or PPV, and computational efficiency) of different metagenome analysis tools for extracting the CDS content of a microbial community. Assembly-based methods are limited by coverage depth, with poor sensitivity for CDS at <5x depth of sequencing, but have excellent PPV. Mapping-based techniques are more sensitive at low coverage depths, but can struggle with PPV. We additionally describe an iterative algorithm based on expectation maximization, which we show improves the PPV of a mapping-based technique while retaining improved sensitivity and computational efficiency.

Conclusion: Our benchmarking approach reveals the trade-offs between assembly-based and alignment-based approaches, and the relative performance of specific implementations, when one wishes to extract the protein-coding capacity of microbial communities.

Keywords: Bioinformatics; Metagenomics; Microbiome.

Conflict of interest statement

The authors have no conflicts of interest to disclose.

Figures

**Fig. 1**
Positive predictive value (PPV), sensitivity, and uniqueness of CDS calls by metagenomic analysis approaches. Shown on a per-CDS basis for each analysis approach are the positive predictive value (true positives over true positives plus false positives), the sensitivity (true positives over true positives plus false negatives), both overall and restricted to CDS with 0–5x coverage, and the uniqueness (true positives over true positives plus duplicates).
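The three ratios defined in this caption can be written out as a short sketch. The function name `cds_metrics`, its argument names, and the example counts are hypothetical and chosen only for illustration; they are not taken from the paper's workflow or data.

```python
def cds_metrics(true_pos, false_pos, false_neg, duplicates):
    """Compute PPV, sensitivity, and uniqueness from per-CDS counts.

    The formulas follow the caption's definitions; the argument names
    are illustrative assumptions, not the authors' code.
    """
    def ratio(numerator, denominator):
        # Return NaN rather than raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "ppv": ratio(true_pos, true_pos + false_pos),
        "sensitivity": ratio(true_pos, true_pos + false_neg),
        "uniqueness": ratio(true_pos, true_pos + duplicates),
    }


# Made-up counts, for illustration only.
example = cds_metrics(true_pos=90, false_pos=10, false_neg=30, duplicates=10)
print(example)
```

With these made-up counts, the PPV and uniqueness are both 90/100 and the sensitivity is 90/120, which shows how a method can call few false positives while still missing a quarter of the true CDS.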

**Fig. 2**
Sensitivity and uniqueness of CDS calls with respect to CDS coverage depth. Mapping-based approaches are more sensitive than assembly-based methods and reach a plateau of sensitivity at a lower coverage depth.

**Fig. 3**
The problem of multiply-mapping short reads, and a schematic of the FAMLI algorithm. Three hundred and sixty simulated reads were generated from three CDS. These simulated reads were aligned against the UniRef100 database, and all CDS with an alignment within 10% identity of the best match were retained. (a) The read-depth coverage of the three true peptides (top). (b) Evenness filtering is used to remove from consideration the references that are least likely to be present. The left column shows three randomly selected references that are successfully filtered at this step; the right column shows three false references that are not filtered. (c) The iterative likelihood-based filtering of one randomly selected synthetic read. Each circle represents one remaining aligned reference CDS for this read; the true positive origin reference is in dark green. The length of each line is proportional to the calculated score at this iteration. (d) The number of CDS per read as a violin plot. After the tenth iteration, only one reference CDS (the correct one) remains for this read.
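The general idea behind the iterative filtering in panel c can be illustrated with a toy sketch. This is not the FAMLI implementation: the function name `prune_multimapped`, the `keep_fraction` threshold, the scoring rule, and the example alignments are all assumptions made for illustration, and the evenness filtering of panel b is omitted. Each read shares its weight among its remaining candidate references, each candidate is re-scored by the total support its reference receives, and weakly supported candidates are dropped until nothing changes.

```python
from collections import defaultdict


def prune_multimapped(alignments, keep_fraction=0.9, max_iter=10):
    """Iteratively drop weakly supported candidate references per read.

    alignments: {read_id: {ref_id: positive alignment score}}
    keep_fraction and max_iter are hypothetical tuning parameters.
    """
    current = {read: dict(refs) for read, refs in alignments.items()}
    for _ in range(max_iter):
        # Each read splits one unit of weight among its candidates,
        # in proportion to alignment score; sum that weight per reference.
        ref_weight = defaultdict(float)
        for refs in current.values():
            total = sum(refs.values())
            for ref, score in refs.items():
                ref_weight[ref] += score / total

        changed = False
        for read, refs in current.items():
            if not refs:
                continue
            # Re-score each candidate by how much support its reference has.
            rescored = {ref: score * ref_weight[ref] for ref, score in refs.items()}
            best = max(rescored.values())
            kept = {ref: refs[ref] for ref, s in rescored.items()
                    if s >= keep_fraction * best}
            if len(kept) < len(refs):
                changed = True
            current[read] = kept
        if not changed:
            break
    return {read: sorted(refs) for read, refs in current.items()}


# Made-up alignments: reference "B" is supported only by ambiguous reads.
toy = {
    "r1": {"A": 100.0, "B": 100.0},
    "r2": {"A": 100.0},
    "r3": {"A": 100.0, "B": 95.0},
    "r4": {"C": 100.0},
}
result = prune_multimapped(toy)
print(result)
```

In this toy input, reference "B" is never the sole candidate for any read, so its candidates are pruned in the first pass and every read ends with a single reference, mirroring the convergence to one remaining CDS described for the read in panel c.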

Grants and funding

R01 AI134808/AI/NIAID NIH HHS/United States
