BMC Bioinformatics. 2020 Oct 15;21(1):459.
doi: 10.1186/s12859-020-03802-0.

In silico benchmarking of metagenomic tools for coding sequence detection reveals the limits of sensitivity and precision

Jonathan Louis Golob et al. BMC Bioinformatics. 2020.

Abstract

Background: High-throughput sequencing can establish the functional capacity of a microbial community by cataloging the protein-coding sequences (CDS) present in the metagenome of the community. The relative performance of different computational methods for identifying CDS from whole-genome shotgun sequencing is not fully established.

Results: Here we present an automated benchmarking workflow that uses synthetic shotgun sequencing reads, for which the true CDS content of the underlying communities is known, to determine the relative performance (sensitivity, positive predictive value or PPV, and computational efficiency) of different metagenome analysis tools for extracting the CDS content of a microbial community. Assembly-based methods are limited by coverage depth, with poor sensitivity for CDS at <5X depth of sequencing, but have excellent PPV. Mapping-based techniques are more sensitive at low coverage depths but can struggle with PPV. We additionally describe an iterative, expectation-maximization-based algorithm that we show improves the PPV of a mapping-based technique while retaining its improved sensitivity and computational efficiency.

Conclusion: Our benchmarking approach reveals the trade-offs between assembly-based and alignment-based approaches, and the relative performance of specific implementations, when one wishes to extract the protein-coding capacity of microbial communities.

Keywords: Bioinformatics; Metagenomics; Microbiome.

Conflict of interest statement

The authors have no conflicts of interest to disclose.

Figures

Fig. 1
Positive predictive value (PPV), sensitivity, and uniqueness of CDS calls by metagenomic analysis approaches. Positive predictive value (true positives over true positives plus false positives), sensitivity (true positives over true positives plus false negatives), both overall and for the subset of CDS with 0–5x coverage, and uniqueness (true positives over true positives plus duplicates), computed on a per-CDS basis for the different analysis approaches.
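The three per-CDS metrics defined in this caption can be written out as a short Python sketch. The `CdsCallCounts` container and the function names are hypothetical, introduced here only for illustration; they are not taken from the paper's workflow.

```python
from dataclasses import dataclass


@dataclass
class CdsCallCounts:
    """Per-CDS tallies for one analysis approach (hypothetical container)."""
    true_positive: int
    false_positive: int
    false_negative: int
    duplicates: int


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when the denominator is zero,
    # e.g. for a tool that made no calls at all.
    return numerator / denominator if denominator else float("nan")


def ppv(c: CdsCallCounts) -> float:
    # PPV = TP / (TP + FP)
    return _ratio(c.true_positive, c.true_positive + c.false_positive)


def sensitivity(c: CdsCallCounts) -> float:
    # Sensitivity = TP / (TP + FN)
    return _ratio(c.true_positive, c.true_positive + c.false_negative)


def uniqueness(c: CdsCallCounts) -> float:
    # Uniqueness = TP / (TP + duplicates)
    return _ratio(c.true_positive, c.true_positive + c.duplicates)
```

For the low-coverage variant of sensitivity, the same function could be applied to counts tallied only over CDS with 0–5x coverage.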
Fig. 2
Sensitivity and uniqueness of CDS calls with respect to CDS coverage depth. Mapping-based approaches are more sensitive than assembly-based methods and reach a plateau of sensitivity at a lower coverage depth.
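One way to examine sensitivity as a function of coverage depth is to group the true CDS into depth bins and compute the detected fraction in each bin. The sketch below assumes a hypothetical input layout (a dict of true CDS identifiers to coverage depth, and a set of detected identifiers) and illustrative bin edges of 5x, 10x and 20x; none of these are specified in the source.

```python
from bisect import bisect_right
from collections import defaultdict


def depth_bin_label(depth, edges=(5.0, 10.0, 20.0)):
    """Label the half-open depth bin [lower, upper) that contains `depth`."""
    i = bisect_right(edges, depth)
    if i == 0:
        return f"<{edges[0]:g}x"
    if i == len(edges):
        return f">={edges[-1]:g}x"
    return f"{edges[i - 1]:g}-{edges[i]:g}x"


def sensitivity_by_depth(true_cds_depth, detected_cds, edges=(5.0, 10.0, 20.0)):
    """Fraction of true CDS detected within each coverage-depth bin.

    true_cds_depth: dict mapping true CDS id -> coverage depth (assumed layout)
    detected_cds:   set of CDS ids reported by the tool under test
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for cds_id, depth in true_cds_depth.items():
        label = depth_bin_label(depth, edges)
        total[label] += 1
        if cds_id in detected_cds:
            found[label] += 1
    # Bins that contain no true CDS are simply absent from the result.
    return {label: found[label] / total[label] for label in total}
```

A call might look like `sensitivity_by_depth({"cds_a": 1.0, "cds_b": 7.5}, {"cds_b"})`, where the identifiers are placeholders.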
Fig. 3
The problem of multiply-mapping short reads, and a schematic of the FAMLI algorithm. Three hundred and sixty simulated reads were generated from three CDS. These simulated reads were aligned against the UniRef100 database, and all CDS with an alignment within 10% identity of the best match were retained. a The read-depth coverage of the three true peptides (top). b Evenness filtering removes the references least likely to be present from consideration. The left column shows three randomly selected references that are successfully filtered at this step; the right column shows three false references that are not filtered. c The iterative likelihood-based filtering of one randomly selected synthetic read. Each circle represents one remaining aligned reference CDS for this read; the true positive origin reference is in dark green. The length of each line is proportional to the calculated score at this iteration. d The number of CDS per read as a violin plot. After the tenth iteration, only one reference CDS (the correct one) remains for this read.
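The iterative filtering in panel c can be illustrated with a minimal Python sketch. This is not the FAMLI implementation: the function name, the data layout, the pooled-weight scoring rule and the `keep_fraction` threshold are all assumptions chosen to show the general idea of repeatedly re-scoring each read's candidate references and pruning the unlikely ones.

```python
def prune_multimapped(alignments, keep_fraction=0.9, max_iterations=50):
    """Iteratively prune unlikely candidate references for multi-mapped reads.

    alignments: dict mapping read id -> dict of reference id -> alignment score.
    Assumes every read has at least one candidate and all scores are positive.
    Returns a new dict with the surviving candidates per read.
    """
    # Work on a copy so the caller's data is left untouched.
    candidates = {read: dict(refs) for read, refs in alignments.items()}

    for _ in range(max_iterations):
        # Step 1: each read spreads one unit of weight across its remaining
        # candidates in proportion to alignment score; pool per reference.
        ref_weight = {}
        for refs in candidates.values():
            total = sum(refs.values())
            for ref, score in refs.items():
                ref_weight[ref] = ref_weight.get(ref, 0.0) + score / total

        # Step 2: re-score each candidate by its reference's pooled weight and
        # drop candidates scoring below keep_fraction of the read's best.
        changed = False
        for read, refs in list(candidates.items()):
            rescored = {ref: score * ref_weight[ref] for ref, score in refs.items()}
            best = max(rescored.values())
            kept = {ref: refs[ref] for ref in refs
                    if rescored[ref] >= keep_fraction * best}
            if len(kept) < len(refs):
                changed = True
            candidates[read] = kept

        # Stop once an iteration removes nothing.
        if not changed:
            break

    return candidates
```

In this sketch, a read that aligns equally well to two references is assigned to the one supported by more of the other reads, while exact ties are left unresolved.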
