Estimating DNA coverage and abundance in metagenomes using a gamma approximation

Sean D Hooper¹, Daniel Dalevi, Amrita Pati, Konstantinos Mavromatis, Natalia N Ivanova, Nikos C Kyrpides

PMID: 20008478
PMCID: PMC2815663
DOI: 10.1093/bioinformatics/btp687

Sean D Hooper et al. Bioinformatics. 2010.

Bioinformatics. 2010 Feb 1;26(3):295-301.

doi: 10.1093/bioinformatics/btp687. Epub 2009 Dec 14.

Affiliation

¹ Department of Energy Joint Genome Institute (DOE-JGI), Genome Biology Program, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA. sean.d.hooper@genpat.uu.se

Abstract

Motivation: Shotgun sequencing generates large numbers of short DNA reads from either an isolated organism or, in the case of metagenomics projects, from the aggregate genome of a microbial community. These reads are then assembled based on overlapping sequences into larger, contiguous sequences (contigs). The feasibility of assembly and the coverage achieved (reads per nucleotide or distinct sequence of nucleotides) depend on several factors: the number of reads sequenced, the read length and the relative abundances of their source genomes in the microbial community. A low coverage suggests that most of the genomic DNA in the sample has not been sequenced, but it is often difficult to estimate either the extent of the uncaptured diversity or the amount of additional sequencing that would be most efficacious. In this work, we regard a metagenome as a population of DNA fragments (bins), each of which may be covered by one or more reads. We employ a gamma distribution to model this bin population due to its flexibility and ease of use. When a gamma approximation can be found that adequately fits the data, we may estimate the number of bins that were not sequenced and that could potentially be revealed by additional sequencing. We evaluated the performance of this model using simulated metagenomes and demonstrate its applicability on three recent metagenomic datasets.
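To make the idea of estimating unsequenced bins concrete, the following is a minimal sketch and not the authors' fitting procedure. It assumes that reads per bin follow a negative binomial distribution (a Poisson count whose rate is gamma distributed), that bins with zero reads are unobservable, and that a coarse grid search over a zero-truncated likelihood is an acceptable stand-in for a proper optimizer. All function names, the parameter grid, and the synthetic spectrum are hypothetical.

```python
import math


def negbin_log_pmf(k, shape, scale):
    """Log-probability of k reads in a bin when lambda ~ Gamma(shape, scale)."""
    p = 1.0 / (1.0 + scale)
    return (math.lgamma(k + shape) - math.lgamma(k + 1) - math.lgamma(shape)
            + shape * math.log(p) + k * math.log(1.0 - p))


def truncated_log_likelihood(spectrum, shape, scale):
    """Log-likelihood of a bin spectrum {reads_per_bin: number_of_bins}.

    Only bins with at least one read are observed, so every probability is
    renormalised by 1 - P(0), i.e. a zero-truncated negative binomial.
    """
    p0 = math.exp(negbin_log_pmf(0, shape, scale))
    log_norm = math.log(1.0 - p0)
    return sum(n * (negbin_log_pmf(k, shape, scale) - log_norm)
               for k, n in spectrum.items())


def fit_by_grid(spectrum, shapes, scales):
    """Return the (shape, scale) pair on the grid with the highest likelihood."""
    candidates = [(a, s) for a in shapes for s in scales]
    return max(candidates,
               key=lambda c: truncated_log_likelihood(spectrum, c[0], c[1]))


def estimate_unseen_bins(spectrum, shape, scale):
    """Expected number of bins with zero reads, given the fitted parameters."""
    p0 = math.exp(negbin_log_pmf(0, shape, scale))
    observed_bins = sum(spectrum.values())
    return observed_bins * p0 / (1.0 - p0)


# Synthetic spectrum (not real data): expected bin counts for 10,000 bins
# generated from known parameters, with the zero class deliberately left out.
TRUE_SHAPE, TRUE_SCALE, TOTAL_BINS = 2.0, 1.5, 10_000
spectrum = {k: TOTAL_BINS * math.exp(negbin_log_pmf(k, TRUE_SHAPE, TRUE_SCALE))
            for k in range(1, 61)}

shape, scale = fit_by_grid(spectrum,
                           shapes=[0.5, 1.0, 1.5, 2.0, 2.5, 3.0],
                           scales=[0.5, 1.0, 1.5, 2.0, 2.5])
unseen = estimate_unseen_bins(spectrum, shape, scale)
print(shape, scale, round(unseen))
```

Because the synthetic spectrum is generated from the model itself, the grid search recovers the generating parameters and the hidden zero class comes out at about 1600 bins, which is 16% of the 10,000 simulated bins. With real data the fit can be poor, in which case the estimate of unseen bins should not be trusted.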

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
The effect of community complexity on the expected number of reads per bin (λ_i) for each bin i. In a simple community with a single genome, λ is approximately equal for all bins. Here, the observed bin spectrum (number of reads per bin) follows a Poisson distribution. However, in metagenomic samples from complex communities, bins will be drawn from different genomes that are present in varying abundances. Therefore, the value of λ is not the same for all bins. If λ follows a gamma distribution, then the bin spectrum will follow a negative binomial distribution and can be modeled.
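The relationship stated in this caption can be checked numerically. The sketch below is illustrative only: it integrates a Poisson probability over a gamma-distributed λ with a simple midpoint rule and compares the result with the closed-form negative binomial probability. The shape and scale values, the integration limits, and the function names are assumptions chosen for the example.

```python
import math


def gamma_pdf(lam, shape, scale):
    """Density of a gamma distribution with the given shape and scale."""
    log_pdf = ((shape - 1.0) * math.log(lam) - lam / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def poisson_pmf(k, lam):
    """Probability of observing k reads when the expected number is lam."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def mixture_pmf(k, shape, scale, upper=200.0, steps=20_000):
    """Midpoint-rule integral of Poisson(k | lam) * Gamma(lam) over lam."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h  # midpoints avoid evaluating log(0)
        total += poisson_pmf(k, lam) * gamma_pdf(lam, shape, scale)
    return total * h


def negbin_pmf(k, shape, scale):
    """Closed-form negative binomial with r = shape and p = 1 / (1 + scale)."""
    p = 1.0 / (1.0 + scale)
    log_pmf = (math.lgamma(k + shape) - math.lgamma(k + 1) - math.lgamma(shape)
               + shape * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


# Compare the two routes for a few read counts (illustrative parameters).
for k in range(6):
    print(k, mixture_pmf(k, 2.0, 1.5), negbin_pmf(k, 2.0, 1.5))
```

The two columns agree to within the accuracy of the numerical integration, which is why a gamma-distributed λ lets the whole bin spectrum be described by two parameters.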

**Fig. 2.**
(a) Blue curve: estimated log bin spectrum for the Lake Washington formate dataset. Red stars: the log number of observed bins. Note that the observed value at zero reads per contig is zero. The χ²-score for this fit is 3.7; we cannot reject the assumption that the bin abundance is gamma-like. (b) Blue curve: observed and estimated log bin abundance distribution for the termite hindgut dataset. The χ²-value is 1.0.
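The record does not say exactly how the χ²-score is computed. As a hedged illustration, the sketch below uses the standard Pearson statistic, which sums the squared difference between observed and expected bin counts, scaled by the expected count. The function name and the example counts are hypothetical and are not taken from either dataset.

```python
def pearson_chi_square(observed, expected):
    """Pearson chi-square statistic for paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up counts of bins with 1, 2 and 3 reads (illustration only).
observed_counts = [50, 30, 20]
expected_counts = [45, 35, 20]
statistic = pearson_chi_square(observed_counts, expected_counts)
print(statistic)
```

A small statistic relative to the critical value for the appropriate degrees of freedom (commonly the number of classes minus one minus the number of fitted parameters) means the fitted model cannot be rejected.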
