ISME J. 2013 Jun;7(6):1092-101.
doi: 10.1038/ismej.2013.10. Epub 2013 Feb 14.

Robust estimation of microbial diversity in theory and in practice

Bart Haegeman et al. ISME J. 2013 Jun.

Abstract

Quantifying diversity is of central importance for the study of structure, function and evolution of microbial communities. The estimation of microbial diversity has received renewed attention with the advent of large-scale metagenomic studies. Here, we consider what the diversity observed in a sample tells us about the diversity of the community being sampled. First, we argue that one cannot reliably estimate the absolute and relative number of microbial species present in a community without making unsupported assumptions about species abundance distributions. The reason for this is that sample data do not contain information about the number of rare species in the tail of species abundance distributions. We illustrate the difficulty in comparing species richness estimates by applying Chao's estimator of species richness to a set of in silico communities: they are ranked incorrectly in the presence of large numbers of rare species. Next, we extend our analysis to a general family of diversity metrics ('Hill diversities'), and construct lower and upper estimates of diversity values consistent with the sample data. The theory generalizes Chao's estimator, which we retrieve as the lower estimate of species richness. We show that Shannon and Simpson diversity can be robustly estimated for the in silico communities. We analyze nine metagenomic data sets from a wide range of environments, and show that our findings are relevant for empirically-sampled communities. Hence, we recommend the use of Shannon and Simpson diversity rather than species richness in efforts to quantify and compare microbial diversity.
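
The Hill diversities named in the abstract can be illustrated with a minimal sketch. It assumes the standard definition D_α = (Σ p_i^α)^(1/(1−α)), where p_i is the relative abundance of species i, with the α→1 limit equal to the exponential of the Shannon entropy. The function name `hill_diversity` and the example abundances are illustrative and do not come from the paper.

```python
import math

def hill_diversity(abundances, alpha):
    """Hill diversity of order alpha for a list of non-negative abundances.

    Assumes the standard definition; alpha = 0 gives species richness,
    alpha = 1 the exponential of Shannon entropy, alpha = 2 the inverse
    Simpson concentration.
    """
    total = sum(abundances)
    # Relative abundances of the species that are actually present.
    p = [n / total for n in abundances if n > 0]
    if alpha == 1:
        # Limit alpha -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** alpha for pi in p) ** (1.0 / (1.0 - alpha))

# Hypothetical abundance vector, for illustration only.
community = [50, 30, 15, 4, 1]
richness = hill_diversity(community, 0)   # counts every species equally
shannon = hill_diversity(community, 1)    # weights species by abundance
simpson = hill_diversity(community, 2)    # dominated by common species
```

Increasing α shifts weight from rare to common species, which is consistent with the abstract's point that orders α=1 and α=2 depend less on the unobserved rare tail than richness does.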

Figures

Figure 1
Empirical sample data are consistent with very different communities. We consider the abundance data of a sample taken from a bacterial soil community (sample 'Brazil' in Roesch et al., 2007). The sample consists of 26079 individuals belonging to 2880 species. We tried to reconstruct the community from which the sample was taken. Panels a–c show the rank-abundance curves of three such reconstructed communities. The first community (panel a, in red) has 10^4 species; the second community (panel b, in blue) has 10^5 species; the third community (panel c, in green) has 10^6 species. For each of the three reconstructions, the community rank-abundance curve is an extension of the sample rank-abundance curve (in black). We claim that each of the three reconstructed communities is compatible with the sample data. This can be seen from the rarefaction curves in panel d: the rarefaction curve for the sample data (black line) coincides with the rarefaction curves for the reconstructed communities (red line with squares for the community in panel a, blue line with x-marks for the community in panel b, and green line with diamonds for the community in panel c). Because the sample data are consistent with very different values of the community richness, the community richness cannot be estimated from the sample data. The colour reproduction of this figure is available on the ISME Journal online.
Figure 2
Estimated species richness does not rank communities correctly. We generated three community abundance distributions, the rank-abundance curves of which are shown in panel a. Community C1 (red) has the smallest number of species; community C3 (green) has the largest number of species. The rarefaction curves of the three communities up to sample size 2×10^4 are shown in panel b. Based on the rarefaction data, one would conclude that community C1 is the most diverse and community C3 the least diverse. Hence, the ranking of the communities according to their observed diversity is inverted compared to the ranking according to their true diversity. This observation is confirmed when applying Chao's estimator to sample data. Community C1 is estimated to have 10 times more species than community C3, whereas in reality community C1 has 20 times fewer species than community C3. See Supplementary Table S1 for the numerical data of the communities. The colour reproduction of this figure is available on the ISME Journal online.
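
For readers unfamiliar with Chao's estimator, here is a minimal sketch of the widely used abundance-based form often called Chao1: S_obs + f1^2 / (2 f2), where f1 and f2 are the numbers of species observed exactly once and exactly twice. Whether this matches the exact variant applied in the paper is an assumption. The fallback for f2 = 0 is the common bias-corrected form, and the names `chao1` and `sample` are illustrative.

```python
from collections import Counter

def chao1(abundances):
    """Chao1 lower-bound estimate of species richness from sample counts."""
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)                  # species observed in the sample
    freq = Counter(counts)               # how many species have each count
    f1, f2 = freq[1], freq[2]            # singletons and doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2.0 * f2)
    # Assumption: bias-corrected form used when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2.0

# Hypothetical sample counts, for illustration only.
sample = [10, 5, 2, 2, 1, 1, 1, 1]
estimate = chao1(sample)
```

The estimate depends only on the observed count and the two rarest frequency classes, so two communities with very different unobserved tails can yield similar or even inverted estimates, as the caption describes.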
Figure 3
Extrapolating the rarefaction curve. The lower and upper Hill diversity estimators are based on reconstructions of the rarefaction curve S_m from sample data. For a sample of size M, the rarefaction curve S_m for m ≤ M can be estimated by subsampling (red full line). If the sample size M is large, the estimator has small uncertainty. The rarefaction curve S_m for m > M can be estimated by extrapolating the sample data beyond the sample size M. Different extrapolation scenarios are compatible with the sample data. We consider two extreme scenarios (dashed lines). A lower estimate is obtained by assuming that unobserved species are approximately as rare as the rarest observed species. An upper estimate is obtained by assuming that unobserved species are represented in the community by one individual. The difference between the two extremes quantifies the uncertainty of the extrapolation, shown as the shaded region. The uncertainty increases rapidly for m >> M.
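
The within-sample part of the rarefaction curve can be sketched as follows. Instead of drawing random subsamples, this sketch uses the closed-form expectation for sampling m individuals without replacement, S_m = Σ_i [1 − C(M − n_i, m) / C(M, m)], where n_i is the sample count of species i. Using the analytic expectation in place of repeated subsampling is a choice made here, not necessarily the paper's procedure, and the name `rarefaction` is illustrative. The sketch covers only m up to M and does not implement the extrapolation scenarios.

```python
from math import comb

def rarefaction(abundances, m):
    """Expected number of species in a random subsample of m individuals.

    Subsampling is without replacement from a sample with the given
    per-species counts; valid for 0 <= m <= M, where M is the sample size.
    """
    counts = [n for n in abundances if n > 0]
    M = sum(counts)
    if not 0 <= m <= M:
        raise ValueError("m must lie between 0 and the sample size M")
    total = comb(M, m)                   # number of possible subsamples
    # comb(M - n, m) counts the subsamples that miss a species with count n.
    return sum(1.0 - comb(M - n, m) / total for n in counts)

# Hypothetical sample counts, for illustration only.
sample = [10, 5, 2, 2, 1, 1, 1, 1]
curve = [rarefaction(sample, m) for m in range(sum(sample) + 1)]
```

The curve starts at 0 species for m = 0 and reaches the observed richness at m = M; everything beyond that point requires assumptions about the unobserved species.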
Figure 4
Estimated Hill diversities for in silico communities. We generated samples from a community with power-law abundance distribution (S=10^6, z=2) and evaluated the lower and upper estimators for the Hill diversity D_α. We consider three sample sizes M (in columns: M=10^2, 10^4, 10^6) and three community sizes N (in rows: N=10^10, 10^15, 10^20). The shaded range between the lower and upper estimates indicates the estimation uncertainty. The true Hill diversity D_α of the community is plotted in black. The Hill diversities between α=1 (Shannon) and α=2 (Simpson) are correctly estimated even for small sample size M. The estimates of Hill diversities for α<1, including α=0 (species richness), are characterized by large uncertainty.
Figure 5
Estimated Hill diversities for natural microbial communities. We observe the same behavior as for the in silico generated data sets of Figure 4: for α⩾1 the Hill diversity D_α can be estimated accurately; for α<1 the estimation of the Hill diversity D_α has large uncertainty. We used the same data sets as Quince et al. (2008): a seawater bacterial sample from the upper ocean (Rusch et al., 2007), soil bacterial samples at four locations: Brazil, Florida, Illinois and Canada (Roesch et al., 2007), and seawater samples from deep-sea vents at two locations: FS312 and FS396, separated into bacteria and archaea (Huber et al., 2007). The community size was set to N=10^15 for illustration; results are robust to changes in community size (see Supplementary Figure S4).

References

    1. Bent SJ, Forney LJ. The tragedy of the uncommon: understanding limitations in the analysis of microbial diversity. ISME J. 2008;2:689–695.
    2. Bohannan BJM, Hughes JB. New approaches to analyzing microbial biodiversity data. Curr Opin Microbiol. 2003;6:282–287.
    3. Brose U, Martinez ND, Williams RJ. Estimating species richness: sensitivity to sample coverage and insensitivity to spatial patterns. Ecology. 2003;84:2364–2377.
    4. Bunge J. Statistical estimation of uncultivated microbial diversity. In: Epstein SS (ed). Uncultivated Microorganisms. Springer-Verlag; 2009. p. 1–18.
    5. Bunge J, Fitzpatrick M. Estimating the number of species: a review. J Amer Statist Assoc. 1993;88:364–373.
