Quant Biol. 2015 Sep;3(3):135-144.
doi: 10.1007/s40484-015-0049-7. Epub 2015 Oct 17.

Applications of species accumulation curves in large-scale biological data analysis

Chao Deng et al. Quant Biol. 2015 Sep.

Abstract

The species accumulation curve, or collector's curve, of a population gives the expected number of observed species or distinct classes as a function of sampling effort. Species accumulation curves allow researchers to assess and compare diversity across populations or to evaluate the benefits of additional sampling. Traditional applications have focused on ecological populations but emerging large-scale applications, for example in DNA sequencing, are orders of magnitude larger and present new challenges. We developed a method to estimate accumulation curves for predicting the complexity of DNA sequencing libraries. This method uses rational function approximations to a classical non-parametric empirical Bayes estimator due to Good and Toulmin [Biometrika, 1956, 43, 45-63]. Here we demonstrate how the same approach can be highly effective in other large-scale applications involving biological data sets. These include estimating microbial species richness, immune repertoire size, and k-mer diversity for genome assembly applications. We show how the method can be modified to address populations containing an effectively infinite number of species where saturation cannot practically be attained. We also introduce a flexible suite of tools implemented as an R package that make these methods broadly accessible.

Keywords: accumulation region; immune repertoire; microbiome diversity; rational function approximation; species accumulation curve; species richness.
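
To make the abstract's central object concrete, the following is a minimal sketch of the classical Good-Toulmin power series, which estimates how many new species would appear if sampling effort were extended by a fraction t of the initial sample. It is not the preseqR implementation: the function names `counts_histogram` and `good_toulmin` and the toy sample are assumptions chosen for illustration. The raw series is generally trusted only for t up to about 1, which is why the method described above replaces it with rational function approximations for longer extrapolations.

```python
from collections import Counter


def counts_histogram(observations):
    """Return {j: n_j}, where n_j is the number of species seen exactly j times."""
    per_species = Counter(observations)          # species -> times observed
    return dict(Counter(per_species.values()))   # j -> number of species seen j times


def good_toulmin(hist, t):
    """Expected number of new species in an extra sample of relative size t.

    Implements sum over j >= 1 of (-1)^(j+1) * t^j * n_j.
    Hypothetical helper for illustration; the series is unstable for t > 1.
    """
    return sum((-1) ** (j + 1) * t ** j * n_j for j, n_j in hist.items())


# Toy sample (made up): species "a" seen 3 times, "b" twice, "c", "d", "e" once each.
sample = ["a", "a", "a", "b", "b", "c", "d", "e"]
hist = counts_histogram(sample)
observed = len(set(sample))

# Predicted total distinct species after doubling the sampling effort (t = 1).
predicted_total = observed + good_toulmin(hist, 1.0)
print(hist, observed, predicted_total)
```

With this toy sample the histogram has three singletons, one doubleton, and one tripleton, so the alternating sum at t = 1 is 3 - 1 + 1 = 3 expected new species.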

Conflict of interest statement

The authors Chao Deng, Timothy Daley and Andrew D Smith declare they have no conflict of interest.

Figures

Figure 1. Predicting the number of unique words as a function of the size of the sample
The observed curve is the accumulation curve of the total word counts of Shakespeare’s known works compared to the predicted curve (A) when the size of the initial sample is 5% of the total words from Shakespeare’s known works; (B) when the size of the initial sample is 10%; and (C) when the size of the initial sample is 100% and comparing the RFA-GT lower bound to the lower bound of Efron and Thisted (E & T).
Figure 2. Annotated species as a function of the sample abundance using (A) a 5% subsample and (B) the full observed experiment
The x-axis is the sample abundance and the y-axis is the expected number of unique annotated species. Note that the original curve provided by MG-RAST uses the number of sequences as its x-axis. We convert it to the sample abundance by rescaling.
Figure 3. Age-related decrease in TCR repertoire
(A) Interpolation of the accumulation region of TCR β CDR3 clonotypes for each group. (B) Extrapolation of the accumulation region of TCR β CDR3 clonotypes for each group using the observed data in A. (C) Predicting the total number of unique TCR β CDR3 clonotypes as a function of the total number of TCR β cDNA molecules by combining all groups.
Figure 4. The number of distinct 31-mers as a function of sequenced 31-mers with extrapolations using 1% and 10% subsamples
(A) Extrapolations from the subsamples using default preseqR, with the rational function approximations to the Good-Toulmin power series behaving like a constant asymptotically. (B) Extrapolations from the subsample using rational function approximations to the Good-Toulmin power series behaving like a linear function asymptotically.
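
The two asymptotic behaviors named in the Figure 4 caption follow from the degrees of a rational function's numerator and denominator. The sketch below illustrates this general fact with made-up coefficients; it does not use coefficients fitted by preseqR, and the helper name `rational` is an assumption. When the degrees are equal the ratio levels off at a constant (a saturating curve), and when the numerator degree is one higher the ratio eventually grows linearly (a curve that never saturates).

```python
def rational(num, den, t):
    """Evaluate P(t) / Q(t); coefficients are listed from the constant term upward."""
    p = sum(c * t ** i for i, c in enumerate(num))
    q = sum(c * t ** i for i, c in enumerate(den))
    return p / q


# Illustrative coefficients only (not taken from the paper).
equal_degree = ([0.0, 2.0], [1.0, 1.0])        # 2t / (1 + t): tends to the constant 2
one_higher = ([0.0, 2.0, 1.0], [1.0, 1.0])     # (2t + t^2) / (1 + t): grows like t

# Far along the curve, the first form is flat and the second has slope close to 1.
plateau = rational(*equal_degree, 1e6)
slope = (rational(*one_higher, 2e6) - rational(*one_higher, 1e6)) / 1e6
print(plateau, slope)
```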

References

    1. Magurran AE. Ecological Diversity and Its Measurement. Vol. 168. Princeton: Princeton University Press; 1988.
    2. Bunge J, Fitzpatrick M. Estimating the number of species: A review. J. Am. Stat. Assoc. 1993;88:364–373.
    3. Colwell RK, Mao CX, Chang J. Interpolating, extrapolating, and comparing incidence-based species accumulation curves. Ecology. 2004;85:2717–2727.
    4. Efron B, Thisted R. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika. 1976;63:435–447.
    5. Ionita-Laza I, Lange C, Laird NM. Estimating the number of unseen variants in the human genome. Proc. Natl. Acad. Sci. USA. 2009;106:5008–5013.
