PLoS Comput Biol. 2019 Mar 29;15(3):e1006898.
doi: 10.1371/journal.pcbi.1006898. eCollection 2019 Mar.

Bayes-optimal estimation of overlap between populations of fixed size

Daniel B Larremore. PLoS Comput Biol. 2019.

Abstract

Measuring the overlap between two populations is, in principle, straightforward. Upon fully sampling both populations, the number of shared objects (species, taxonomical units, or gene variants, depending on the context) can be directly counted. In practice, however, only a fraction of each population's objects are likely to be sampled due to stochastic data collection or sequencing techniques. Although methods exist for quantifying population overlap under subsampled conditions, their bias is well documented and the uncertainty of their estimates cannot be quantified. Here we derive and validate a method to rigorously estimate the population overlap from incomplete samples when the total number of objects, species, or genes in each population is known, a special case of the more general β-diversity problem that is particularly relevant in the ecology and genomic epidemiology of malaria. By solving a Bayesian inference problem, this method takes into account the rates of subsampling and produces unbiased and Bayes-optimal estimates of overlap. In addition, it provides a natural framework for computing the uncertainty of its estimates, and can be used prospectively in study planning by quantifying the tradeoff between sampling effort and uncertainty.

Conflict of interest statement

The author has declared that no competing interests exist.

Figures

Fig 1. Stochastic sampling leads to variation in observed overlap.
The members of two hypothetical populations are represented by blue and green circles, respectively. Each population has 16 members, and s = 5 are shared members of both populations. In two independent sampling experiments, shown in top and bottom rows, n_a = n_b = 8 members are sampled at random from each population (dark circles) while the other 8 members are not sampled (transparent circles). Observation of the first experiment finds an overlap of n_ab = 4, while observation of the second finds n_ab = 0.
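The sampling experiment in this caption can be mimicked with a short simulation. The following is a minimal sketch, assuming uniform sampling without replacement; the function name `observed_overlap` and the integer labeling of members are illustrative choices, not taken from the paper.

```python
import random

def observed_overlap(pop_size=16, shared=5, n_a=8, n_b=8, rng=random):
    """Sample both populations once and count the members seen in both samples."""
    # Hypothetical labeling: ids 0..shared-1 belong to both populations.
    pop_a = list(range(pop_size))
    pop_b = list(range(shared)) + list(range(pop_size, 2 * pop_size - shared))
    sample_a = set(rng.sample(pop_a, n_a))
    sample_b = set(rng.sample(pop_b, n_b))
    return len(sample_a & sample_b)

rng = random.Random(0)  # fixed seed so the run is repeatable
overlaps = [observed_overlap(rng=rng) for _ in range(20000)]
mean_overlap = sum(overlaps) / len(overlaps)
print("smallest and largest observed overlap:", min(overlaps), max(overlaps))
print("mean observed overlap:", round(mean_overlap, 2))
```

Under these assumptions each shared member lands in both samples with probability (8/16)(8/16), so the mean observed overlap should be close to 5 × 0.25 = 1.25, well below the true s = 5, while individual experiments scatter around that value.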
Fig 2. Inference and uncertainty using the posterior.
The posterior distribution over s is plotted for the realistic scenario of n_a = 47, n_b = 32, and n_ab = 20 [line; Eq (6)]. The posterior mean provides our estimate of the true overlap, ŝ [open circle; Eq (7)], and the interval accounting for at least 90% of the area under the posterior curve provides an equal-tailed 90% credible interval [shading; Eq (8)]. The S˚ estimate is shown for comparison [black cross; Eq (1)], and is typically less than or equal to ŝ.
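A posterior of this kind can be sketched numerically. The code below is not a transcription of Eqs (6) to (8), which are not reproduced on this page; it is a minimal sketch under stated assumptions: both populations have a known size of 60 (inferred from the [0, 60] scale in the captions), each sample is drawn uniformly without replacement, and the prior over s is uniform. All function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(k marked items) when drawing n of N items without replacement, K of them marked."""
    if k < 0 or k > min(K, n) or n - k > N - K:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def likelihood(n_ab, s, n_a, n_b, N_a, N_b):
    """P(n_ab | s), summing over k, the number of shared members caught in sample a."""
    return sum(
        hypergeom_pmf(k, N_a, s, n_a) * hypergeom_pmf(n_ab, N_b, k, n_b)
        for k in range(n_ab, min(s, n_a) + 1)
    )

def posterior(n_ab, n_a, n_b, N_a=60, N_b=60):
    """Posterior over s = 0..min(N_a, N_b) under a uniform prior (an assumption)."""
    weights = [likelihood(n_ab, s, n_a, n_b, N_a, N_b)
               for s in range(min(N_a, N_b) + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def summarize(post, mass=0.90):
    """Posterior mean and an equal-tailed credible interval."""
    mean = sum(s * p for s, p in enumerate(post))
    tail = (1 - mass) / 2
    cdf, lo, hi = 0.0, None, None
    for s, p in enumerate(post):
        cdf += p
        if lo is None and cdf >= tail:
            lo = s
        if hi is None and cdf >= 1 - tail:
            hi = s
    return mean, (lo, hi)

post = posterior(n_ab=20, n_a=47, n_b=32)
mean, (lo, hi) = summarize(post)
print(f"posterior mean {mean:.1f}, 90% interval [{lo}, {hi}]")
```

The two-stage likelihood first asks how many of the s shared members appear in sample a, then how many of those are also drawn into sample b. Values of s below the observed n_ab receive zero posterior mass, since at least n_ab members must be shared.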
Fig 3. Bayesian repertoire overlap consistently estimates true overlap.
Repertoires with true overlaps ranging from 0 to 60 were subsampled in simulations. As sampling rates increase from n_a = n_b = 30 (left) to 40 (middle) and to 50 (right), the estimates of BRO (colored circles) approach the true values (dotted lines) symmetrically. Estimates from S˚ (crosses) approach the true values from below, systematically underestimating the true overlap. This bias is worse with lower sampling rates [7]. Similar results are found when n_a ≠ n_b, and when the total repertoire sizes are different from each other (S1 Fig).
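The downward bias of an uncorrected estimate can be illustrated with a small simulation. Eq (1) is not reproduced on this page, so the sketch below assumes that S˚ is the Sørensen-Dice (pairwise type sharing) coefficient 2·n_ab/(n_a + n_b), rescaled to [0, 60] as in the Fig 5 caption; that identification, the repertoire size N = 60, and the function name `naive_overlap` are all assumptions.

```python
import random

def naive_overlap(s, n, N=60, rng=random):
    """One subsampling experiment; returns the rescaled Sørensen-Dice overlap (assumed form of Eq (1))."""
    # Hypothetical labeling: ids 0..s-1 are shared by both repertoires.
    rep_a = list(range(N))
    rep_b = list(range(s)) + list(range(N, 2 * N - s))
    sample_a = set(rng.sample(rep_a, n))
    sample_b = set(rng.sample(rep_b, n))
    n_ab = len(sample_a & sample_b)
    return N * 2 * n_ab / (n + n)

rng = random.Random(1)  # fixed seed so the run is repeatable
true_s = 40
for n in (30, 40, 50):
    estimates = [naive_overlap(true_s, n, rng=rng) for _ in range(5000)]
    mean_est = sum(estimates) / len(estimates)
    print(f"n = {n}: mean naive estimate {mean_est:.1f}, theory s*n/N = {true_s * n / 60:.1f}")
```

Under these assumptions the expected observed overlap is s·n²/N², so the rescaled coefficient averages s·n/N. That value sits below s whenever n < N and approaches s as the sampling rate grows, which matches the pattern the caption describes.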
Fig 4. Credible intervals quantify uncertainty in overlap estimates.
Using Eq (8), 90% credible intervals are shown above as error bars around the point estimates ŝ for varying true overlap s. As the sampling rate increases from n_a = n_b = 30 (left) to 40 (middle) and to 50 (right), credible intervals shrink, indicating a reduction in uncertainty. In expectation, 90% of intervals cover the true overlap (dotted line).
Fig 5. Reevaluation of published results.
In 2010, Albrecht et al. compared var repertoires from 5 populations using pairwise type sharing (see Refs. [18, 19, 27] for original data details). (left) Reproduction of S˚ analysis of [19], rescaled from [0, 1] → [0, 60]. (middle) Reanalysis using Bayesian repertoire overlap [Eq (7)]. For all boxplots, boxes span inner quartiles; center lines show medians; whiskers extend to 2.5 and 97.5 percentiles. (right) Histograms of Bayesian repertoire overlap distributions from Amele and Ariquemes clones (data identical to those in middle boxplots) colored by width of credible interval [Eq (8)], a measure of uncertainty. Differences in uncertainties are driven primarily by sampling rates: Amele samples average n̄ = 15.6 sequences per parasite while Ariquemes clones average n̄ = 26.5.
Fig 6. Quantifying the decrease in uncertainty from increased sequencing.
Histograms show distributions of overlap estimates ŝ, computed using Eq (11), for various values of s, which are indicated by color-matched dotted lines. While all estimates are distributed around the true values of s, increasing the number of colonies c from 48 (top) to 96 (middle) and to 144 (bottom) substantially decreases the error of estimates. For example, the bottom plot shows that successfully sequencing c = 144 colonies from each parasite is guaranteed to produce estimates ŝ that are off by at most 5 (8.3%) in either direction of the true s.

References

    1. Whittaker RH. Vegetation of the Siskiyou mountains, Oregon and California. Ecological Monographs. 1960;30(3):279–338. doi: 10.2307/1943563
    2. Koleff P, Gaston KJ, Lennon JJ. Measuring beta diversity for presence–absence data. Journal of Animal Ecology. 2003;72(3):367–382. doi: 10.1046/j.1365-2656.2003.00710.x
    3. Barwell LJ, Isaac NJ, Kunin WE. Measuring β-diversity with species abundance data. Journal of Animal Ecology. 2015;84(4):1112–1122. doi: 10.1111/1365-2656.12362
    4. Rao CR. Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology. 1982;21(1):24–43. doi: 10.1016/0040-5809(82)90004-1
    5. Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology. 2005;71(12):8228–8235. doi: 10.1128/AEM.71.12.8228-8235.2005

Publication types