Bayes-optimal estimation of overlap between populations of fixed size

Daniel B Larremore¹,²

Affiliations

¹ Department of Computer Science, University of Colorado Boulder, Boulder, Colorado, United States of America.
² BioFrontiers Institute, University of Colorado Boulder, Boulder, Colorado, United States of America.

PMID: 30925165
PMCID: PMC6440621
DOI: 10.1371/journal.pcbi.1006898

Daniel B Larremore. PLoS Comput Biol. 2019 Mar 29;15(3):e1006898. doi: 10.1371/journal.pcbi.1006898. eCollection 2019 Mar.

Abstract

Measuring the overlap between two populations is, in principle, straightforward. Upon fully sampling both populations, the number of shared objects (species, taxonomical units, or gene variants, depending on the context) can be directly counted. In practice, however, only a fraction of each population's objects are likely to be sampled due to stochastic data collection or sequencing techniques. Although methods exist for quantifying population overlap under subsampled conditions, their bias is well documented and the uncertainty of their estimates cannot be quantified. Here we derive and validate a method to rigorously estimate the population overlap from incomplete samples when the total number of objects, species, or genes in each population is known, a special case of the more general β-diversity problem that is particularly relevant in the ecology and genomic epidemiology of malaria. By solving a Bayesian inference problem, this method takes into account the rates of subsampling and produces unbiased and Bayes-optimal estimates of overlap. In addition, it provides a natural framework for computing the uncertainty of its estimates, and can be used prospectively in study planning by quantifying the tradeoff between sampling effort and uncertainty.

Conflict of interest statement

The author has declared that no competing interests exist.

Figures

**Fig 1. Stochastic sampling leads to variation in observed overlap.**
The members of two hypothetical populations are represented by blue and green circles, respectively. Each population has 16 members, and s = 5 are shared members of both populations. In two independent sampling experiments, shown in top and bottom rows, n_a = n_b = 8 members are sampled at random from each population (dark circles) while the other 8 members are not sampled (transparent circles). Observation of the first experiment finds an overlap of n_ab = 4, while observation of the second finds n_ab = 0.
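
The sampling experiment in this caption can be imitated with a short simulation. The sketch below is illustrative only: the function name `sample_overlap` and the labelling of members are assumptions introduced here, not part of the original work. It uses the caption's values (16 members per population, s = 5 shared, n_a = n_b = 8 sampled) and repeats the experiment many times to show how widely the observed overlap n_ab varies.

```python
import random
from collections import Counter

def sample_overlap(size_a, size_b, s, n_a, n_b, rng):
    """Run one sampling experiment and return the observed overlap n_ab.

    Hypothetical helper: shared members are labelled 0..s-1, and the
    remaining members of each population get population-specific labels.
    """
    pop_a = list(range(s)) + [("a", i) for i in range(size_a - s)]
    pop_b = list(range(s)) + [("b", i) for i in range(size_b - s)]
    seen_a = set(rng.sample(pop_a, n_a))  # sampling without replacement
    seen_b = set(rng.sample(pop_b, n_b))
    return len(seen_a & seen_b)

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_overlap(16, 16, 5, 8, 8, rng) for _ in range(10000)]

# Tabulate how often each observed overlap occurs.
for n_ab, count in sorted(Counter(draws).items()):
    print(n_ab, count / len(draws))
```

Each shared member is seen in both samples with probability (8/16) × (8/16), so the average observed overlap is 5 × 0.25 = 1.25, well below the true s = 5. Outcomes such as n_ab = 4 or n_ab = 0 are both possible from the same underlying populations.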

**Fig 2. Inference and uncertainty using the posterior.**
The posterior distribution over s is plotted for the realistic scenario of n_a = 47, n_b = 32, and n_ab = 20 [line; Eq (6)]. The posterior mean provides our estimate of the true overlap $\hat{s}$ [open circle; Eq (7)], and the interval accounting for at least 90% of the area under the posterior curve provides an equal-tailed 90% credible interval [shading; Eq (8)]. The $\mathring{S}$ estimate is shown for comparison [black cross; Eq (1)], and is typically less than or equal to $\hat{s}$.
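
A posterior of this kind can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not a reproduction of Eqs (6)–(8), which are not shown on this page: it assumes both populations have a known size of 60, a uniform prior on s, and sampling without replacement, so that the likelihood is a sum over two nested hypergeometric draws. All function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(k, total, successes, draws):
    """P(exactly k successes when drawing `draws` items without replacement)."""
    if k < 0 or k > successes or draws - k < 0 or draws - k > total - successes:
        return 0.0
    return comb(successes, k) * comb(total - successes, draws - k) / comb(total, draws)

def likelihood(n_ab, n_a, n_b, s, size_a, size_b):
    """P(n_ab | s): sum over k, the number of shared members that fall in sample A."""
    return sum(
        hypergeom_pmf(k, size_a, s, n_a) * hypergeom_pmf(n_ab, size_b, k, n_b)
        for k in range(n_ab, min(s, n_a) + 1)
    )

def posterior(n_ab, n_a, n_b, size_a=60, size_b=60):
    """Posterior over s = 0..min(size_a, size_b) under an assumed uniform prior."""
    s_max = min(size_a, size_b)
    weights = [likelihood(n_ab, n_a, n_b, s, size_a, size_b) for s in range(s_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def posterior_mean(post):
    return sum(s * p for s, p in enumerate(post))

def equal_tailed_interval(post, mass=0.90):
    """Smallest s whose CDF reaches each tail cutoff; covers at least `mass`."""
    tail = (1 - mass) / 2
    cdf, lower, upper = 0.0, None, len(post) - 1
    for s, p in enumerate(post):
        cdf += p
        if lower is None and cdf >= tail:
            lower = s
        if cdf >= 1 - tail:
            upper = s
            break
    return lower, upper

post = posterior(n_ab=20, n_a=47, n_b=32)  # scenario from the caption
print(posterior_mean(post), equal_tailed_interval(post))
```

The posterior is exactly zero for s < n_ab, since every member observed in both samples must be shared. The interval rule used here is one reasonable reading of "equal-tailed"; the paper's own definition may differ in how it handles the discrete endpoints.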

**Fig 3. Bayesian repertoire overlap consistently estimates true overlap.**
Repertoires with true overlaps ranging from 0 to 60 were subsampled in simulations. As sampling rates increase from n_a = n_b = 30 (left) to 40 (middle) and to 50 (right), the estimates of BRO (colored circles) approach the true values (dotted lines) symmetrically. Estimates from $\mathring{S}$ (crosses) approach the true values from below, systematically underestimating the true overlap. This bias is worse with lower sampling rates [7]. Similar results are found when n_a ≠ n_b, and when the total repertoire sizes are different from each other (S1 Fig).
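
The downward bias of the pairwise type sharing estimate can be checked with a short calculation. The sketch below assumes that Eq (1), which is not reproduced on this page, has the usual pairwise type sharing form 2·n_ab / (n_a + n_b), and that both repertoires have a known size of 60. It uses the fact that each shared member appears in both samples with probability (n_a/60)(n_b/60). The function names are hypothetical.

```python
def expected_observed_overlap(s, n_a, n_b, size_a=60, size_b=60):
    """E[n_ab]: each of the s shared members is seen in both samples
    with probability (n_a / size_a) * (n_b / size_b)."""
    return s * (n_a / size_a) * (n_b / size_b)

def expected_rescaled_pts(s, n_a, n_b, size_a=60, size_b=60):
    """Expected pairwise type sharing, rescaled from [0, 1] to repertoire units.

    Assumes the estimate has the form 2 * n_ab / (n_a + n_b).
    """
    scale = (size_a + size_b) / 2
    return scale * 2 * expected_observed_overlap(s, n_a, n_b, size_a, size_b) / (n_a + n_b)

true_s = 30
for n in (30, 40, 50):
    print(n, expected_rescaled_pts(true_s, n, n))
```

With equal sampling n_a = n_b = n from repertoires of size N, the expected rescaled value simplifies to s·n/N. It therefore sits below the true s whenever n < N and rises toward s as n increases, matching the approach from below described in the caption.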

**Fig 4. Credible intervals quantify uncertainty in overlap estimates.**
Using Eq (8), 90% credible intervals are shown as error bars around the point estimates $\hat{s}$ for varying true overlap s. As the sampling rate increases from n_a = n_b = 30 (left) to 40 (middle) and to 50 (right), credible intervals shrink, indicating a reduction in uncertainty. In expectation, 90% of intervals cover the true overlap (dotted line).

**Fig 5. Reevaluation of published results.**
In 2010, Albrecht et al. compared *var* repertoires from 5 populations using pairwise type sharing (see Refs. [18, 19, 27] for original data details). (left) Reproduction of $\mathring{S}$ analysis of [19], rescaled from [0, 1] to [0, 60]. (middle) Reanalysis using Bayesian repertoire overlap [Eq (7)]. For all boxplots, boxes span inner quartiles; center lines show medians; whiskers extend to 2.5 and 97.5 percentiles. (right) Histograms of Bayesian repertoire overlap distributions from Amele and Ariquemes clones (data identical to those in middle boxplots) colored by width of credible interval [Eq (8)], a measure of uncertainty. Differences in uncertainties are driven primarily by sampling rates: Amele samples average $\bar{n} = 15.6$ sequences per parasite while Ariquemes clones average $\bar{n} = 26.5$.

**Fig 6. Quantifying the decrease in uncertainty from increased sequencing.**
Histograms show distributions of overlap estimates $\hat{s}$, computed using Eq (11), for various values of s, which are indicated by color-matched dotted lines. While all estimates are distributed around the true values of s, increasing the number of colonies c from 48 (top) to 96 (middle) and to 144 (bottom) substantially decreases the error of estimates. For example, the bottom plot shows that successfully sequencing c = 144 colonies from each parasite is guaranteed to produce estimates $\hat{s}$ that are off by at most 5 (8.3%) in either direction of the true s.

References

1. Whittaker RH. Vegetation of the Siskiyou mountains, Oregon and California. Ecological Monographs. 1960;30(3):279–338. doi: 10.2307/1943563
2. Koleff P, Gaston KJ, Lennon JJ. Measuring beta diversity for presence–absence data. Journal of Animal Ecology. 2003;72(3):367–382. doi: 10.1046/j.1365-2656.2003.00710.x
3. Barwell LJ, Isaac NJ, Kunin WE. Measuring β-diversity with species abundance data. Journal of Animal Ecology. 2015;84(4):1112–1122. doi: 10.1111/1365-2656.12362
4. Rao CR. Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology. 1982;21(1):24–43. doi: 10.1016/0040-5809(82)90004-1
5. Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology. 2005;71(12):8228–8235. doi: 10.1128/AEM.71.12.8228-8235.2005
