A Faster and More Accurate Algorithm for Calculating Population Genetics Statistics Requiring Sums of Stirling Numbers of the First Kind

Swaine L Chen¹, Nico M Temme²

Affiliations

¹ Infectious Diseases Programme, Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 119228, Singapore & Infectious Diseases Group, Genome Institute of Singapore, Singapore 138672 slchen@gis.a-star.edu.sg.
² Centrum Wiskunde & Informatica (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands.

PMID: 32900901
PMCID: PMC7642932
DOI: 10.1534/g3.120.401575

A Faster and More Accurate Algorithm for Calculating Population Genetics Statistics Requiring Sums of Stirling Numbers of the First Kind

Swaine L Chen et al. G3 (Bethesda). 2020.

. 2020 Nov 5;10(11):3959-3967.

doi: 10.1534/g3.120.401575.

Authors

Swaine L Chen¹, Nico M Temme²

Affiliations

¹ Infectious Diseases Programme, Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 119228, Singapore & Infectious Diseases Group, Genome Institute of Singapore, Singapore 138672 slchen@gis.a-star.edu.sg.
² Centrum Wiskunde & Informatica (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands.

PMID: 32900901
PMCID: PMC7642932
DOI: 10.1534/g3.120.401575

Abstract

Ewen's sampling formula is a foundational theoretical result that connects probability and number theory with molecular genetics and molecular evolution; it was the analytical result required for testing the neutral theory of evolution, and has since been directly or indirectly utilized in a number of population genetics statistics. Ewen's sampling formula, in turn, is deeply connected to Stirling numbers of the first kind. Here, we explore the cumulative distribution function of these Stirling numbers, which enables a single direct estimate of the sum, using representations in terms of the incomplete beta function. This estimator enables an improved method for calculating an asymptotic estimate for one useful statistic, Fu's [Formula: see text] By reducing the calculation from a sum of terms involving Stirling numbers to a single estimate, we simultaneously improve accuracy and dramatically increase speed.

Keywords: Asymptotic analysis; Cumulative distribution functions; Evolutionary inference; Numerical algorithms; Population genetics statistics; Stirling numbers of the first kind.

PubMed Disclaimer

Figures

**Figure 1**
Graphs of $φ (z) - φ (z_{0})$ (A) and $χ (t) - χ (t_{0})$ (B) for $n = 100$ , $m = 38$ , with $z_{0} ≐ 22.81$ and $t_{0} = \frac{19}{31} ≐ 0.61$ .

**Figure 2**
Mollified error in estimating Fu’s $F_{s}$ for $θ \in (10, 400)$ , $m = 75$ and $n = 100$ (A) and for $m = 275$ and $n = 500$ (B). The data for the dashed curves are multiplied by a factor of 10 (A) and 100 (B), to make the error curves visible in the figures. Refer to the text for further details.

**Figure 3**
(A) Comparison of relative error of the estimator from (Chen 2019) and the single term asymptotic estimator in (35). Relative error for each is calculated against the arbitrary precision implementation described in (Chen 2019). In total, 10,000 calculations were performed with n randomly sampled from a uniform distribution between 50 and 500; m between 2 and n; and θ between 1 and 50. A solid diagonal line is drawn at $y = x$ . Dotted lines are drawn at a relative error of 0.001. Numbers within each quadrant defined by the dotted lines indicate the number of points in each quadrant. The red dot indicates the one case where the relative error was $> 0.001$ and the error of (35) was greater than the estimator from (Chen 2019). (B) Comparison of mollified error (34) as a function of m. For this plot, we fixed $n = 100$ (solid lines) or 500 (dotted lines) and $θ \in (10, 500)$ (as indicated by different line colors).

**Figure 4**
(A) Comparison of run times between the hybrid algorithm from (Chen 2019) and the single term asymptotic estimator in (35). 100 iterations were run, each with 10,000 calculations; the time elapsed for each set of 10,000 calculations was recorded and plotted here. The same set of parameters were used for each algorithm. The order of running the algorithms was alternated with each iteration. The dark horizontal line indicates the median, the box indicates the first and third quartiles, the whiskers are drawn at 1.5x the interquartile range, and outliers are represented by open circles. The median for the hybrid algorithm is 62.64 s; the median for the asymptotic algorithm is 1.17 s. (B) Detailed benchmarking for $n = 100$ (open violins) or 500 (gray violins), $m \in (0.1 n, 0.2 n, \dots, 0.9 n)$ , and $θ \in (10, 500)$ . Fold speedup (ratio of the time taken for the hybrid calculator to that taken for the aysmptotic estimator) is plotted on the y-axis. Each dot represents one set of parameters; the violin plots summarize the density of points on the y-axis. Times were calculated for 100 iterations of each estimator for the same parameter values.

See this image and copyright information in PMC

References

1. Casillas S., and Barbadilla A., 2017. Molecular population genetics. Genetics 205: 1003–1035. 10.1534/genetics.116.196493 - DOI - PMC - PubMed
1. Chen H., Alvarez J. J. S., Ng S. H., Nielsen R., and Zhai W., 2019. Passage adaptation correlates with the reduced efficacy of the influenza vaccine. Clin. Infect. Dis. 69: 1198–1204. 10.1093/cid/ciy1065 - DOI - PubMed
1. Chen S. L., 2019. Implementation of a Stirling number estimator enables direct calculation of population genetics tests for large sequence datasets. Bioinformatics 35: 2668–2670. 10.1093/bioinformatics/bty1012 - DOI - PubMed
1. Crane H., 2016. Rejoinder: The ubiquitous Ewens sampling formula. Stat. Sci. 31: 37–39. 10.1214/15-STS544 - DOI
1. Ewens W. J., 1972. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87–112. 10.1016/0040-5809(72)90035-4 - DOI - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
- Europe PubMed Central
- PubMed Central

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

A Faster and More Accurate Algorithm for Calculating Population Genetics Statistics Requiring Sums of Stirling Numbers of the First Kind

Affiliations

A Faster and More Accurate Algorithm for Calculating Population Genetics Statistics Requiring Sums of Stirling Numbers of the First Kind

Authors

Affiliations

Abstract

Figures

Similar articles

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources