Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2020 Nov 5;10(11):3959-3967.
doi: 10.1534/g3.120.401575.

A Faster and More Accurate Algorithm for Calculating Population Genetics Statistics Requiring Sums of Stirling Numbers of the First Kind

Affiliations

A Faster and More Accurate Algorithm for Calculating Population Genetics Statistics Requiring Sums of Stirling Numbers of the First Kind

Swaine L Chen et al. G3 (Bethesda). .

Abstract

Ewen's sampling formula is a foundational theoretical result that connects probability and number theory with molecular genetics and molecular evolution; it was the analytical result required for testing the neutral theory of evolution, and has since been directly or indirectly utilized in a number of population genetics statistics. Ewen's sampling formula, in turn, is deeply connected to Stirling numbers of the first kind. Here, we explore the cumulative distribution function of these Stirling numbers, which enables a single direct estimate of the sum, using representations in terms of the incomplete beta function. This estimator enables an improved method for calculating an asymptotic estimate for one useful statistic, Fu's [Formula: see text] By reducing the calculation from a sum of terms involving Stirling numbers to a single estimate, we simultaneously improve accuracy and dramatically increase speed.

Keywords: Asymptotic analysis; Cumulative distribution functions; Evolutionary inference; Numerical algorithms; Population genetics statistics; Stirling numbers of the first kind.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Graphs of φ(z)φ(z0) (A) and χ(t)χ(t0) (B) for n=100, m=38, with z022.81 and t0=19310.61.
Figure 2
Figure 2
Mollified error in estimating Fu’s Fs for θ(10,400), m=75 and n=100 (A) and for m=275 and n=500 (B). The data for the dashed curves are multiplied by a factor of 10 (A) and 100 (B), to make the error curves visible in the figures. Refer to the text for further details.
Figure 3
Figure 3
(A) Comparison of relative error of the estimator from (Chen 2019) and the single term asymptotic estimator in (35). Relative error for each is calculated against the arbitrary precision implementation described in (Chen 2019). In total, 10,000 calculations were performed with n randomly sampled from a uniform distribution between 50 and 500; m between 2 and n; and θ between 1 and 50. A solid diagonal line is drawn at y=x. Dotted lines are drawn at a relative error of 0.001. Numbers within each quadrant defined by the dotted lines indicate the number of points in each quadrant. The red dot indicates the one case where the relative error was >0.001 and the error of (35) was greater than the estimator from (Chen 2019). (B) Comparison of mollified error (34) as a function of m. For this plot, we fixed n=100 (solid lines) or 500 (dotted lines) and θ(10,500) (as indicated by different line colors).
Figure 4
Figure 4
(A) Comparison of run times between the hybrid algorithm from (Chen 2019) and the single term asymptotic estimator in (35). 100 iterations were run, each with 10,000 calculations; the time elapsed for each set of 10,000 calculations was recorded and plotted here. The same set of parameters were used for each algorithm. The order of running the algorithms was alternated with each iteration. The dark horizontal line indicates the median, the box indicates the first and third quartiles, the whiskers are drawn at 1.5x the interquartile range, and outliers are represented by open circles. The median for the hybrid algorithm is 62.64 s; the median for the asymptotic algorithm is 1.17 s. (B) Detailed benchmarking for n=100 (open violins) or 500 (gray violins), m(0.1n,0.2n,,0.9n), and θ(10,500). Fold speedup (ratio of the time taken for the hybrid calculator to that taken for the aysmptotic estimator) is plotted on the y-axis. Each dot represents one set of parameters; the violin plots summarize the density of points on the y-axis. Times were calculated for 100 iterations of each estimator for the same parameter values.

Similar articles

References

    1. Casillas S., and Barbadilla A., 2017. Molecular population genetics. Genetics 205: 1003–1035. 10.1534/genetics.116.196493 - DOI - PMC - PubMed
    1. Chen H., Alvarez J. J. S., Ng S. H., Nielsen R., and Zhai W., 2019. Passage adaptation correlates with the reduced efficacy of the influenza vaccine. Clin. Infect. Dis. 69: 1198–1204. 10.1093/cid/ciy1065 - DOI - PubMed
    1. Chen S. L., 2019. Implementation of a Stirling number estimator enables direct calculation of population genetics tests for large sequence datasets. Bioinformatics 35: 2668–2670. 10.1093/bioinformatics/bty1012 - DOI - PubMed
    1. Crane H., 2016. Rejoinder: The ubiquitous Ewens sampling formula. Stat. Sci. 31: 37–39. 10.1214/15-STS544 - DOI
    1. Ewens W. J., 1972. The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87–112. 10.1016/0040-5809(72)90035-4 - DOI - PubMed

Publication types

LinkOut - more resources