PeerJ Comput Sci. 2020 Jan 6;6:e243.
doi: 10.7717/peerj-cs.243. eCollection 2020.

HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves

Jarrett D Phillips et al. PeerJ Comput Sci. 2020.

Abstract

Assessing levels of standing genetic variation within species requires robust sampling for accurate specimen identification using molecular techniques such as DNA barcoding; however, statistical estimators for what constitutes a robust sample are currently lacking. Such estimates are needed because most species are currently represented by only one or a few sequences in existing databases and can safely be assumed to be undersampled. Unfortunately, the sample sizes of 5-10 specimens per species typically seen in DNA barcoding studies are often insufficient to adequately capture within-species genetic diversity. Here, we introduce a novel iterative extrapolation simulation algorithm of haplotype accumulation curves, called HACSim (Haplotype Accumulation Curve Simulator), that can be employed to calculate likely sample sizes needed to observe the full range of DNA barcode haplotype variation that exists for a species. Using uniform and non-uniform haplotype frequency distributions, the notion of sampling sufficiency (the sample size at which sampling accuracy is maximized and above which no new sampling information is likely to be gained) can be gleaned. HACSim can be employed in two primary ways to estimate specimen sample sizes: (1) to simulate haplotype sampling in hypothetical species, and (2) to simulate haplotype sampling in real species mined from public reference sequence databases like the Barcode of Life Data Systems (BOLD) or GenBank for any genomic marker of interest. While our algorithm is globally convergent, runtime depends heavily on initial sample sizes and on the skewness of the corresponding haplotype frequency distribution.

Keywords: Algorithm; DNA barcoding; Extrapolation; Iterative method; Sampling sufficiency; Species.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

Figure 1. Modified haplotype network from Phillips, Gillis & Hanner (2019).
Haplotypes are labelled according to their absolute frequencies, such that the most frequent haplotype is labelled “1”, the second-most frequent haplotype is labelled “2”, etc. The network illustrates that much species locus variation consists of rare haplotypes at very low frequency (typically represented by only 1 or 2 specimens). Thus, species showing such patterns in their haplotype distributions are probably grossly underrepresented in public sequence databases like BOLD and GenBank.
Figure 2. Schematic of the HACSim optimization algorithm (setup, initialization and iteration).
Shown is a hypothetical example for a species mined from a biological sequence database like BOLD or GenBank with N = 5 sampled specimens (DNA sequences) possessing H* = 5 unique haplotypes. Each haplotype has an associated numeric ID from 1 to H* (here, 1-5). Haplotype labels are randomly assigned to cells of a two-dimensional spatial array (ARRAY) with perms rows and N columns. All haplotypes occur with a frequency of 20% (i.e., probs = (1/5, 1/5, 1/5, 1/5, 1/5)). Specimen and haplotype information is then fed into a black box to iteratively optimize the likely required sample size (N*) needed to capture a proportion of at least p haplotypes observed in the species sample.
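The setup described in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not HACSim's R implementation: the function names `build_array` and `mean_accumulation_curve`, the choice of perms = 1000 and the fixed random seed are assumptions made for the example. Only N, H*, probs, perms and ARRAY come from the caption.

```python
import random


def build_array(n, probs, perms, seed=0):
    """Return a perms x n array of haplotype IDs (1..H*) drawn according to probs."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility only
    labels = list(range(1, len(probs) + 1))
    return [rng.choices(labels, weights=probs, k=n) for _ in range(perms)]


def mean_accumulation_curve(array):
    """Average, over rows, the running count of distinct haplotypes per specimen sampled."""
    n = len(array[0])
    totals = [0] * n
    for row in array:
        seen = set()
        for j, hap in enumerate(row):
            seen.add(hap)
            totals[j] += len(seen)
    return [t / len(array) for t in totals]


# Figure 2 example: N = 5 specimens, H* = 5 haplotypes, each at 20%.
N, H_STAR = 5, 5
probs = [1 / H_STAR] * H_STAR
ARRAY = build_array(N, probs, perms=1000)  # perms = 1000 is a hypothetical setting
curve = mean_accumulation_curve(ARRAY)
print(curve)
```

Each row of `ARRAY` is one random ordering of sampled specimens, and the last entry of `curve` estimates how many of the H* haplotypes a sample of size N is expected to recover.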
Figure 3. Iterative extrapolation algorithm pseudocode for the computation of taxon sampling sufficiency employed within HACSim.
A user must input N, H* and probs to run simulations. Other function arguments required by the algorithm have default values and need not be supplied unless the user wishes to change them.
Figure 4. Graphical depiction of the iterative extrapolation sampling model as described in detail herein.
The figure is modified from Phillips, Gillis & Hanner (2019). The x-axis depicts the number of specimens sampled, whereas the y-axis conveys the cumulative number of unique haplotypes uncovered for every additional individual that is randomly sampled. Ni and Hi refer respectively to the specimen and haplotype numbers that are observed at each iteration (i) of HACSim for a given species. N* is the total sample size that is needed to capture all H* haplotypes that exist for a species.
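The iteration over Ni and Hi can be illustrated with a short, deterministic Python sketch. Several parts are assumptions not stated in the caption: the update rule N_{i+1} = ceil(N_i · H* / H_i) is a simple linear extrapolation chosen for illustration, the closed-form expectation in `expected_haplotypes` stands in for HACSim's permutation-based simulation, and the probs vector (three haplotypes at 30% each, seven rare ones sharing the remaining 10%) is hypothetical.

```python
import math


def expected_haplotypes(n, probs):
    """Expected number of distinct haplotypes among n specimens drawn with replacement."""
    return sum(1 - (1 - q) ** n for q in probs)


def extrapolate_sample_size(n, probs, p=0.95, max_iter=1000):
    """Iterate N_i -> N_{i+1} until the recovered fraction H_i / H* reaches p."""
    h_star = len(probs)
    for _ in range(max_iter):
        h_i = expected_haplotypes(n, probs)
        if h_i / h_star >= p:
            return n, h_i / h_star
        # Assumed update rule: extrapolate linearly along the current slope H_i / N_i.
        n = math.ceil(n * h_star / h_i)
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical species: three dominant haplotypes plus seven rare ones, starting at N = 100.
probs = [0.3, 0.3, 0.3] + [0.1 / 7] * 7
n_star, recovered = extrapolate_sample_size(100, probs)
print(n_star, round(recovered, 3))
```

Because this sketch uses expected values rather than random permutations, its result need not match the stochastic output of HAC.sim() exactly. With equal frequencies and N = 100, the stopping criterion is already met at the first iteration.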
Figure 5. Graphical output of HAC.sim() for a hypothetical species with equal haplotype frequencies.
(A) Iterated haplotype accumulation curve. (B) Corresponding haplotype frequency barplot. For the generated haplotype accumulation curve, the 95% confidence interval for the number of unique haplotypes accumulated is depicted by gray error bars. Dashed lines depict the observed number of haplotypes (i.e., RH*) and corresponding number of individuals sampled found at each iteration of the algorithm. The dotted line depicts the expected number of haplotypes for a given haplotype recovery level (here, p = 95%) (i.e., pH*). In this example, R = 100% of the H* = 10 estimated haplotypes have been recovered for this species based on a sample size of only N = 100 specimens.
Figure 6. Initial graphical output of HAC.sim() for a hypothetical species having three dominant haplotypes.
(A) Specimens sampled; (B) Unique haplotypes. In this example, initially, only R = 83.3% of the H* = 10 estimated haplotypes have been recovered for this species based on a sample size of N = 100 specimens.
Figure 7. Final graphical output of HAC.sim() for a hypothetical species having three dominant haplotypes.
(A) Specimens sampled; (B) Unique haplotypes. In this example, upon convergence, R = 95.4% of the H* = 10 estimated haplotypes have been recovered for this species based on a sample size of N = 180 specimens.
Figure 8. Initial haplotype frequency distribution for N = 235 high-quality lake whitefish (Coregonus clupeaformis) COI barcode sequences obtained from BOLD.
This species displays a highly skewed pattern of observed haplotype variation, with Haplotype 1 accounting for c. 91.5% (215/235) of all sampled records.
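A haplotype frequency distribution of this kind can be derived from a set of sequences by counting identical records. The Python sketch below is illustrative only: `haplotype_frequencies` is a hypothetical helper, the toy sequences are made up, and it assumes that sequences are already aligned, trimmed to equal length and free of ambiguous bases, so that identical strings represent the same haplotype.

```python
from collections import Counter


def haplotype_frequencies(sequences):
    """Rank distinct sequences by abundance; return (counts, probs), most frequent first."""
    counts = [c for _, c in Counter(sequences).most_common()]
    total = sum(counts)
    return counts, [c / total for c in counts]


# Toy, made-up sequences (not real COI barcodes): three haplotypes among six specimens.
toy = ["ACGT", "ACGT", "ACGT", "ACGA", "ACGT", "TCGT"]
counts, probs = haplotype_frequencies(toy)
print(counts)  # position 0 corresponds to Haplotype 1, the most frequent
print(probs)

# Share of the dominant haplotype quoted in the caption: 215 of 235 records.
print(round(100 * 215 / 235, 1))
```

The resulting `probs` list has the same shape as the probs argument described for the simulations, with haplotypes ordered from most to least frequent.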
Figure 9. Initial graphical output of HAC.sim() for a real species (Lake whitefish, C. clupeaformis) having a single dominant haplotype.
(A) Specimens sampled; (B) Unique haplotypes. In this example, initially, only R = 73.8% of the H* = 15 estimated haplotypes for this species have been recovered based on a sample size of N = 235 specimens. The haplotype frequency barplot is identical to that of Fig. 8.
Figure 10. Final graphical output of HAC.sim() for Lake whitefish (C. clupeaformis) having a single dominant haplotype.
(A) Specimens sampled; (B) Unique haplotypes. Upon convergence, R = 95.8% of the H* = 15 estimated haplotypes for this species have been uncovered with a sample size of N = 604 specimens.
Figure 11. Initial haplotype frequency distribution for N = 349 high-quality deer tick (Ixodes scapularis) COI barcode sequences obtained from BOLD.
In this species, Haplotypes 1-8 account for c. 51.3% (179/349) of all sampled records.
Figure 12. Initial graphical output of HAC.sim() for a real species (Deer tick, I. scapularis) having eight dominant haplotypes.
In this example, initially, only R = 78.7% of the H* = 83 estimated haplotypes for this species have been recovered based on a sample size of N = 349 specimens. The haplotype frequency barplot is identical to that of Fig. 11.
Figure 13. Final graphical output of HAC.sim() for deer tick (I. scapularis) having eight dominant haplotypes.
Upon convergence, R = 95.4% of the H* = 83 estimated haplotypes for this species have been uncovered with a sample size of N = 803 specimens.
Figure 14. Initial haplotype frequency distribution for N = 171 high-quality scalloped hammerhead (Sphyrna lewini) COI barcode sequences obtained from BOLD.
In this species, Haplotypes 1–3 account for c. 87.7% (150/171) of all sampled records.
Figure 15. Initial graphical output of HAC.sim() for a real species (Scalloped hammerhead, S. lewini) having three dominant haplotypes.
In this example, initially, only R = 82.6% of the H* = 12 estimated haplotypes for this species have been recovered based on a sample size of N = 171 specimens. The haplotype frequency barplot is identical to that of Fig. 14.
Figure 16. Final graphical output of HAC.sim() for scalloped hammerhead (S. lewini) having three dominant haplotypes.
Upon convergence, R = 95.6% of the H* = 12 estimated haplotypes for this species have been uncovered with a sample size of N = 414 specimens.
