PeerJ Comput Sci. 2020 Jan 6;6:e243.

doi: 10.7717/peerj-cs.243. eCollection 2020.

HACSim: an R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves

Jarrett D Phillips¹, Steven H French¹, Robert H Hanner², Daniel J Gillis¹

Affiliations

¹ School of Computer Science, University of Guelph, Guelph, Ontario, Canada.
² Department of Integrative Biology, Biodiversity Institute of Ontario, University of Guelph, Guelph, Ontario, Canada.

PMID: 33816897
PMCID: PMC7924493
DOI: 10.7717/peerj-cs.243

Jarrett D Phillips et al. PeerJ Comput Sci. 2020.

Abstract

Assessing levels of standing genetic variation within species requires robust sampling if specimens are to be identified accurately using molecular techniques such as DNA barcoding; however, statistical estimators for what constitutes a robust sample are currently lacking. Such estimates are needed because most species are represented by only one or a few sequences in existing databases and can safely be assumed to be undersampled. The sample sizes of 5-10 specimens per species typically seen in DNA barcoding studies are often insufficient to capture within-species genetic diversity. Here, we introduce a novel iterative extrapolation simulation algorithm for haplotype accumulation curves, called HACSim (Haplotype Accumulation Curve Simulator), that can be employed to calculate the likely sample sizes needed to observe the full range of DNA barcode haplotype variation that exists for a species. Using uniform and non-uniform haplotype frequency distributions, the notion of sampling sufficiency (the sample size at which sampling accuracy is maximized and above which no new sampling information is likely to be gained) can be illustrated. HACSim can be employed in two primary ways to estimate specimen sample sizes: (1) to simulate haplotype sampling in hypothetical species, and (2) to simulate haplotype sampling in real species mined from public reference sequence databases like the Barcode of Life Data Systems (BOLD) or GenBank for any genomic marker of interest. Although our algorithm is globally convergent, its runtime depends heavily on the initial sample size and on the skewness of the corresponding haplotype frequency distribution.

Keywords: Algorithm; DNA barcoding; Extrapolation; Iterative method; Sampling sufficiency; Species.

Conflict of interest statement

The authors declare there are no competing interests.

Figures

**Figure 1. Modified haplotype network from Phillips, Gillis & Hanner (2019).**
Haplotypes are labelled according to their absolute frequencies, such that the most frequent haplotype is labelled “1”, the second-most frequent haplotype is labelled “2”, and so on. The network illustrates that much species locus variation consists of rare haplotypes at very low frequency (typically represented by only 1 or 2 specimens). Species showing such patterns in their haplotype distributions are therefore probably grossly under-represented in public sequence databases like BOLD and GenBank.

**Figure 2. Schematic of the HACSim optimization algorithm (setup, initialization and iteration).**
Shown is a hypothetical example for a species mined from a biological sequence database like BOLD or GenBank with N = 5 sampled specimens (DNA sequences) possessing H* = 5 unique haplotypes. Each haplotype has an associated numeric ID from 1 to H* (here, 1-5). Haplotype labels are randomly assigned to the cells of a two-dimensional array (ARRAY) with perms rows and N columns. All haplotypes occur with a frequency of 20% (i.e., probs = (1/5, 1/5, 1/5, 1/5, 1/5)). Specimen and haplotype information is then fed into a black box that iteratively optimizes the likely required sample size (N*) needed to capture at least a proportion p of the haplotypes observed in the species sample.
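The array setup described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not HACSim's own R code: the function names `simulate_array` and `mean_accumulation` are hypothetical, and filling each cell by drawing a label independently (with replacement) according to probs is an assumption about how the random assignment works.

```python
import random


def simulate_array(n, h_star, probs, perms, seed=None):
    """Return a perms x n table of haplotype labels (1..h_star) drawn using probs."""
    rng = random.Random(seed)
    labels = list(range(1, h_star + 1))
    # Assumption: every cell is an independent weighted draw.
    return [rng.choices(labels, weights=probs, k=n) for _ in range(perms)]


def mean_accumulation(array):
    """Mean number of distinct haplotypes seen after 1, 2, ..., n specimens."""
    n = len(array[0])
    totals = [0] * n
    for row in array:
        seen = set()
        for j, hap in enumerate(row):
            seen.add(hap)
            totals[j] += len(seen)
    return [t / len(array) for t in totals]


# Values from the caption: N = 5, H* = 5, equal frequencies of 1/5.
array = simulate_array(n=5, h_star=5, probs=[1 / 5] * 5, perms=1000, seed=1)
curve = mean_accumulation(array)
print([round(h, 2) for h in curve])
```

Averaging over the rows gives one point of the haplotype accumulation curve per column; the first point is always 1 because a single specimen carries exactly one haplotype.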

**Figure 3. Iterative extrapolation algorithm pseudocode for the computation of taxon sampling sufficiency employed within HACSim.**
A user must supply N, H* and probs to run simulations. The other function arguments required by the algorithm have default values and need not be supplied unless the user wishes to change them.

**Figure 4. Graphical depiction of the iterative extrapolation sampling model as described in detail herein.**
The figure is modified from Phillips, Gillis & Hanner (2019). The x-axis depicts the number of specimens sampled, whereas the y-axis conveys the cumulative number of unique haplotypes uncovered for every additional individual that is randomly sampled. N_i and H_i refer, respectively, to the specimen and haplotype numbers observed at each iteration (i) of HACSim for a given species. N* is the total sample size needed to capture all H* haplotypes that exist for a species.
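The iteration over N_i and H_i can be sketched as follows. This is a hedged illustration only: the update rule used here, a linear extrapolation N_{i+1} = ceil(N_i · H* / H_i), is an assumption and is not stated in the captions; the function names are hypothetical; and the frequency vector (three haplotypes at 30% each, seven rare haplotypes sharing the remaining 10%) is a made-up example of a species with three dominant haplotypes.

```python
import math
import random


def mean_haplotypes(n, probs, perms, rng):
    """Average number of distinct haplotypes over perms random samples of size n."""
    labels = list(range(1, len(probs) + 1))
    total = 0
    for _ in range(perms):
        total += len(set(rng.choices(labels, weights=probs, k=n)))
    return total / perms


def iterate_sample_size(n, probs, p=0.95, perms=1000, seed=None, max_iter=100):
    """Grow n until the mean fraction of haplotypes recovered reaches p."""
    rng = random.Random(seed)
    h_star = len(probs)
    history = []
    for _ in range(max_iter):
        h_i = mean_haplotypes(n, probs, perms, rng)
        r = h_i / h_star  # fraction of haplotypes recovered at this iteration
        history.append((n, h_i, r))
        if r >= p:
            break
        # ASSUMED update rule: linear extrapolation, forced to grow by at least 1.
        n = max(n + 1, math.ceil(n * h_star / h_i))
    return n, history


# Hypothetical frequencies: three dominant haplotypes, seven rare ones.
probs = [0.30] * 3 + [0.10 / 7] * 7
n_star, history = iterate_sample_size(100, probs, seed=42)
for n_i, h_i, r_i in history:
    print(f"N_i = {n_i:4d}  H_i = {h_i:5.2f}  R = {r_i:.1%}")
```

Because each step is a Monte Carlo estimate, the stopping sample size varies slightly with the seed and with perms; more skewed frequency vectors need more iterations before the recovery fraction reaches p.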

**Figure 5. Graphical output of HAC.sim() for a hypothetical species with equal haplotype frequencies.**
(A) Iterated haplotype accumulation curve. (B) Corresponding haplotype frequency barplot. For the generated haplotype accumulation curve, the 95% confidence interval for the number of unique haplotypes accumulated is depicted by gray error bars. Dashed lines depict the observed number of haplotypes (i.e., RH*) and corresponding number of individuals sampled found at each iteration of the algorithm. The dotted line depicts the expected number of haplotypes for a given haplotype recovery level (here, p = 95%) (i.e., pH*). In this example, R = 100% of the H* = 10 estimated haplotypes have been recovered for this species based on a sample size of only N = 100 specimens.

**Figure 6. Initial graphical output of HAC.sim() for a hypothetical species having three dominant haplotypes.**
(A) Specimens sampled; (B) Unique haplotypes. In this example, initially, only R = 83.3% of the H* = 10 estimated haplotypes have been recovered for this species based on a sample size of N = 100 specimens.

**Figure 7. Final graphical output of HAC.sim() for a hypothetical species having three dominant haplotypes.**
(A) Specimens sampled; (B) Unique haplotypes. In this example, upon convergence, R = 95.4% of the H* = 10 estimated haplotypes have been recovered for this species based on a sample size of N = 180 specimens.

**Figure 8. Initial haplotype frequency distribution for N = 235 high-quality lake whitefish (*Coregonus clupeaformis*) COI barcode sequences obtained from BOLD.**
This species displays a highly skewed pattern of observed haplotype variation, with Haplotype 1 accounting for c. 91.5% (215/235) of all sampled records.
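A frequency distribution like this one can be derived by collapsing identical sequences into haplotypes and ranking them by count. The sketch below is illustrative: the helper `haplotype_frequencies` and the toy sequences are hypothetical, and treating two sequences as the same haplotype only when their strings match exactly is a simplifying assumption that ignores alignment, missing data and ambiguous bases.

```python
from collections import Counter


def haplotype_frequencies(sequences):
    """Rank haplotypes by count; return (rank, count, relative frequency) tuples."""
    counts = Counter(sequences)
    n = len(sequences)
    return [
        (rank, count, count / n)
        for rank, (_seq, count) in enumerate(counts.most_common(), start=1)
    ]


# Toy data: ten made-up sequences forming three haplotypes.
toy = ["ACGT"] * 6 + ["ACGA"] * 3 + ["TCGA"]
freqs = haplotype_frequencies(toy)
for rank, count, freq in freqs:
    print(f"Haplotype {rank}: {count} specimens ({freq:.1%})")

# Share of the dominant haplotype reported in the caption (215 of 235 records).
dominant_share = 215 / 235
print(f"{dominant_share:.1%}")  # 91.5%
```

The relative frequencies in the third column are what a probs vector would hold for a real species.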

**Figure 9. Initial graphical output of HAC.sim() for a real species (lake whitefish, *C. clupeaformis*) having a single dominant haplotype.**
(A) Specimens sampled; (B) Unique haplotypes. In this example, initially, only R = 73.8% of the H* = 15 estimated haplotypes for this species have been recovered based on a sample size of N = 235 specimens. The haplotype frequency barplot is identical to that of Fig. 8.

**Figure 10. Final graphical output of HAC.sim() for lake whitefish (*C. clupeaformis*) having a single dominant haplotype.**
(A) Specimens sampled; (B) Unique haplotypes. Upon convergence, R = 95.8% of the H* = 15 estimated haplotypes for this species have been uncovered with a sample size of N = 604 specimens.

**Figure 11. Initial haplotype frequency distribution for N = 349 high-quality deer tick (*Ixodes scapularis*) COI barcode sequences obtained from BOLD.**
In this species, Haplotypes 1-8 account for c. 51.3% (179/349) of all sampled records.

**Figure 12. Initial graphical output of HAC.sim() for a real species (deer tick, *I. scapularis*) having eight dominant haplotypes.**
In this example, initially, only R = 78.7% of the H* = 83 estimated haplotypes for this species have been recovered based on a sample size of N = 349 specimens. The haplotype frequency barplot is identical to that of Fig. 11.

**Figure 13. Final graphical output of HAC.sim() for deer tick (*I. scapularis*) having eight dominant haplotypes.**
Upon convergence, R = 95.4% of the H* = 83 estimated haplotypes for this species have been uncovered with a sample size of N = 803 specimens.

**Figure 14. Initial haplotype frequency distribution for N = 171 high-quality scalloped hammerhead (*Sphyrna lewini*) COI barcode sequences obtained from BOLD.**
In this species, Haplotypes 1–3 account for c. 87.7% (150/171) of all sampled records.

**Figure 15. Initial graphical output of HAC.sim() for a real species (scalloped hammerhead, *S. lewini*) having three dominant haplotypes.**
In this example, initially, only R = 82.6% of the H* = 12 estimated haplotypes for this species have been recovered based on a sample size of N = 171 specimens. The haplotype frequency barplot is identical to that of Fig. 14.

**Figure 16. Final graphical output of HAC.sim() for scalloped hammerhead (*S. lewini*) having three dominant haplotypes.**
Upon convergence, R = 95.6% of the H* = 12 estimated haplotypes for this species have been uncovered with a sample size of N = 414 specimens.
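The initial and final values reported in the captions for Figs. 6-7, 9-10, 12-13 and 15-16 can be compared with a little arithmetic. The sketch below converts each recovery fraction R into a mean haplotype count (R · H*), computes the target p · H*, and gives the ratio of final to initial sample size. The numbers are copied from the captions; the helper `summarize` is hypothetical, and p = 0.95 is stated only for Fig. 5 and is assumed here for the other examples.

```python
P = 0.95  # assumed recovery level; stated explicitly only in the Fig. 5 caption

# (label, H*, initial N, initial R, final N, final R), as reported in the captions
REPORTED = [
    ("hypothetical, three dominant haplotypes", 10, 100, 0.833, 180, 0.954),
    ("lake whitefish", 15, 235, 0.738, 604, 0.958),
    ("deer tick", 83, 349, 0.787, 803, 0.954),
    ("scalloped hammerhead", 12, 171, 0.826, 414, 0.956),
]


def summarize(h_star, n_init, r_init, n_final, r_final, p=P):
    """Derive mean haplotype counts, the target count and the sample-size ratio."""
    return {
        "haplotypes_initial": r_init * h_star,
        "haplotypes_final": r_final * h_star,
        "target": p * h_star,
        "growth": n_final / n_init,
    }


summaries = {label: summarize(*values) for label, *values in REPORTED}
for label, s in summaries.items():
    print(
        f"{label:40s} "
        f"{s['haplotypes_initial']:6.2f} -> {s['haplotypes_final']:6.2f} "
        f"(target {s['target']:6.2f}), N grew x{s['growth']:.2f}"
    )
```

In every reported case the final mean haplotype count sits just above the target, and the required sample size is roughly 1.8 to 2.6 times the initial one.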

Cited by

- Villalta I, Ledet R, Baude M, Genoud D, Bouget C, Cornillon M, Moreau S, Courtial B, Lopez-Vaamonde C. A DNA barcode-based survey of wild urban bees in the Loire Valley, France. Sci Rep. 2021 Feb 26;11(1):4770. doi: 10.1038/s41598-021-83631-0. PMID: 33637824.
- Harned SP, Bernard AM, Salinas-de-León P, Mehlrose MR, Suarez J, Robles Y, Bessudo S, Ladino F, López Garo A, Zanella I, Feldheim KA, Shivji MS. Genetic population dynamics of the critically endangered scalloped hammerhead shark (Sphyrna lewini) in the Eastern Tropical Pacific. Ecol Evol. 2022 Dec 28;12(12):e9642. doi: 10.1002/ece3.9642. PMID: 36619714.
- Vaiyapuri T, Binbusayyis A. Application of deep autoencoder as an one-class classifier for unsupervised network intrusion detection: a comparative evaluation. PeerJ Comput Sci. 2020 Dec 7;6:e327. doi: 10.7717/peerj-cs.327. PMID: 33816977.
- Leigh DM, van Rees CB, Millette KL, Breed MF, Schmidt C, Bertola LD, Hand BK, Hunter ME, Jensen EL, Kershaw F, Liggins L, Luikart G, Manel S, Mergeay J, Miller JM, Segelbacher G, Hoban S, Paz-Vinas I. Opportunities and challenges of macrogenetic studies. Nat Rev Genet. 2021 Dec;22(12):791-807. doi: 10.1038/s41576-021-00394-0. PMID: 34408318.
- Phillips JD, Athey TBT, McNicholas PD, Hanner RH. VLF: An R package for the analysis of very low frequency variants in DNA sequences. Biodivers Data J. 2023 Jan 26;11:e96480. doi: 10.3897/BDJ.11.e96480. PMID: 38327328.

Associated data

figshare/10.6084/m9.figshare.8870804.v1
