Ecol Evol. 2015 Dec 1;5(24):5869-79.
doi: 10.1002/ece3.1846. eCollection 2015 Dec.

A simulation study of sample size for DNA barcoding

Arong Luo et al. Ecol Evol.

Abstract

For some groups of organisms, DNA barcoding can provide a useful tool in taxonomy, evolutionary biology, and biodiversity assessment. However, the efficacy of DNA barcoding depends on the degree of sampling per species, because a large enough sample size is needed to provide a reliable estimate of genetic polymorphism and for delimiting species. We used a simulation approach to examine the effects of sample size on four estimators of genetic polymorphism related to DNA barcoding: mismatch distribution, nucleotide diversity, the number of haplotypes, and maximum pairwise distance. Our results showed that mismatch distributions derived from subsamples of ≥20 individuals usually bore a close resemblance to that of the full dataset. Estimates of nucleotide diversity from subsamples of ≥20 individuals tended to be bell-shaped around that of the full dataset, whereas estimates from smaller subsamples were not. As expected, greater sampling generally led to an increase in the number of haplotypes. We also found that subsamples of ≥20 individuals allowed a good estimate of the maximum pairwise distance of the full dataset, while smaller ones were associated with a high probability of underestimation. Overall, our study confirms the expectation that larger samples are beneficial for the efficacy of DNA barcoding and suggests that a minimum sample size of 20 individuals is needed in practice for each population.
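The subsampling procedure summarized above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names (`pairwise_differences`, `summarize`, `subsample_summaries`) are hypothetical, distances are taken as uncorrected proportions of differing sites (the study may have used a different distance measure), and the four toy sequences stand in for a simulated dataset.

```python
import random
from collections import Counter
from itertools import combinations


def pairwise_differences(seqs):
    """Count differing sites for every pair of aligned, equal-length sequences."""
    return [sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)]


def summarize(seqs):
    """Return the four estimators named in the abstract for one set of sequences.

    Assumption: distances are uncorrected p-distances (differences per site).
    """
    diffs = pairwise_differences(seqs)
    length = len(seqs[0])
    return {
        # Frequency of each pairwise difference count
        "mismatch_distribution": Counter(diffs),
        # Mean pairwise differences per site
        "nucleotide_diversity": sum(diffs) / len(diffs) / length,
        # Number of distinct sequences
        "n_haplotypes": len(set(seqs)),
        # Largest pairwise distance per site
        "max_pairwise_distance": max(diffs) / length,
    }


def subsample_summaries(seqs, size, repeats=100, seed=0):
    """Draw `repeats` random subsamples without replacement and summarize each."""
    rng = random.Random(seed)
    return [summarize(rng.sample(seqs, size)) for _ in range(repeats)]


# Toy alignment (hypothetical data, far smaller than a real barcode dataset)
toy = ["AAAA", "AAAT", "AATT", "AAAA"]
full = summarize(toy)
subs = subsample_summaries(toy, size=2, repeats=20)
```

Comparing each entry of `subs` with `full` reproduces the kind of check the abstract reports: a subsample can never exceed the full dataset's maximum pairwise distance or haplotype count, so small subsamples tend to underestimate both.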

Keywords: Coalescence; haplotype; maximum pairwise distance; mismatch distribution; nucleotide diversity.

Figures

Figure 1
Split information around internal nodes of four chosen genealogies. The x‐axis represents seven internal nodes beginning at the root, while the y‐axis represents the size of the larger daughter clade. Among the ten trees consisting of 500 tips, data are shown here for tree_A (blue solid circles), tree_B (green solid circles), tree_F (red solid circles), and tree_I (yellow solid circles). Empty black circles represent data from a balanced tree topology.
Figure 2
Mismatch distributions together with kernel density estimates of dataset seq_I and its subsamples. Only the result from one randomly chosen subsample of each size is shown here.
Figure 3
Histograms showing distributions of nucleotide diversity values of subsamples from dataset seq_J. The blue curves are from kernel density estimates, while the red vertical lines indicate nucleotide diversity of the full dataset.
Figure 4
(A) Boxplots showing the numbers of haplotypes for every 100 repeats of subsamples of the same size from dataset seq_C. The x‐axis denotes the sample size, while the y‐axis represents the detailed number of haplotypes. (B) Ten asymptotic‐logarithm curves corresponding to the ten Michaelis‐Menten equations, which were estimated from the median values in boxplots of datasets from seq_A to seq_J.
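A Michaelis-Menten curve of the kind mentioned in panel B has the form H(n) = h_max * n / (k + n), where n is the sample size and H(n) the median number of haplotypes. The sketch below fits that form with the Hanes-Woolf linearization, n / H = n / h_max + k / h_max, using ordinary least squares. The fitting method is an assumption chosen for simplicity; the caption does not say how the authors estimated their equations, and the function name, parameter names, and example values are hypothetical.

```python
def fit_michaelis_menten(sizes, counts):
    """Fit H(n) = h_max * n / (k + n) to (sample size, haplotype count) pairs.

    Uses the Hanes-Woolf linearization n / H = n / h_max + k / h_max and
    ordinary least squares of n / H on n. Returns (h_max, k).
    """
    ys = [n / h for n, h in zip(sizes, counts)]
    m = len(sizes)
    mean_x = sum(sizes) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, ys))
    slope = sxy / sxx                  # equals 1 / h_max
    intercept = mean_y - slope * mean_x  # equals k / h_max
    h_max = 1 / slope
    k = intercept * h_max
    return h_max, k


# Hypothetical medians generated from a known curve (h_max = 40, k = 25)
sizes = [5, 10, 20, 50, 100]
medians = [40 * n / (25 + n) for n in sizes]
h_max, k = fit_michaelis_menten(sizes, medians)
```

Here `h_max` estimates the asymptotic number of haplotypes and `k` the sample size at which half of that asymptote is reached. With noisy medians, a nonlinear least-squares fit would usually be preferable to the linearization.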
Figure 5
Histograms showing distributions of maximum pairwise distances of subsamples from dataset seq_E. The red vertical lines indicate maximum pairwise distance of the full dataset.
