Forensic Sci Int Genet. 2026 Jan:80:103336.
doi: 10.1016/j.fsigen.2025.103336. Epub 2025 Aug 5.

Revisiting guidance on population sampling for highly polymorphic STR loci

Sanne E Aalbers et al. Forensic Sci Int Genet. 2026 Jan.

Abstract

Population databases allow us to attach probabilities to DNA evidence by estimating genotype frequencies, which rely on accurate allele frequency estimates. As short tandem repeat (STR) marker sets for human identification have expanded to include more discriminating markers, and especially now that sequencing techniques allow us to distinguish between alleles based on variation in underlying base-pair structure, it is important to reevaluate existing guidance on population database sizes for the estimation of allele frequencies. In this paper, we revisit the topic of population sampling by focusing on the representation of alleles, i.e., whether alleles are observed or not, in a sample of individuals containing data for highly polymorphic autosomal STR loci. The implications of both length- and sequence-based STR data for population sample size are demonstrated, and differences between less and more polymorphic markers are discussed. The consequences of using a limited number of individuals are explored, and the impact of increasing population sample sizes by combining different data sets is shown, to help determine the point at which further sampling may no longer provide significant value. Finally, different approaches for accommodating previously unobserved alleles and their impact on DNA evidence evaluations are discussed.

Keywords: Allele frequency estimation; Forensic sequence data; Population sample size; Population studies.

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Fig. 1.
Graphical representation of population sampling in a forensic setting. A population sample is taken from a population of interest and allele frequencies are reported. Results are displayed per marker and allele, with added sequence variation if available, and grouped by population. Example data are shown for marker TPOX as published in Gettings et al. [1].
Fig. 2.
Graphical depiction of theoretical results for Chakraborty’s population sample size thresholds with a 95 % confidence level. a (left): Population sample size predicted to include a varying number of alleles with frequency of at least p. b (right): Population sample size as a function of the number of alleles at a locus with at least frequency p for different allele frequency thresholds. The black dashed lines indicate Chakraborty’s observation that 300 individuals would be sufficient to observe alleles with at least 1 % frequency.
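
The kind of threshold shown in Fig. 2 can be approximated with a short calculation. The sketch below is a simplified reading and not necessarily the exact formulation in Chakraborty [2]: it assumes a locus has k alleles that each have frequency exactly p, treats their observation as independent across the 2n chromosomes of n sampled individuals, and asks for the smallest n at which all k alleles are seen with the stated confidence. The function names are illustrative.

```python
import math


def prob_all_observed(p: float, k: int, n_individuals: int) -> float:
    """Approximate probability that all k alleles of frequency p are seen.

    Assumption: each allele is missed with probability (1 - p)^(2n), and the
    k alleles are treated as independent, which is only an approximation.
    """
    missed = (1.0 - p) ** (2 * n_individuals)
    return (1.0 - missed) ** k


def min_individuals(p: float, k: int = 1, confidence: float = 0.95) -> int:
    """Smallest number of individuals n with prob_all_observed >= confidence."""
    per_allele = confidence ** (1.0 / k)  # confidence required for each allele
    chromosomes = math.log(1.0 - per_allele) / math.log(1.0 - p)
    return math.ceil(chromosomes / 2.0)


if __name__ == "__main__":
    # Sample size needed as the number of alleles at frequency 1 % grows.
    for k in (1, 5, 10, 20):
        print(k, min_individuals(p=0.01, k=k))
```

Under these assumptions a single allele at 1 % frequency needs 150 individuals for 95 % confidence, and the requirement grows as more alleles at that frequency must all be captured; with 20 such alleles it still stays below 300 individuals.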
Fig. 3.
Locus-specific population sample size thresholds for observing all alleles with frequency of at least 1 % in the NIST 1036 data set with 95 % confidence for length-based data (in blue) and sequence-based data (in red). The Hisp population group size of 236 individuals is indicated with the horizontal dotted line for reference.
Fig. 4.
Allele frequencies expected to be represented with 95 % confidence per locus and population group for sequence-based data for the NIST 1036 data set. Mean values per population group are indicated with dashed lines.
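
The quantity plotted in Fig. 4 can be read as the reverse question: given a fixed number of individuals, which allele frequency would be represented with 95 % confidence? The sketch below solves the simple relation [1 - (1 - p)^(2n)]^k >= confidence for p. This relation, the independence assumption behind it, and the function name are illustrative assumptions; the paper's locus-specific calculation may differ.

```python
def min_represented_frequency(
    n_individuals: int, k: int = 1, confidence: float = 0.95
) -> float:
    """Smallest frequency p at which k alleles are all expected to be observed.

    Assumption: alleles are treated as independent and each is missed with
    probability (1 - p)^(2n) in a sample of n diploid individuals.
    """
    per_allele = confidence ** (1.0 / k)  # confidence required for each allele
    return 1.0 - (1.0 - per_allele) ** (1.0 / (2 * n_individuals))


if __name__ == "__main__":
    # Group sizes mentioned in the captions: Hisp (236) and AfAm (342).
    for label, n in (("Hisp", 236), ("AfAm", 342)):
        print(label, round(min_represented_frequency(n), 5))
```

Larger groups push the represented frequency lower, which is why the smallest population group sets the least favorable bound.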
Fig. 5.
Distribution of the number of common alleles observed per locus across 10,000 replicate subsamples of 50 individuals from the NIST AfAm population group. The total number of common alleles observed in the full AfAm group of 342 individuals is indicated with red diamonds.
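
A subsampling experiment of the kind summarized in Fig. 5 can be sketched as follows. The genotypes here are synthetic and the allele labels and frequencies are invented for illustration; "common" is taken to mean a frequency of at least 1 % in the full group, which is an assumption about the definition used. The function names are hypothetical.

```python
import random
from collections import Counter

# Synthetic allele frequencies for one locus (invented, for illustration only).
EXAMPLE_FREQS = {"8": 0.35, "9": 0.20, "10": 0.15, "11": 0.25, "12": 0.04, "13": 0.01}


def synthetic_genotypes(n, freqs, seed=1):
    """Draw n diploid genotypes by sampling two alleles per individual."""
    rng = random.Random(seed)
    alleles = list(freqs)
    weights = list(freqs.values())
    return [tuple(rng.choices(alleles, weights=weights, k=2)) for _ in range(n)]


def common_alleles(genotypes, threshold=0.01):
    """Alleles whose frequency in the full group is at least the threshold."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * len(genotypes)
    return {allele for allele, c in counts.items() if c / total >= threshold}


def subsample_counts(genotypes, size=50, replicates=1000, threshold=0.01, seed=0):
    """Count how many common alleles appear in each random subsample."""
    rng = random.Random(seed)
    common = common_alleles(genotypes, threshold)
    results = []
    for _ in range(replicates):
        subsample = rng.sample(genotypes, size)  # individuals, without replacement
        seen = {allele for genotype in subsample for allele in genotype}
        results.append(len(seen & common))
    return results, len(common)


if __name__ == "__main__":
    group = synthetic_genotypes(342, EXAMPLE_FREQS)
    counts, total_common = subsample_counts(group, size=50, replicates=1000)
    print("common alleles in full group:", total_common)
    print("distribution over subsamples:", sorted(Counter(counts).items()))
```

Subsamples are drawn at the level of individuals rather than single alleles, so both alleles of a person enter or leave the subsample together.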
Fig. 6.
Rarefaction curves for length-based (LB) and sequence-based (SB) data for a low (top) and high (bottom) polymorphic marker. Results have been plotted per population group (Cauc in blue, AfAm in red, Hisp in purple, Asian in green) by combining the NIST data set with the UNT and KCL data. Black dotted lines indicate the population sample size threshold for observing all common alleles according to Chakraborty’s theory [2].
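
A rarefaction curve gives the expected number of distinct alleles seen as the sample grows. The sketch below uses the standard analytic rarefaction formula for sampling without replacement. It draws individual allele copies (chromosomes) rather than whole individuals, which is a simplifying assumption; the curves in Fig. 6 may have been produced differently, for example by resampling individuals. The allele counts in the example are invented.

```python
from math import comb


def expected_distinct_alleles(allele_counts, m):
    """Expected number of distinct alleles among m copies drawn without replacement.

    allele_counts: observed copy counts per allele in the full sample.
    Each allele is missed with probability C(N - c, m) / C(N, m), where N is
    the total number of copies and c is the count of that allele.
    """
    total = sum(allele_counts)
    if not 0 <= m <= total:
        raise ValueError("m must lie between 0 and the total number of copies")
    denominator = comb(total, m)
    return sum(1.0 - comb(total - c, m) / denominator for c in allele_counts)


if __name__ == "__main__":
    # Invented counts for one locus: a few frequent alleles and several rare ones.
    counts = [120, 90, 60, 20, 5, 2, 1, 1, 1]
    for m in (10, 50, 100, 200, sum(counts)):
        print(m, round(expected_distinct_alleles(counts, m), 2))
```

A curve that is still rising at the full sample size suggests that further sampling would keep revealing new alleles, whereas a plateau suggests diminishing returns.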
Fig. 7.
Distribution of probabilities for the NIST 1036 data set according to NRC II Eqs. 4.1a and 4.1b using a jackknife procedure and the Chakraborty bound to estimate minimum allele frequencies. Results are evaluated using different “databases” (correct one highlighted in gray) constructed for the four different NIST 1036 population groups.

References

    1. Gettings KB, Borsuk LA, Steffen CR, Kiesler KM, Vallone PM, Sequence-based U.S. population data for 27 autosomal STR loci, Forensic Sci. Int. Genet. 37 (2018) 106–115, doi: 10.1016/j.fsigen.2018.07.013.
    2. Chakraborty R, Sample size requirements for addressing the population genetic issues of forensic use of DNA typing, Hum. Biol. 64 (2) (1992) 141–159.
    3. National Research Council, DNA Technology in Forensic Science, National Academies Press, Washington, DC, 1992.
    4. National Research Council, The Evaluation of Forensic DNA Evidence, National Academies Press, Washington, DC, 1996, doi: 10.17226/5141.
    5. Buckleton JS, Bright J, Taylor D, Forensic DNA Evidence Interpretation, second ed., CRC Press, 2016.