Revisiting guidance on population sampling for highly polymorphic STR loci

Sanne E Aalbers¹, Katherine B Gettings²

Affiliations

¹ Department of Chemistry and Biochemistry, University of Maryland, College Park, MD 20740, USA; US National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Drive, Gaithersburg, MD 20899, USA. Electronic address: saalbers@umd.edu.
² US National Institute of Standards and Technology, Biomolecular Measurement Division, 100 Bureau Drive, Gaithersburg, MD 20899, USA.

PMID: 40779965
PMCID: PMC12382342
DOI: 10.1016/j.fsigen.2025.103336

Sanne E Aalbers et al. Forensic Sci Int Genet. 2026 Jan;80:103336. doi: 10.1016/j.fsigen.2025.103336. Epub 2025 Aug 5.

Abstract

Population databases allow us to attach probabilities to DNA evidence by the estimation of genotype frequencies, which rely on accurate allele frequency estimates. As short tandem repeat (STR) marker sets for human identification have expanded to include more discriminating markers, and especially now that sequencing techniques allow us to distinguish between alleles based on variation in underlying base-pair structure, it is important to reevaluate existing guidance on population database sizes for the estimation of allele frequencies. In this paper, we revisit the topic of population sampling by focusing on the representation of alleles, i.e., whether alleles are observed or not, in a sample of individuals containing data for highly polymorphic autosomal STR loci. The implications of both length- and sequence-based STR data for population sample size are demonstrated, and differences between less and more polymorphic markers are discussed. The consequences of using a limited number of individuals are explored, and the impact of increasing population sample sizes by combining different data sets is shown to help determine the point at which further sampling may no longer provide significant value. Finally, different approaches for accommodating previously unobserved alleles and their impact on DNA evidence evaluations are discussed.

Keywords: Allele frequency estimation; Forensic sequence data; Population sample size; Population studies.
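The notion of allele representation in the abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes that the 2n allele copies carried by n sampled individuals are independent draws, and it uses a simple union bound to cover k alleles at once, which may differ from the exact formulation in Chakraborty [2]. The function names `prob_allele_observed` and `min_sample_size` are hypothetical.

```python
import math


def prob_allele_observed(p: float, n_individuals: int) -> float:
    """Probability that an allele with frequency p appears at least once
    among the 2n allele copies of n individuals (independence assumed)."""
    return 1.0 - (1.0 - p) ** (2 * n_individuals)


def min_sample_size(p: float, k: int = 1, confidence: float = 0.95) -> int:
    """Smallest n such that k alleles, each with frequency >= p, are all
    observed with at least the given confidence.

    Assumption: uses the union bound k * (1 - p)^(2n) <= 1 - confidence,
    a conservative stand-in rather than the paper's own threshold.
    """
    alpha = 1.0 - confidence
    copies_needed = math.log(alpha / k) / math.log(1.0 - p)
    return math.ceil(copies_needed / 2)


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, min_sample_size(0.01, k))
```

Under these assumptions the required sample size grows with the number of alleles that must be captured, which is one way to see why more polymorphic loci call for larger population samples.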

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Fig. 1.**
Graphical representation of population sampling in a forensic setting. A population sample is taken from a population of interest and allele frequencies are reported. Results are displayed per marker and allele, with added sequence variation if available, and grouped by population. Example data has been displayed for marker TPOX as published in Gettings et al. [1].

**Fig. 2.**
Graphical depiction of theoretical results for Chakraborty’s population sample size thresholds with a 95 % confidence level. a (left): Population sample size predicted to include a varying number of alleles with frequency of at least $p$. b (right): Population sample size as a function of the number of alleles at a locus with at least frequency $p$ for different allele frequency thresholds. The black dashed lines indicate Chakraborty’s observation that 300 individuals would be sufficient to observe alleles with at least 1 % frequency.

**Fig. 3.**
Locus-specific population sample size thresholds for observing all alleles with frequency of at least 1 % in the NIST 1036 data set with 95 % confidence for length-based data (in blue) and sequence-based data (in red). The Hisp population group size of 236 individuals is indicated with the horizontal dotted line for reference.

**Fig. 4.**
Allele frequencies expected to be represented with 95 % confidence per locus and population group for sequence-based data for the NIST 1036 data set. Mean values per population group are indicated with dashed lines.

**Fig. 5.**
Distribution of the observed number of common alleles per locus within 10,000 replicates of subsamples consisting of 50 individuals from the NIST AfAm population group. The total number of common alleles observed in the full NIST AfAm population group of 342 individuals is indicated with red diamonds.

**Fig. 6.**
Rarefaction curves for length-based (LB) and sequence-based (SB) data for a low (top) and high (bottom) polymorphic marker. Results have been plotted per population group (Cauc in blue, AfAm in red, Hisp in purple, Asian in green) by combining the NIST data set with the UNT and KCL data. Black dotted lines indicate the population sample size threshold for observing all common alleles according to Chakraborty’s theory [2].

**Fig. 7.**
Distribution of probabilities for the NIST 1036 data set according to NRC II Eqs. 4.1a and 4.1b using a jackknife procedure and the Chakraborty bound to estimate minimum allele frequencies. Results are evaluated using different “databases” (correct one highlighted in gray) constructed for the four different NIST 1036 population groups.
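The subsampling and rarefaction analyses described in the captions of Figs. 5 and 6 can be sketched roughly as follows. This is only an illustration on synthetic data: the allele frequencies, the sample sizes, the replicate count, and all function names are hypothetical and are not drawn from the NIST, UNT, or KCL data sets or from the authors' code.

```python
import random
from statistics import mean


def simulate_population(freqs, n_individuals, rng):
    """Draw two alleles per individual from hypothetical allele frequencies."""
    alleles = list(freqs)
    weights = [freqs[a] for a in alleles]
    return [tuple(rng.choices(alleles, weights=weights, k=2))
            for _ in range(n_individuals)]


def distinct_alleles(genotypes):
    """Count the distinct alleles seen in a list of genotypes."""
    return len({allele for genotype in genotypes for allele in genotype})


def rarefaction_curve(genotypes, sizes, replicates, rng):
    """Mean number of distinct alleles seen in random subsamples
    (drawn without replacement) of each requested size."""
    curve = {}
    for m in sizes:
        counts = [distinct_alleles(rng.sample(genotypes, m))
                  for _ in range(replicates)]
        curve[m] = mean(counts)
    return curve


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical frequencies for a single length-based locus.
freqs = {"11": 0.40, "12": 0.30, "13": 0.20, "14": 0.07, "15": 0.02, "16": 0.01}
population = simulate_population(freqs, 300, rng)
curve = rarefaction_curve(population, sizes=[10, 50, 100, 300],
                          replicates=200, rng=rng)

if __name__ == "__main__":
    for size, value in curve.items():
        print(size, round(value, 2))
```

A curve that flattens well before the full sample size suggests that additional sampling adds few new alleles, whereas a curve that is still rising at the largest size suggests that rare alleles remain unobserved.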

References

1. Gettings KB, Borsuk LA, Steffen CR, Kiesler KM, Vallone PM, Sequence-based U.S. Population data for 27 autosomal STR loci, Forensic Sci. Int. Genet. 37 (2018) 106–115, 10.1016/j.fsigen.2018.07.013.
2. Chakraborty R, Sample size requirements for addressing the population genetic issues of forensic use of DNA typing, Hum. Biol. 64 (2) (1992) 141–159.
3. National Research Council. (1992). DNA Technology in Forensic Science. Washington, DC: National Academies Press.
4. National Research Council. (1996). The Evaluation of Forensic DNA Evidence. Washington, DC: National Academies Press. 10.17226/5141.
5. Buckleton JS, Bright J, & Taylor D. (2016). Forensic DNA Evidence Interpretation (Second Edition). CRC Press.

Grants and funding

9999-NIST/ImNIST/Intramural NIST DOC/United States
