Mol Ecol Resour. 2025 Oct;25(7):e70011.

doi: 10.1111/1755-0998.70011. Epub 2025 Jul 11.

'Highly-Informative' Genetic Markers Can Bias Conclusions: Examples and General Solutions

Andy Lee¹, William Hemstrom¹,², Natalie Molea³,⁴, Gordon Luikart³,⁴, Mark R Christie¹,⁵

Affiliations

¹ Department of Biological Sciences, Purdue University, West Lafayette, Indiana, USA.
² Department of Biology, Colorado State University, Fort Collins, Colorado, USA.
³ Wildlife Biology Program, University of Montana, Missoula, Montana, USA.
⁴ Flathead Lake Biological Station, University of Montana, Polson, Montana, USA.
⁵ Department of Forestry and Natural Resources, Purdue University, West Lafayette, Indiana, USA.

PMID: 40641441
PMCID: PMC12415817
DOI: 10.1111/1755-0998.70011

Andy Lee et al. Mol Ecol Resour. 2025 Oct.

Abstract

High-grading bias is the overestimation of power in a subset of loci caused by model overfitting. Using both empirical and simulated datasets, we show that high-grading bias can cause severe overestimation of population structure, and thus mislead investigators, whenever highly informative or high-F_ST markers are chosen (i.e., ascertained) and used for subsequent assessments, a common practice in population genetic studies. This problem can occur in panmictic populations with no local adaptation. Biased results from choosing high-F_ST markers may have severe downstream implications for management and conservation, such as erroneous conservation unit delineation, which could squander limited conservation resources to protect incorrectly defined 'populations'. Furthermore, we caution that high-grading is not limited to F_ST approaches; high-grading bias is a concern whenever a small subset of markers is first chosen to explain differences among groups based on their degree of difference and is subsequently reused to estimate the degree of difference among those groups. For example, selecting high-F_ST loci for use in a GT-seq panel or using differentially expressed genes to plot sample membership in multivariate space can both result in spurious structure when none exists. We illustrate that using statistically based outlier tests in place of arbitrary F_ST cut-offs can reduce bias. Alternatively, permutation tests or cross-evaluation can be used to detect high-grading bias. We provide an R package, PCAssess, to help researchers detect and prevent high-grading bias in genetic datasets by automating permutation tests and principal component analyses (https://github.com/hemstrow/PCAssess).
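The effect described in the abstract, and the cross-evaluation remedy it mentions, can be illustrated with a small simulation. The following is a minimal sketch in Python, not the authors' code and not the PCAssess API: it assumes a single panmictic population, arbitrary group labels, biallelic loci, and a simple uncorrected Nei-style F_ST, (H_T - H_S) / H_T. All function names and parameter values are hypothetical choices made for illustration.

```python
import random

def locus_fst(genotypes, labels, locus, individuals):
    """Uncorrected Nei-style F_ST = (H_T - H_S) / H_T at one biallelic locus."""
    counts = {}  # group label -> [alternate allele count, total allele count]
    for i in individuals:
        c = counts.setdefault(labels[i], [0, 0])
        c[0] += genotypes[i][locus]
        c[1] += 2
    freqs = [alt / tot for alt, tot in counts.values()]
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # monomorphic locus carries no information
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t

def mean_fst(genotypes, labels, loci, individuals):
    return sum(locus_fst(genotypes, labels, l, individuals) for l in loci) / len(loci)

def top_loci(genotypes, labels, individuals, n_loci, fraction=0.05):
    """Rank loci by F_ST within `individuals` and keep the top fraction."""
    ranked = sorted(range(n_loci), reverse=True,
                    key=lambda l: locus_fst(genotypes, labels, l, individuals))
    return ranked[:max(1, int(n_loci * fraction))]

rng = random.Random(1)
n_ind, n_loci, n_groups = 80, 1000, 4  # illustrative sizes only

# One panmictic population: every individual draws from the same frequencies.
freqs = [rng.uniform(0.1, 0.9) for _ in range(n_loci)]
genotypes = [[(rng.random() < p) + (rng.random() < p) for p in freqs]
             for _ in range(n_ind)]

# Group labels are arbitrary, so any apparent structure is spurious.
labels = [i % n_groups for i in range(n_ind)]
rng.shuffle(labels)

everyone = list(range(n_ind))
all_loci = list(range(n_loci))

# High-grading: choose and evaluate loci on the same individuals.
chosen = top_loci(genotypes, labels, everyone, n_loci)
fst_all = mean_fst(genotypes, labels, all_loci, everyone)
fst_high_graded = mean_fst(genotypes, labels, chosen, everyone)

# Cross-evaluation: choose loci on one half, evaluate on the held-out half.
order = everyone[:]
rng.shuffle(order)
train, holdout = order[:n_ind // 2], order[n_ind // 2:]
chosen_train = top_loci(genotypes, labels, train, n_loci)
fst_train = mean_fst(genotypes, labels, chosen_train, train)
fst_holdout = mean_fst(genotypes, labels, chosen_train, holdout)

print(f"all loci: {fst_all:.4f}  high-graded: {fst_high_graded:.4f}")
print(f"chosen on training half: {fst_train:.4f}  same loci, held-out half: {fst_holdout:.4f}")
```

Because the labels carry no biological signal, the inflated F_ST of the chosen loci should not carry over to individuals that played no part in choosing them. A large gap between the training and held-out estimates is the signature of high-grading bias.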

Keywords: ecological genetics; genomics/proteomics; natural selection and contemporary evolution; population genetics—theoretical.


Conflict of interest statement

Benefits Sharing: This study provides methodology and addresses issues intended to help population geneticists and improve analytical approaches for the broader scientific field. All collaborators are included as co‐authors.

The authors declare no conflicts of interest.

Figures

**FIGURE 1**
High-grading bias in clustering algorithms and population assignment using empirical data from a single, panmictic population of monarch butterflies (i.e., no real population structure). We randomly assigned individuals to one of four artificially created populations and detected erroneous structure when the top 5% of SNPs with the highest F_ST values were used. Panels (A–C) show population structure analyses using all the SNPs found in this population (63,514 loci), and panels (D–F) show the same analyses using the highest-F_ST dataset, in which only the top 5% of SNPs with the highest F_ST were used. Panels (A) and (D) show results from PCA, panels (B) and (E) show results from STRUCTURE with k = 4, a burn-in of 20,000 and 100,000 MCMC iterations, and panels (C) and (F) show results from the ‘self_assign’ function in the Bayesian assignment R package Rubias with the default parameters.

**FIGURE 2**
Conceptual figure illustrating a permutation approach that can be used to quantify high-grading bias. In each permutation, the population IDs are randomly shuffled (without replacement), F_ST is recalculated, high-F_ST loci are chosen, and F-statistics are calculated. The empirically observed change in clustering due to high-grading (ΔF) is compared to a null distribution of ΔF from the 1000 permutations. A large ΔF represents a large observed increase in within-group clustering when the highest-F_ST loci are used, and a smaller or negative ΔF represents less of an increase or a decrease in clustering when the highest-F_ST loci are used. Comparing ΔF values between empirical datasets and those where population IDs are permuted thousands of times allows for the detection of high-grading bias.
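The permutation logic in this caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PCAssess implementation: here ΔF is stood in for by the mean F_ST of the top 5% of loci minus the mean F_ST of all loci, whereas the caption's ΔF is a change in clustering; F_ST is an uncorrected Nei-style estimate; the simulated data, the function names, and the reduced permutation count (100 rather than 1000, for speed) are all hypothetical.

```python
import random

def locus_fst(genotypes, labels, locus):
    """Uncorrected Nei-style F_ST = (H_T - H_S) / H_T at one biallelic locus."""
    counts = {}  # group label -> [alternate allele count, total allele count]
    for row, lab in zip(genotypes, labels):
        c = counts.setdefault(lab, [0, 0])
        c[0] += row[locus]
        c[1] += 2
    freqs = [alt / tot for alt, tot in counts.values()]
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t

def delta_f(genotypes, labels, fraction=0.05):
    """Stand-in for the caption's delta-F: mean F_ST of top loci minus mean over all loci."""
    n_loci = len(genotypes[0])
    fst = sorted((locus_fst(genotypes, labels, l) for l in range(n_loci)), reverse=True)
    top = fst[:max(1, int(n_loci * fraction))]
    return sum(top) / len(top) - sum(fst) / len(fst)

def permutation_test(genotypes, labels, n_perm, rng):
    """Compare the observed delta-F with a null built by shuffling population IDs."""
    observed = delta_f(genotypes, labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffle IDs without replacement
        null.append(delta_f(genotypes, shuffled))  # re-select top loci each time
    p_value = (1 + sum(d >= observed for d in null)) / (1 + n_perm)
    return observed, null, p_value

def simulate(n_per_group, n_groups, n_loci, n_adaptive, rng):
    """Shared allele frequencies, except the first n_adaptive loci diverge in group 0."""
    base = [rng.uniform(0.2, 0.8) for _ in range(n_loci)]
    genotypes, labels = [], []
    for g in range(n_groups):
        for _ in range(n_per_group):
            row = []
            for l, p in enumerate(base):
                if l < n_adaptive:
                    p = 0.9 if g == 0 else 0.1
                row.append((rng.random() < p) + (rng.random() < p))
            genotypes.append(row)
            labels.append(g)
    return genotypes, labels

rng = random.Random(7)

# Panmictic case: labels are arbitrary, so the observed delta-F is itself a null draw.
geno_pan, lab_pan = simulate(15, 4, 200, 0, rng)
obs_pan, null_pan, p_pan = permutation_test(geno_pan, lab_pan, 100, rng)

# Locally adapted case: 10 loci are truly divergent in one group.
geno_sel, lab_sel = simulate(15, 4, 200, 10, rng)
obs_sel, null_sel, p_sel = permutation_test(geno_sel, lab_sel, 100, rng)

print(f"panmictic: delta-F = {obs_pan:.3f}, p = {p_pan:.3f}")
print(f"local adaptation: delta-F = {obs_sel:.3f}, p = {p_sel:.3f}")
```

The key design point is that the high-F_ST loci are re-selected inside every permutation, so the null distribution already contains the inflation that selection alone produces. Only an observed ΔF beyond that inflation counts as evidence of real structure.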

**FIGURE 3**
Statistical detection of erroneous clustering due to high-grading bias when high-F_ST loci are used in PCAs using simulated data for three commonly observed types of population structure: panmictic (without selection), high gene flow (without selection), and high gene flow with local adaptation (with selection). Panels (A–C) show PCAs constructed from all SNPs, and panels (D–F) show PCAs constructed from the top 5% of high-F_ST SNPs. Note that all three scenarios show population structure in the latter case, even when none should exist (as in the panmictic scenario). Panels (G–I) depict the observed increase in clustering (ΔF) between the PCAs for all and top 5% SNPs (vertical red dashed line) alongside the expected null distribution of ΔF (solid black line) derived from permutations. A higher ΔF means a larger increase in clustering with high-F_ST loci (Figure 2). A high ΔF in the observed data relative to the null distribution implies that we can reject the null hypothesis that clustering increases no more than expected by chance alone when taking high-F_ST loci, as correctly seen only in the high gene flow with local adaptation scenario.

**FIGURE 4**
Reduction in high-grading bias on population structure analyses when using statistically identified outlier SNPs. We used simulated data of three types of population structure: (A) panmictic (without selection), (B) high gene flow (without selection), and (C) high gene flow with local adaptation (with selection). Panels (A–C) illustrate PCAs using all SNPs, panels (D–F) use only SNPs that were outliers found by PCAdapt, and panels (G–I) illustrate PCAs when only OutFLANK outliers were used. There were no shared SNPs between OutFLANK and PCAdapt; plots with no outlier loci identified are marked. Note that for the bottom-right plot, the small number of observed outliers from OutFLANK caused many points to be plotted on top of one another in the two-dimensional PCA, making interpretation challenging; the distribution of PC1 and PC2 scores per population is clearer and thus shown instead.

**FIGURE 5**
Management implications of high-grading bias in five sub-populations of pink salmon (*Oncorhynchus gorbuscha*). We use our R package PCAssess to plot PCAs using all available SNPs (left) and the top 5% of SNPs with the highest F_ST values among the sub-populations (right). We automate permutation testing to detect high-grading bias in the package PCAssess (bottom), which shows that the chosen subset of loci does not provide a statistically significant, biologically relevant increase in population structure and thus cannot reject the null hypothesis of high-grading bias (p > 0.57).



Grants and funding

Directorate for Biological Sciences
