Mol Ecol Resour. 2025 Oct;25(7):e70011.
doi: 10.1111/1755-0998.70011. Epub 2025 Jul 11.

'Highly-Informative' Genetic Markers Can Bias Conclusions: Examples and General Solutions

Andy Lee et al. Mol Ecol Resour. 2025 Oct.

Abstract

High-grading bias is the overestimation of power in a subset of loci caused by model overfitting. Using both empirical and simulated datasets, we show that high-grading bias can cause severe overestimation of population structure, and thus mislead investigators, whenever highly informative or high-FST markers are chosen (i.e., ascertained) and used for subsequent assessments, a common practice in population genetic studies. This problem can occur even in panmictic populations with no local adaptation. Biased results from choosing high-FST markers may have severe downstream implications for management and conservation, such as erroneous conservation unit delineation, which could squander limited conservation resources on protecting incorrectly defined 'populations'. Furthermore, we caution that high-grading is not limited to FST approaches; high-grading bias is a concern whenever a small subset of markers is first chosen to explain differences among groups based on their degree of difference and is subsequently reused to estimate the degree of difference among those same groups. For example, selecting high-FST loci for use in a GT-seq panel or using differentially expressed genes to plot sample membership in multivariate space can both result in spurious structure when none exists. We illustrate that using statistically based outlier tests in place of arbitrary FST cut-offs can reduce bias. Alternatively, permutation tests or cross-evaluation can be used to detect high-grading bias. We provide an R package, PCAssess, to help researchers detect and prevent high-grading bias in genetic datasets by automating permutation tests and principal component analyses (https://github.com/hemstrow/PCAssess).
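
The effect described in the abstract, and the cross-evaluation remedy it mentions, can be illustrated with a small simulation. The sketch below is not the authors' code and does not use PCAssess; it assumes a simplified Wright-style FST (variance of group allele frequencies over p(1 - p)), arbitrary sample sizes, and a hypothetical train/test split of individuals. It simulates one panmictic population, assigns individuals to four artificial groups at random, picks the top 5% of loci by FST in one half of the individuals, and then compares FST for those loci in the same half (naive) and in the held-out half (cross-evaluated).

```python
import random


def fst(locus, groups):
    """Simplified Wright-style FST for one biallelic locus (an assumption,
    not the estimator used in the paper).

    locus  : list of diploid genotypes coded 0, 1 or 2 (alternate-allele count)
    groups : list of lists of individual indices
    """
    freqs = [sum(locus[i] for i in g) / (2 * len(g)) for g in groups]
    p_bar = sum(freqs) / len(freqs)
    if p_bar <= 0.0 or p_bar >= 1.0:
        return 0.0  # monomorphic in this sample: nothing to measure
    variance = sum((p - p_bar) ** 2 for p in freqs) / len(freqs)
    return variance / (p_bar * (1.0 - p_bar))


def mean(values):
    return sum(values) / len(values)


rng = random.Random(1)
n_ind, n_loci, n_groups = 80, 2000, 4  # illustrative sizes, chosen arbitrarily

# One panmictic population: every individual draws both alleles at a locus
# from the same allele frequency, so there is no real structure.
loci = []
for _ in range(n_loci):
    p = rng.uniform(0.1, 0.9)
    loci.append([int(rng.random() < p) + int(rng.random() < p)
                 for _ in range(n_ind)])

# Randomly assign individuals to artificial groups, then split each group
# into a selection half (train) and a held-out half (test).
ids = list(range(n_ind))
rng.shuffle(ids)
groups = [ids[k::n_groups] for k in range(n_groups)]
train = [g[: len(g) // 2] for g in groups]
test = [g[len(g) // 2:] for g in groups]

fst_train = [fst(locus, train) for locus in loci]
fst_test = [fst(locus, test) for locus in loci]

# High-grading: keep the top 5% of loci ranked by FST in the selection half.
n_top = n_loci // 20
top = sorted(range(n_loci), key=lambda j: fst_train[j], reverse=True)[:n_top]

baseline_train = mean(fst_train)               # all loci, selection half
naive_top = mean([fst_train[j] for j in top])  # same data reused: biased
baseline_test = mean(fst_test)                 # all loci, held-out half
heldout_top = mean([fst_test[j] for j in top]) # cross-evaluated

print(f"all loci (selection half): {baseline_train:.4f}")
print(f"top 5%, reused data:       {naive_top:.4f}")
print(f"all loci (held-out half):  {baseline_test:.4f}")
print(f"top 5%, held-out data:     {heldout_top:.4f}")
```

In this sketch the selected loci look strongly differentiated only in the individuals that were used to select them. In the held-out individuals their mean FST falls back to roughly the all-locus baseline, which is the signature of high-grading bias rather than real structure.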

Keywords: ecological genetics; genomics/proteomics; natural selection and contemporary evolution; population genetics—theoretical.

Conflict of interest statement

Benefits Sharing: This study provides methodology and addresses issues intended to help population geneticists and improve analytical approaches for the broader scientific field. All collaborators are included as co‐authors.

The authors declare no conflicts of interest.

Figures

FIGURE 1
High‐grading bias in clustering algorithms and population assignment using empirical data from a single, panmictic population of monarch butterflies (i.e., no real population structure). We randomly assigned individuals to one of four artificially created populations and detected erroneous structure when the top 5% of SNPs with the highest FST values were used. Panels (A–C) show population structure analyses using all the SNPs found in this population (63,514 loci), and panels (D–F) show the same analyses using the highest‐FST dataset, in which only the top 5% of SNPs with the highest FST were used. Panels (A) and (D) illustrate results using PCA, panels (B) and (E) illustrate results using STRUCTURE with k = 4, 20,000 burn‐in iterations, and 100,000 MCMC iterations, and panels (C) and (F) illustrate results using the ‘self_assign’ function in the Bayesian assignment R package Rubias with the default parameters.
FIGURE 2
Conceptual figure illustrating a permutation approach that can be used to quantify high‐grading bias. In each permutation, the population IDs are randomly shuffled (without replacement), FST recalculated, high‐FST loci chosen, and F‐statistics calculated. The empirically observed change in clustering due to high‐grading (ΔF) is compared to a null distribution of ΔF from the 1000 permutations. A large ΔF represents a large observed increase in within‐group clustering when the highest‐FST loci are used, and a smaller or negative ΔF represents less of an increase or a decrease in clustering when the highest‐FST loci are used. Comparing ΔF values between empirical datasets and those where population IDs are permuted thousands of times allows for the detection of high‐grading bias.
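
The permutation loop described in this caption can be sketched as follows. This is a simplified stand-in under stated assumptions, not the PCAssess implementation: the clustering statistic ΔF used in the paper is replaced here by a hypothetical proxy (mean FST of the top 5% of loci minus mean FST of all loci), FST is a simplified Wright-style estimator, and the number of permutations is reduced from 1000 to 200 to keep the example fast.

```python
import random


def fst(locus, groups):
    """Simplified Wright-style FST for one biallelic locus (an assumption)."""
    freqs = [sum(locus[i] for i in g) / (2 * len(g)) for g in groups]
    p_bar = sum(freqs) / len(freqs)
    if p_bar <= 0.0 or p_bar >= 1.0:
        return 0.0
    variance = sum((p - p_bar) ** 2 for p in freqs) / len(freqs)
    return variance / (p_bar * (1.0 - p_bar))


def delta(loci, groups, top_frac=0.05):
    """Hypothetical proxy for the caption's delta-F: mean FST of the
    highest-FST loci minus mean FST of all loci."""
    values = sorted((fst(locus, groups) for locus in loci), reverse=True)
    n_top = max(1, int(len(values) * top_frac))
    return sum(values[:n_top]) / n_top - sum(values) / len(values)


def permutation_test(loci, groups, n_perm, rng):
    """Compare the observed delta with a null built by shuffling group IDs."""
    observed = delta(loci, groups)
    ids = [i for g in groups for i in g]
    sizes = [len(g) for g in groups]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(ids)  # reassign individuals to groups without replacement
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(ids[start:start + size])
            start += size
        # FST is recalculated and top loci re-chosen inside delta().
        if delta(loci, shuffled) >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)


rng = random.Random(7)
n_ind, n_loci, n_groups = 40, 500, 4  # illustrative sizes, chosen arbitrarily

# Panmictic test data: one shared allele frequency per locus.
loci = []
for _ in range(n_loci):
    p = rng.uniform(0.1, 0.9)
    loci.append([int(rng.random() < p) + int(rng.random() < p)
                 for _ in range(n_ind)])

ids = list(range(n_ind))
rng.shuffle(ids)
groups = [ids[k::n_groups] for k in range(n_groups)]

observed, p_value = permutation_test(loci, groups, n_perm=200, rng=rng)
print(f"observed delta = {observed:.4f}, permutation p = {p_value:.3f}")
```

A small p-value would indicate that the gain from selecting top loci exceeds what shuffled labels produce. For panmictic data such as the input simulated here, the observed value is expected to fall within the null distribution.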
FIGURE 3
Statistical detection of erroneous clustering due to high‐grading bias when high‐FST loci are used in PCAs using simulated data for three commonly observed types of population structure: panmictic (without selection), high gene flow (without selection), and high gene flow with local adaptation (with selection). Panels (A–C) show PCAs constructed from all SNPs, and panels (D–F) show PCAs constructed from the top 5% of high‐FST SNPs. Note that all three scenarios show population structure in the latter case, even when none should exist (as in the panmictic scenario). Panels (G–I) depict the observed increase in clustering (ΔF) between the PCAs for all and top 5% SNPs (vertical red dashed line) alongside the expected null distribution of ΔF (solid black line) derived from permutations. Higher ΔF means a higher increase in clustering with high‐FST loci (Figure 2). High ΔF in the observed data relative to the null distribution implies that the null hypothesis that clustering increases no more than expected by chance alone when taking high‐FST loci can be rejected, as correctly seen only in the high gene flow with local adaptation scenario.
FIGURE 4
Reduction in high‐grading bias on population structure analyses when using statistically identified outlier SNPs. We used simulated data of three types of population structure: (A) panmictic (without selection), (B) high gene flow (without selection), and (C) high gene flow with local adaptation (with selection). Panels (A–C) illustrate PCAs using all SNPs, panels (D–F) use only SNPs that were outliers found by PCAdapt, and panels (G–I) illustrate PCAs when only OutFLANK outliers were used. There were no shared SNPs between OutFLANK and PCAdapt; plots with no outlier loci identified are marked. Note that for the bottom‐right plot, the small number of observed outliers from OutFLANK caused many points to be plotted atop one another in the two‐dimensional PCA, making interpretation challenging; the distribution of PC1 and PC2 scores per population is clearer and thus shown instead.
FIGURE 5
Management implications of high‐grading bias in five sub‐populations of pink salmon (Oncorhynchus gorbuscha). We use our R package PCAssess to plot PCAs using all available SNPs (left) and the top 5% SNPs with the highest FST values among the population (right). We automate permutation testing to detect high‐grading bias in the package PCAssess (bottom), which shows that the chosen subset of loci does not provide a statistically significant, biologically relevant increase in population structure and thus cannot reject the null hypothesis of high‐grading bias (p > 0.57).
