Mol Ecol Resour. 2022 Feb;22(2):503-518.
doi: 10.1111/1755-0998.13482. Epub 2021 Sep 7.

Pseudoreplication in genomic-scale data sets

Robin S Waples et al. Mol Ecol Resour. 2022 Feb.

Abstract

In genomic-scale data sets, loci are closely packed within chromosomes and hence provide correlated information. Averaging across loci as if they were independent creates pseudoreplication, which reduces the effective degrees of freedom (df') compared to the nominal degrees of freedom, df. This issue has been known for some time, but consequences have not been systematically quantified across the entire genome. Here, we measured pseudoreplication (quantified by the ratio df'/df) for a common metric of genetic differentiation (FST) and a common measure of linkage disequilibrium between pairs of loci (r2). Based on data simulated using models (SLiM and msprime) that allow efficient forward-in-time and coalescent simulations while precisely controlling population pedigrees, we estimated df' and df'/df by measuring the rate of decline in the variance of mean FST and mean r2 as more loci were used. For both indices, df' increases with effective population size (Ne) and genome size, as expected. However, even for large Ne and large genomes, df' for mean r2 plateaus after a few thousand loci, and a variance components analysis indicates that the limiting factor is uncertainty associated with sampling individuals rather than genes. Pseudoreplication is less extreme for FST, but df'/df ≤ 0.01 can occur in data sets using tens of thousands of loci. Commonly used block-jackknife methods consistently overestimated var(FST), producing very conservative confidence intervals. Predicting df' based on our modelling results as a function of Ne, the number of loci (L), the sample size of individuals (S), and genome size provides a robust way to quantify precision associated with genomic-scale data sets.

Keywords: FST; Ne; degrees of freedom; genome size; jackknife variance; linkage disequilibrium; simulations.
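
The variance-decline idea described in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' SLiM/msprime pipeline. It assumes a hypothetical equicorrelated model in which every pair of loci shares the same correlation rho, and it expresses the effective number of loci as L' = (per-locus variance) / (variance of the L-locus mean), which equals L when loci are independent. All function names and parameter values are illustrative assumptions.

```python
import random
import statistics


def simulate_means(n_loci, n_reps, rho, rng):
    """Replicate means over n_loci unit-variance values with pairwise correlation rho.

    Toy model (an assumption, not from the paper): each value is a shared
    component plus independent noise, so every pair of loci has correlation rho.
    """
    means = []
    for _ in range(n_reps):
        shared = rng.gauss(0.0, 1.0)
        values = [
            (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * rng.gauss(0.0, 1.0)
            for _ in range(n_loci)
        ]
        means.append(statistics.fmean(values))
    return means


def effective_loci(per_locus_var, var_of_mean):
    """L' such that var_of_mean = per_locus_var / L'."""
    return per_locus_var / var_of_mean


def expected_effective_loci(n_loci, rho):
    """Closed form for the toy model: Var(mean) = (1 + (L - 1) * rho) / L."""
    return n_loci / (1.0 + (n_loci - 1) * rho)


if __name__ == "__main__":
    rng = random.Random(1)
    rho = 0.01  # hypothetical pairwise correlation among loci
    for n_loci in (10, 100, 500):
        means = simulate_means(n_loci, 1000, rho, rng)
        l_eff = effective_loci(1.0, statistics.variance(means))
        print(n_loci, round(l_eff, 1), round(expected_effective_loci(n_loci, rho), 1))
```

In this toy model L' can never exceed 1/rho however many loci are added, which mirrors the plateau the abstract reports for mean r2. Real genomes have correlations that depend on linkage and pedigree, so the closed form here should not be read as a prediction for actual data.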

Figures

Figure 1.
Experimental design for simulations. For each evolutionary scenario (combination of Ne and genome size), four ancestral populations (AP1–AP4) were simulated for 10Ne generations to ensure coalescence, at which point each ancestral population split into four daughter populations (D1–D4). The 4 × 4 = 16 daughter populations then evolved independently under isolation for t = 0.2Ne generations. Subsequently, the model differed slightly for the FST and LD analyses. For LD (as depicted in the figure), eight mutational replicates (different sets of loci) were generated for each daughter population based on the same pedigree, producing a total of 128 replicates for each evolutionary scenario. For FST, each set of four daughter populations allowed six pairwise comparisons of populations, and for each two-population pedigree six mutational replicates were generated (Figure S4).
Figure 2.
Effective number of loci (L’) for mean r2 as a function of the number of diallelic (SNP) loci, L. Top: Influence of number of chromosomes (C), with Ne = 200 and S = 50. Bottom: Influence of Ne, with C = 16 and S = 25. Mean r2 was calculated across all n = L(L-1)/2 pairs of loci. Figure S8 (Supplementary Information) shows these same results except the Y axis is plotted as the effective number of locus pairs (n’).
Figure 3.
Effective number of loci (L’) for mean r2 as a function of the sample size of individuals (S = 25-800) and the number of diallelic (SNP) loci, L. Results are for Ne = 800, C = 16, and using all pairwise comparisons of loci.
Figure 4.
Variance components analysis for mean r2. As depicted in Table 3, V1 is the variance of mean r2 for the same individuals assayed for different, non-overlapping sets of loci, and V2 is the variance of mean r2 for different (potentially overlapping) sets of individuals assayed for the same loci. “Sum” = V1+V2 and “Observed” is the total observed variance of mean r2. Results are for Ne = 200, C = 16, S = 50, and using all pairwise comparisons of loci.
Figure 5.
Comparison of parametric and actual 90% confidence intervals for N̂e based on LD (top) and F̂ST(L)Hudson (bottom). Parametric CIs use the nominal degrees of freedom (L = the number of diallelic (SNP) loci for F̂ST(L)Hudson; n = L(L-1)/2 for LD); actual CIs use the effective degrees of freedom calculated in this study (L’ and n’). Results are for simulations with Ne = 200, C = 16, and S = 50. Note the different X-axis scales in the two panels.
Figure 6.
Rate of decline in the variance of multilocus F̂ST(L) as more diallelic loci (SNPs, L) were used in the analysis. Results are for simulations with Ne = 200, C = 4, and S = 25 and are shown for the estimators of Nei (F̂STNei) and Hudson (F̂STHudson). Figure S15 shows comparable results for another scenario with different values of Ne, C, and S.
Figure 7.
Influence of Ne (top panel, with number of chromosomes, C, fixed at 16) and C (bottom panel, with Ne fixed at 200) on the effective degrees of freedom (L’) for F̂ST(L)Nei computed between pairs of populations. The black dotted line represents L’ = L = the number of SNPs.
Figure 8.
Effects of pedigree on variation in mean F̂ST(L)Hudson. For each of two 2-population pedigrees, 8 replicate samples (demarcated by vertical lines) were taken of S = 100 individuals. These results are for simulations with Ne = 200 and 4 chromosomes. Sampled individuals were drawn hypergeometrically from the Ne individuals in the final generation. For each sample, six mutational replicates generated non-overlapping sets of L = 5000 SNP loci that were used to compute mean F̂ST(L). Solid horizontal lines (“Pedigree FST”) represent mean F̂STHudson across all 8 × 6 = 48 replicates within each pedigree. The first set of samples shows results for comparison of daughter populations 1 and 2 and the second set of samples shows results for comparison of daughter populations 3 and 4, all derived from the same ancestral population.
Figure 9.
Coverage of 90% confidence intervals (CIs) around FST estimators for the population pedigrees and samples shown in Figure 8 (Ne = 200; C = 4; L = 5000; S = 100). Top: CIs generated from block-jackknife estimates of var(F̂STHudson). Bottom: CIs generated based on L’ for F̂STNei estimated from this study. CI coverage is evaluated with respect to mean F̂STHudson or mean F̂STNei across all replicates within each pedigree (“Pedigree FST”, horizontal lines). The black X symbols indicate an upper (or lower) bound that was below (or above) the mean pedigree FST.
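
The two kinds of interval compared in Figures 5 and 9 can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation. It assumes a ratio-of-sums estimator (per-locus numerators and denominators, as is common for multilocus FST), a delete-one-block jackknife over roughly equal contiguous blocks of loci, and a simple normal approximation for the interval, which may differ from the parametric intervals used in the paper. All function names and inputs are hypothetical.

```python
from statistics import NormalDist, fmean


def block_jackknife(num, den, n_blocks):
    """Delete-one-block jackknife for the estimator sum(num) / sum(den).

    Loci are split into n_blocks contiguous blocks of roughly equal size.
    Returns (full-data estimate, jackknife variance of the estimate).
    """
    n = len(num)
    if n != len(den) or not 2 <= n_blocks <= n:
        raise ValueError("need equal-length inputs and 2 <= n_blocks <= n loci")
    bounds = [i * n // n_blocks for i in range(n_blocks + 1)]
    total_num, total_den = sum(num), sum(den)
    leave_one_out = []
    for b in range(n_blocks):
        lo, hi = bounds[b], bounds[b + 1]
        leave_one_out.append(
            (total_num - sum(num[lo:hi])) / (total_den - sum(den[lo:hi]))
        )
    centre = fmean(leave_one_out)
    # Standard equal-block formula: (g - 1) / g * sum of squared deviations.
    variance = (n_blocks - 1) / n_blocks * sum(
        (x - centre) ** 2 for x in leave_one_out
    )
    return total_num / total_den, variance


def variance_from_effective_loci(per_locus_var, effective_loci):
    """Variance of a multilocus mean if only effective_loci are independent."""
    return per_locus_var / effective_loci


def normal_ci(estimate, variance, level=0.90):
    """Two-sided normal-approximation CI (an assumption of this sketch)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * variance ** 0.5
    return estimate - half_width, estimate + half_width


if __name__ == "__main__":
    # Hypothetical per-locus numerators and denominators for four loci.
    num = [1.0, 2.0, 3.0, 4.0]
    den = [10.0, 10.0, 10.0, 10.0]
    estimate, jack_var = block_jackknife(num, den, n_blocks=4)
    print(estimate, normal_ci(estimate, jack_var))
    # Same estimate, variance taken from nominal L versus a smaller L'.
    print(normal_ci(estimate, variance_from_effective_loci(0.02, 5000)))
    print(normal_ci(estimate, variance_from_effective_loci(0.02, 50)))
```

Replacing the nominal number of loci with a smaller effective number widens the interval by a factor of sqrt(L / L'), which is the direction of the correction the captions describe. The jackknife result, by contrast, depends on how blocks are drawn, so block size is a design choice that the sketch leaves to the caller.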
