Genetics. 2025 Aug 7:iyaf158.
doi: 10.1093/genetics/iyaf158. Online ahead of print.

Clade Distillation for Genome-wide Association Studies

Ryan Christ et al. Genetics.

Abstract

Testing inferred haplotype genealogies for association with phenotypes has been a longstanding goal in human genetics given their potential to detect association signals driven by allelic heterogeneity - when multiple causal variants modulate a phenotype - in both coding and noncoding regions. Recent scalable methods for inferring locus-specific genealogical trees along the genome, or representations thereof, have made substantial progress towards this goal; however, the problem of testing these trees for association with phenotypes has remained unsolved due to the growth in the number of clades with increasing sample size. To address this issue, we introduce several practical improvements to the kalis ancestry inference engine, including a general optimal checkpointing algorithm for decoding hidden Markov models, thereby enabling efficient genome-wide analyses. We then propose LOCATER, a powerful new procedure based on the recently proposed Stable Distillation framework, to test local tree representations for trait association. Although LOCATER is demonstrated here in conjunction with kalis, it may be used for testing output from any ancestry inference engine, regardless of whether such engines return discrete tree structures, relatedness matrices, or some combination of the two at each locus. Using simulated quantitative phenotypes, our results indicate that LOCATER achieves substantial power gains over traditional single marker testing, ARG-Needle, and window-based testing in cases of allelic heterogeneity, while also improving causal region localization. These findings suggest that genealogy-based association testing will be a fruitful approach for gene discovery, especially for signals driven by multiple ultra-rare variants.
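The checkpointing idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the optimal checkpointing algorithm the authors describe, and it does not use the kalis API; it is a plain, evenly spaced checkpointing scheme for a generic HMM forward pass, with all function names (`forward_step`, `checkpointed_forward`, `forward_at`) being hypothetical. It shows the basic trade: store only a few forward vectors, then recompute the vector at any locus from the nearest earlier checkpoint.

```python
import math


def forward_step(alpha, trans, emit_col):
    """One normalized forward update for a discrete HMM.

    alpha: forward probabilities at the previous locus (list of floats)
    trans: trans[i][j] = P(state j at t | state i at t-1)
    emit_col: emit_col[j] = P(observation at t | state j)
    """
    k = len(alpha)
    new = [emit_col[j] * sum(alpha[i] * trans[i][j] for i in range(k))
           for j in range(k)]
    total = sum(new)
    return [v / total for v in new]


def checkpointed_forward(init, trans, emits, n_checkpoints):
    """Run the forward pass, keeping only evenly spaced checkpoints.

    Assumption: uniform spacing, chosen for simplicity; it is NOT the
    optimal schedule referred to in the abstract.
    Returns a dict mapping locus index -> normalized forward vector.
    """
    n_loci = len(emits)
    stride = max(1, math.ceil(n_loci / n_checkpoints))
    alpha = [init[j] * emits[0][j] for j in range(len(init))]
    total = sum(alpha)
    alpha = [v / total for v in alpha]
    checkpoints = {0: alpha}
    for t in range(1, n_loci):
        alpha = forward_step(alpha, trans, emits[t])
        if t % stride == 0:
            checkpoints[t] = alpha
    return checkpoints


def forward_at(target, checkpoints, trans, emits):
    """Recompute the forward vector at `target` from the nearest earlier checkpoint."""
    start = max(t for t in checkpoints if t <= target)
    alpha = checkpoints[start]
    for t in range(start + 1, target + 1):
        alpha = forward_step(alpha, trans, emits[t])
    return alpha
```

With `n_checkpoints` vectors stored, recovering the forward vector at any locus costs at most one stride of recomputation, which is the memory-for-time exchange that makes repeated decoding along a long chromosome feasible.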

Keywords: ancestral recombination graph; checkpointing; quadratic form; stable distillation.

Conflict of interest statement

Conflicts of interest

None declared.

Figures

Fig. 1.
The LOCATER Pipeline. We begin ancestry-based association testing with a set of putatively interesting target loci, typically identified via single marker testing, indexed $\ell \in \{1, \ldots, L\}$. At each target locus $\ell$, we extract the genotype vector $G^{(\ell)} \in \{0,1,2\}^{n}$ and use an ancestry inference engine to infer local clade genotypes $X^{(\ell)} \in \{0,1,2\}^{n \times p}$ and/or a local relatedness matrix $\Omega^{(\ell)} \in \mathbb{R}^{n \times n}$. We then use LOCATER to calculate three P-values testing whether $G^{(\ell)}$, $X^{(\ell)}$, or $\Omega^{(\ell)}$ predict the phenotype, respectively. These three P-values are guaranteed to be independent under the null hypothesis, so they may be combined by many methods; in this paper we propose and use MSSE (see Methods) to obtain a combined ancestry-association P-value $p_C^{(\ell)}$ at each target locus.
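Because the three P-values are independent under the null, any standard combination rule for independent tests applies. The sketch below uses Fisher's method purely as a familiar stand-in; it is not MSSE, whose definition is given in the paper's Methods and is not reproduced here. The function name `fisher_combine` and the example variable names are hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent P-values with Fisher's method (a stand-in, not MSSE).

    Under the null, -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom. For even degrees of freedom the survival function
    has the closed form exp(-h) * sum_{i=0}^{k-1} h^i / i!, where h is half
    the test statistic.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical P-values for the genotype, clade, and relatedness tests at one locus.
p_genotype, p_clades, p_relatedness = 1e-3, 2e-2, 0.4
p_combined = fisher_combine([p_genotype, p_clades, p_relatedness])
```

Fisher's method weights all three tests equally; a different rule may be preferable when the signal is expected to concentrate in only one of them.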
Fig. 2.
Dotplot of total association signal strength required to achieve 80% power (lower is better) under various simulation conditions where all causal variants were observed. Total association signal strength is the $-\log_{10}$ P-value that one would obtain by testing the simulated phenotype Y with an oracle ANOVA model that “knows” the causal variants and targets only those for testing. Causal variant # denotes the number of simulated causal variants. Causal variant type “any” means any variant could be causal; “doubletons” means only doubletons could be causal; “DAC [150,750]” means only variants with a derived allele count in [150, 750], corresponding to a derived allele frequency in [0.0025, 0.0125), could be causal.
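The two quantities used in these captions are simple to compute, as the short sketch below shows. The haplotype count of 60,000 is an assumption inferred from the caption's own numbers (150 / 0.0025 = 60,000); it is not stated on this page. The helper names are hypothetical.

```python
import math

# Assumption: implied by the caption (DAC 150 <-> frequency 0.0025), not stated in the source.
N_HAPLOTYPES = 60_000


def derived_allele_frequency(dac, n_haplotypes=N_HAPLOTYPES):
    """Convert a derived allele count to a derived allele frequency."""
    return dac / n_haplotypes


def signal_strength(p_value):
    """Association signal strength as used in the captions: -log10 of a P-value."""
    return -math.log10(p_value)


low = derived_allele_frequency(150)   # lower bound of the DAC range
high = derived_allele_frequency(750)  # upper bound of the DAC range
```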
Fig. 3.
Dotplot of total association signal strength required to achieve 80% power (lower is better) under various simulation conditions where all causal variants were hidden. Total association signal strength is the $-\log_{10}$ P-value that one would obtain by testing the simulated phenotype Y with an oracle ANOVA model that “knows” the causal variants and targets only those for testing. Causal variant # denotes the number of simulated causal variants. Causal variant type “any” means any variant could be causal; “doubletons” means only doubletons could be causal; “DAC [150,750]” means only variants with a derived allele count in [150, 750], corresponding to a derived allele frequency in [0.0025, 0.0125), could be causal.
Fig. 4.
Dotplot of total association signal strength required to achieve 80% power (lower is better) under various simulation conditions where all causal variants were observed, including comparison to oracle ACAT-O methods that are given the causal variant window. ACAT-O (rare) only tests variants with MAF < 0.01, whereas ACAT-O (all) tests all variants within the causal window. Total association signal strength is the $-\log_{10}$ P-value that one would obtain by testing the simulated phenotype Y with an oracle ANOVA model that “knows” the causal variants and targets only those for testing.
