Fast and powerful conditional randomization testing via distillation
- PMID: 37416628
- PMCID: PMC10323874
- DOI: 10.1093/biomet/asab039
Abstract
We consider the problem of conditional independence testing: given a response Y and covariates (X, Z), we test the null hypothesis that Y ⫫ X | Z. The conditional randomization test was recently proposed as a way to use distributional information about X | Z to exactly and nonasymptotically control Type-I error using any test statistic in any dimensionality without assuming anything about Y | (X, Z). This flexibility, in principle, allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the conditional randomization test is prohibitively expensive computationally, especially under multiple testing, because the test statistic must be recomputed many times on resampled data. We propose the distilled conditional randomization test, a novel approach to using state-of-the-art machine learning algorithms in the conditional randomization test while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the conditional randomization test's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks, such as screening and recycling computations, to further speed up the conditional randomization test without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test with power similar to the most powerful existing conditional randomization test implementations, but requiring orders of magnitude less computation, making it a practical tool even for large datasets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.
Keywords: Conditional independence test; Conditional randomization test; High-dimensional inference; Machine learning; Model-X.
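
The abstract describes the mechanism only in words. As a rough illustration, not the authors' implementation, the Python sketch below contrasts a vanilla conditional randomization test, which recomputes a possibly expensive test statistic on every resample of X, with a distilled variant in which the expensive regression of the response on Z is fitted once and only a cheap statistic is recomputed per resample. The helpers fit_stat, distill_y, mean_x_given_z and sample_x_given_z are hypothetical placeholders for a user-supplied test statistic, a regression of y on z, and the (assumed known) model-X conditional of X given Z.

import numpy as np

def crt_pvalue(y, x, z, fit_stat, sample_x_given_z, B=500, rng=None):
    """Vanilla conditional randomization test.

    fit_stat(y, x, z) returns a scalar statistic (larger = more evidence
    against conditional independence); sample_x_given_z(z, rng) draws a
    fresh copy of x from the assumed-known conditional law X | Z.
    """
    rng = np.random.default_rng(rng)
    t_obs = fit_stat(y, x, z)
    # Each resample reruns the full (possibly expensive) test statistic.
    t_null = np.array([fit_stat(y, sample_x_given_z(z, rng), z)
                       for _ in range(B)])
    # Finite-sample valid p-value: (1 + #{null >= observed}) / (B + 1).
    return (1 + np.sum(t_null >= t_obs)) / (B + 1)

def distilled_crt_pvalue(y, x, z, distill_y, mean_x_given_z,
                         sample_x_given_z, B=500, rng=None):
    """Sketch of a distilled CRT: the costly model of y on z is fitted
    once, and each resample only recomputes a cheap statistic."""
    rng = np.random.default_rng(rng)
    y_res = y - distill_y(y, z)       # expensive fit, done a single time
    x_res = x - mean_x_given_z(z)
    t_obs = abs(np.dot(y_res, x_res))
    t_null = np.array([abs(np.dot(y_res,
                                  sample_x_given_z(z, rng) - mean_x_given_z(z)))
                       for _ in range(B)])
    return (1 + np.sum(t_null >= t_obs)) / (B + 1)

Under these assumptions, the distilled version reduces the number of expensive model fits per hypothesis from B to one, which is the kind of computational saving the abstract describes.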
