OASIS: An interpretable, finite-sample valid alternative to Pearson's X2 for scientific discovery
- PMID: 38564640
- PMCID: PMC11009617
- DOI: 10.1073/pnas.2304671121
Abstract
Contingency tables, data represented as count matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic that is linear in the normalized data matrix, providing closed-form P-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's P-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data such as single-cell RNA sequencing: under accepted noise models, OASIS provides good control of the false discovery rate, while Pearson's X2 consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's X2 in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
Keywords: computational genomics; contingency table; finite-sample P-value; reference genome free inference.
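To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. `pearson_x2` computes the classical Pearson X2 statistic for a table of counts. `linear_statistic` illustrates what "linear in the normalized data matrix" could look like, under the assumption that the table is centered by its expected counts under independence and each column is scaled by the inverse square root of its total. The function names, the exact normalization, and the weight vectors `f` (rows) and `c` (columns) are hypothetical choices for illustration; the abstract does not specify them, and no P-value bound is computed here.

```python
from typing import List, Sequence

Table = Sequence[Sequence[float]]


def expected_counts(table: Table) -> List[List[float]]:
    """Expected counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_x2(table: Table) -> float:
    """Classical Pearson X2: sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(table, expected)
        for o, e in zip(row_o, row_e)
        if e > 0  # skip cells whose row or column is empty
    )


def linear_statistic(table: Table, f: Sequence[float], c: Sequence[float]) -> float:
    """Illustrative statistic f^T X_tilde c (assumed form, see lead-in).

    X_tilde[i][j] = (observed - expected) / sqrt(column total j).
    The result is linear in X_tilde for fixed weights f and c.
    """
    expected = expected_counts(table)
    col_totals = [sum(col) for col in zip(*table)]
    s = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if col_totals[j] > 0:
                centered = observed - expected[i][j]
                s += f[i] * centered / col_totals[j] ** 0.5 * c[j]
    return s


if __name__ == "__main__":
    counts = [[10, 20], [30, 40]]
    print("Pearson X2:", pearson_x2(counts))
    print("Linear statistic:", linear_statistic(counts, f=[1, 0], c=[1, -1]))
```

Because the second statistic is a weighted sum of centered counts, concentration inequalities for sums of bounded random variables are the natural tool for bounding its tail, which is the kind of closed-form bound the abstract refers to. Both statistics are exactly zero when the table's columns are proportional to one another.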
Conflict of interest statement
Competing interests statement: T.Z.B. and J.S. have submitted a provisional patent no. 63/366,444 relating to this work.
Update of
- OASIS: An interpretable, finite-sample valid alternative to Pearson's X2 for scientific discovery. bioRxiv [Preprint]. 2023 Nov 3:2023.03.16.533008. doi: 10.1101/2023.03.16.533008. Update in: Proc Natl Acad Sci U S A. 2024 Apr 9;121(15):e2304671121. doi: 10.1073/pnas.2304671121. PMID: 37961606.