Proc Natl Acad Sci U S A. 2024 Apr 9;121(15):e2304671121.
doi: 10.1073/pnas.2304671121. Epub 2024 Apr 2.

OASIS: An interpretable, finite-sample valid alternative to Pearson's $X^2$ for scientific discovery

Tavor Z Baharav et al. Proc Natl Acad Sci U S A.

Abstract

Contingency tables, data represented as counts matrices, are ubiquitous across quantitative research and data-science applications. However, existing statistical tests are insufficient, as none are simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference [K. Chaung et al., Cell 186, 5440-5456 (2023)], we develop Optimized Adaptive Statistic for Inferring Structure (OASIS), a family of statistical tests for contingency tables. OASIS constructs a test statistic which is linear in the normalized data matrix, providing closed-form P-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic's P-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. Using OASIS, we develop a method that can detect SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which existing approaches cannot achieve. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature in genomic data like single-cell RNA sequencing, where under accepted noise models OASIS provides good control of the false discovery rate, while Pearson's $X^2$ consistently rejects the null. Additionally, we show in simulations that OASIS is more powerful than Pearson's $X^2$ in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
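As a rough illustration of what a statistic "linear in the normalized data matrix" can look like, the Python sketch below centers a counts table by its expected values under row/column independence, scales each column by the square root of its total, and evaluates a bilinear form with a row vector f and a column vector c. The specific centering and scaling are assumptions made for illustration, since the paper's own equations are not reproduced on this page. The function name `linear_statistic` and the toy values are hypothetical.

```python
from math import sqrt

def linear_statistic(X, f, c):
    """Bilinear statistic f^T Xtilde c for a counts table X (a list of rows).

    Assumed normalization (illustrative only, not taken from the paper):
    Xtilde[i][j] = (X[i][j] - row_i * col_j / M) / sqrt(col_j).
    Assumes every column has at least one count.
    """
    row_tot = [sum(row) for row in X]
    col_tot = [sum(col) for col in zip(*X)]
    M = sum(row_tot)
    S = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            expected = row_tot[i] * col_tot[j] / M
            S += f[i] * (x - expected) / sqrt(col_tot[j]) * c[j]
    return S

# Toy table: columns 0-1 favor row 0, columns 2-3 favor row 1.
X = [[9, 8, 1, 2],
     [1, 2, 9, 8]]
f = [1.0, 0.0]                # row vector picking out row 0
c = [1.0, 1.0, -1.0, -1.0]    # column vector contrasting the two groups
print(linear_statistic(X, f, c))
```

Because the statistic is a sum of terms that are each linear in the centered counts, a table whose rows are exactly proportional (no row/column dependence) gives a value of zero for any choice of f and c.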

Keywords: computational genomics; contingency table; finite-sample P-value; reference genome free inference.

Conflict of interest statement

Competing interests statement: T.Z.B. and J.S. have submitted a provisional patent no. 63/366,444 relating to this work.

Figures

Fig. 1.
Comparison of OASIS and Pearson’s $X^2$ test for an input matrix $X \in \mathbb{N}^{I \times J}$. (A) OASIS computes a matrix of residuals $\tilde{X}$ as in Eq. 2. Row (column) embeddings $f \in \mathbb{R}^I$ ($c \in \mathbb{R}^J$) are generated by one of several options. These vectors are used to compute the OASIS test statistic S in Eq. 1, which admits a finite-sample P-value bound using classical concentration inequalities. (B) $X^2$ computes a matrix of residuals Xcorr as in Eq. 4, which is sensitive to deviations in low count rows, as seen in the Bottom four rows in the example matrix X and Xcorr. The $X^2$ test then provides an asymptotically valid P-value via a distributional approximation. For interpretability, practitioners often use correspondence analysis (4) to interpret rejection of the null, a procedure with no statistical guarantees, which can fail to detect the desired structure. (C) depicts two example counts tables. The one on the left corresponds to concentrated (strong) signal, while the one on the right corresponds to diffuse (weak) signal. Both tables have 100 counts distributed evenly over 10 columns, with 12 rows. $X^2$ assigns both of them similar significance, but OASIS assigns a much smaller P-value to the Left table than the Right, agreeing with our intuition. (D) plots the empirical CDF of the P-values of OASIS and Pearson’s $X^2$. This is shown for the two classes of tables: ones with a strong concentrated two-group signal, and ones with a diffuse signal. OASIS yields significantly improved P-values for the case with strong signal, and substantially worse power than $X^2$ in the weak signal case, which visually looks like noise. $X^2$, on the other hand, yields much more similar performance in the two settings. Here, OASIS-opt is shown, which is run over five independent splits of the dataset.
The generative model for these tables is detailed in SI Appendix, section S.6.A, with additional plots showing e.g., the spectra of the centered and normalized contingency tables in (C), illustrating that OASIS prioritizes the first table with a more concentrated spectrum.
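To make the remark about low count rows in panel (B) concrete, the Python sketch below computes textbook Pearson residuals, (observed - expected) / sqrt(expected), and the per-row contributions to the $X^2$ statistic. It uses the standard textbook definition and assumes this corresponds to the residual matrix in the caption (Eq. 4 itself is not reproduced on this page). The helper names and the toy table are hypothetical.

```python
from math import sqrt

def pearson_residuals(X):
    """Textbook Pearson residuals (X - E) / sqrt(E), with E = row_total * col_total / M."""
    row_tot = [sum(row) for row in X]
    col_tot = [sum(col) for col in zip(*X)]
    M = sum(row_tot)
    residuals = []
    for i, row in enumerate(X):
        res_row = []
        for j, x in enumerate(row):
            expected = row_tot[i] * col_tot[j] / M  # assumes no empty row or column
            res_row.append((x - expected) / sqrt(expected))
        residuals.append(res_row)
    return residuals

def row_contributions(X):
    """Sum of squared residuals per row; these add up to the X^2 statistic."""
    return [sum(r * r for r in row) for row in pearson_residuals(X)]

def pearson_x2(X):
    """Pearson's X^2 statistic for a counts table."""
    return sum(row_contributions(X))

# Two well-populated, balanced rows plus one row holding only two observations.
X = [[50, 50],
     [50, 50],
     [2, 0]]
print(row_contributions(X))
print(pearson_x2(X))
```

In this toy table the final row, which holds only two observations, accounts for nearly all of the statistic, while the two rows with 100 observations each contribute almost nothing.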
Fig. 2.
Figure showing the algorithms we build using OASIS. (A) OASIS-opt employs data-splitting to generate optimized, data-dependent f and c, before generating a statistically valid P-value bound using the held-out test data. (B and C) depict two algorithms we propose for inferring latent structure from a collection of tables defined on the same set of columns. As building blocks, we use OASIS-opt for embedding-aggregation, and OASIS-iter (Section 5.1) for counts-aggregation. (B) Embedding-aggregation (Algorithm 2) performs inference on each table marginally using OASIS-opt and aggregates the resulting sample embeddings. (C) Counts-aggregation (Algorithm 1) stacks the contingency tables into one large matrix Xagg, and performs iterative analysis on this aggregated table using OASIS-iter.
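The data-splitting step in panel (A) can be pictured with the Python sketch below, which assigns each individual observation in a counts table independently to a training or a held-out test table. This is one simple way to split counts and is an assumption, not necessarily the splitting scheme the paper uses. The function name `split_counts`, the `train_frac` parameter, and the seed are hypothetical.

```python
import random

def split_counts(X, train_frac=0.5, seed=0):
    """Split a counts table into train and test tables of the same shape.

    Each of the X[i][j] observations in a cell goes to the training table
    with probability train_frac, independently; the rest go to the test table.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in X:
        train_row = [sum(rng.random() < train_frac for _ in range(x)) for x in row]
        train.append(train_row)
        test.append([x - t for x, t in zip(row, train_row)])
    return train, test

X = [[9, 8, 1, 2],
     [1, 2, 9, 8]]
train, test = split_counts(X, train_frac=0.5, seed=0)
print(train)
print(test)
```

Under a scheme like this, data-dependent f and c would be chosen using only `train`, and the statistic and its P-value bound would then be evaluated on `test`, so the choice of f and c does not reuse the data that is tested.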
Fig. 3.
(A) SPLASH (2) generates contingency tables from genomic sequencing data, here FASTQ files, for all $4^k$ possible anchor $k$-mers (length-$k$ genomic sequences). (B) Shown in greater detail is the process for one specific anchor, TGAAATTA. This anchor highlights a mutation between two strains of SARS-CoV-2, Omicron (purple) and Delta (orange). Below, viral sequencing data from four individuals (samples) infected with SARS-CoV-2 is shown. However, it is a priori unknown which strain each individual was infected with, and no reference genome is available. For the fixed anchor sequence (shown in blue), SPLASH counts for each sample the frequency of sequences that occur immediately after (target sequence), and generates a contingency table, where the columns are indexed by the samples and the rows are indexed by the sequences. Shown in (B) is one read in sample A which underwent sequencing error, highlighted in red, and thus yielded an additional discrete observation (a sequence), resulting in an extra row. Sequencing error leads to tables with many rows with low counts. Note that we cannot know a priori which rows of this table are due to sequencing error, as we simply observe raw sequencing data. (C) The contingency tables generated by SPLASH are defined over the same set of samples (patients), so we can use these tables to jointly infer sample origin. The plot shown depicts the results of embedding-aggregation (Algorithm 2) on SARS-CoV-2 data (7), perfectly predicting whether a patient has Delta or not, and yielding high predictive accuracy (92%) for subvariant classification (Omicron BA.1 versus BA.2). Counts-aggregation (Algorithm 1) can also be used to predict the strain of mutated targets, with 93% classification accuracy of whether a target was Delta or not. In the depicted toy example, this would correspond to grouping targets and individuals by strain as shown.
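The table construction described in panel (B) can be sketched in a few lines of Python: for a fixed anchor, count the sequence that immediately follows it in each sample's reads, then arrange the counts with targets as rows and samples as columns. This is a toy illustration and not SPLASH's implementation. The function name `anchor_target_table`, the target length, and the reads below are made up, and only the anchor TGAAATTA comes from the caption.

```python
from collections import Counter

def anchor_target_table(samples, anchor, target_len):
    """Build a targets-by-samples counts table for one anchor.

    samples: dict mapping sample name -> list of read strings.
    Returns (targets, sample_names, table) with table[i][j] the number of
    times targets[i] directly follows the anchor in sample_names[j].
    """
    counts = {name: Counter() for name in samples}
    for name, reads in samples.items():
        for read in reads:
            start = read.find(anchor)
            while start != -1:
                t0 = start + len(anchor)
                target = read[t0:t0 + target_len]
                if len(target) == target_len:  # skip targets cut off by the read end
                    counts[name][target] += 1
                start = read.find(anchor, start + 1)
    sample_names = list(samples)
    targets = sorted(set().union(*counts.values()))
    table = [[counts[name][t] for name in sample_names] for t in targets]
    return targets, sample_names, table

# Hypothetical reads; the third read of sample "A" mimics a sequencing error.
samples = {
    "A": ["CCTGAAATTAACGT", "TGAAATTAACGTGG", "GGTGAAATTAACCT"],
    "B": ["TGAAATTAGGCA", "ATGAAATTAGGCA"],
}
targets, sample_names, table = anchor_target_table(samples, "TGAAATTA", 4)
for target, row in zip(targets, table):
    print(target, row)
```

The single erroneous read produces its own low-count row in the table, which is the effect the caption describes.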
Fig. 4.
Analysis of SARS-CoV-2 coinfection data (7). Tables were generated by SPLASH (2) and tested with OASIS-opt. (A and B) depict the two-dimensional embeddings generated by embedding-aggregation and counts-aggregation respectively, and (C and D) show only the one-dimensional embedding. (A and C) depict the results of embedding-aggregation (Algorithm 2). (A) shows the generated 2D embeddings, which perfectly classify whether a patient has Delta or not by thresholding $c^{(1)}$ at 0, highlighted in (C). (B and D) depict the results of counts-aggregation (Algorithm 1). Delta and non-Delta samples are again perfectly separated, though no longer by a threshold of 0 on $c^{(1)}$. Analysis details are given in SI Appendix, section S.4.
Fig. 5.
Interpretation of OASIS-rejected null from M. tuberculosis data (7). Tables were generated by SPLASH (2) and tested with OASIS. (A) shows the generated 1D embeddings from embedding-aggregation (Algorithm 2), which perfectly classify patients based on sub-sub-lineage by thresholding $c^{(1)}$ at 0. (B) depicts the results of counts-aggregation (Algorithm 1). Two samples are misclassified (visually, one on top of the other), but with a much larger margin for the rest. 2D plots with $c^{(2)}$ are shown in SI Appendix, Fig. S14.

References

    1. Chen Y., Diaconis P., Holmes S. P., Liu J. S., Sequential Monte Carlo methods for statistical analysis of tables. J. Am. Stat. Assoc. 100, 109–120 (2005).
    2. Chaung K., et al., SPLASH: A statistical, reference-free genomic algorithm unifies biological discovery. Cell 186, 5440–5456 (2023).
    3. Pearson K., On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. London, Edinb. Dublin Philosoph. Magaz. J. Sci. 50, 157–175 (1900).
    4. Agresti A., Categorical Data Analysis (John Wiley & Sons, 2012), vol. 792.
    5. Diaconis P., Sturmfels B., Algebraic algorithms for sampling from conditional distributions. Ann. Stat. 26, 363–397 (1998).
