J Biomed Inform. 2007 Jun;40(3):343-52.
doi: 10.1016/j.jbi.2006.11.003. Epub 2006 Dec 1.

A statistical methodology for analyzing co-occurrence data from a large sample

Hui Cao et al. J Biomed Inform. 2007 Jun.

Abstract

Determining important associations among items in a large database is challenging because many hypotheses are tested simultaneously and because weak associations can be selected that are statistically but not clinically significant. The simple application of the χ² test to all possible pairs of items results in mostly inappropriate associations surpassing the traditional threshold (α = .05, χ² = 3.84 with one degree of freedom). One can choose a stricter threshold to find stronger associations, but the choice may be arbitrary. We combined the volume test of Diaconis and Efron with a p-value plot to select a more rigorous and less arbitrary threshold. The volume test adjusts the p-value of the χ² statistic. A plot of adjusted p-values (1 - p versus N(p)), where N(p) is the number of test statistics with a p-value greater than p, should be linear if there are no true associations. The point where the plot deviates from a line can be used as a threshold. We used linear regression to select the threshold in a reproducible fashion. In one experiment, we found that the method selected a threshold similar to that previously obtained by manually reviewing associations.
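
The p-value plot and the regression-based choice of threshold can be illustrated with a minimal sketch. This is not the authors' procedure: it takes already-adjusted p-values as input and does not implement the volume test, and the function names, the `fit_above` cutoff, and the `tol` tolerance are all hypothetical choices made for illustration. The idea is to fit a straight line to the part of the plot dominated by non-associations (large p), extend it toward small p, and report the largest p at which the observed N(p) rises clearly above the line.

```python
from bisect import bisect_right


def n_greater(pvals):
    """Return (p, N(p)) pairs in ascending p, where N(p) counts p-values strictly greater than p."""
    ordered = sorted(pvals)
    m = len(ordered)
    return [(p, m - bisect_right(ordered, p)) for p in ordered]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept (needs two distinct x values)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def select_threshold(pvals, fit_above=0.5, tol=3.0):
    """Sketch of a threshold choice from the plot of N(p) against 1 - p.

    fit_above and tol are assumptions of this sketch, not values from the paper:
    the line is fitted to points with p >= fit_above, and a point counts as a
    deviation when N(p) exceeds the fitted line by more than tol.
    """
    counts = n_greater(pvals)
    null_region = [(1.0 - p, n) for p, n in counts if p >= fit_above]
    slope, intercept = fit_line(null_region)
    # Walk from large p toward small p and stop at the first clear deviation.
    for p, n in reversed(counts):
        if n - (slope * (1.0 - p) + intercept) > tol:
            return p, slope
    return None, slope  # the plot stays linear: no evidence of true associations


# Synthetic illustration: 200 evenly spread "null" p-values plus 20 very small ones.
nulls = [(i + 0.5) / 200 for i in range(200)]
signals = [0.0001 * (j + 1) for j in range(20)]
threshold, slope = select_threshold(nulls + signals)
print("threshold:", threshold, "fitted slope:", slope)
```

In this sketch the fitted slope estimates the number of pairs with no true association, since N(p) is roughly that number times (1 - p) when p-values of non-associations are uniformly distributed. The tolerance controls how far above the line a point must lie before it is treated as a deviation, so the selected threshold depends on it.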

Figures

Figure 1. The plot of adjusted p values in the overall co-occurrence data
Figure 2. Plot of p values adjusted by the general volume test for the hypertension data
Figure 3. Plot of p values adjusted by the general volume test for the bipolar disorder data
Figure 4. Plot of p values adjusted by the general volume test for the depressive disorder data
Figure 5. Plot of p values from conditional volume tests for the hypertension data
Figure 6. Plot of p values adjusted by conditional volume test for the bipolar disorder data
Figure 7. Plot of p values adjusted by conditional volume test for the bipolar disorder data

References

    1. Cao H, Markatou M, Melton GB, Chiang MF, Hripcsak G. Mining a clinical data warehouse to discover disease-finding associations using co-occurrence statistics. Proc AMIA Symp. 2005:106–111.
    2. Efron B. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J Amer Statist Assoc. 2004;99:96–104.
    3. Yates F. Contingency table involving small numbers and the χ² test. Journal of the Royal Statistical Society (Supplement). 1934;1:217–235.
    4. Diaconis P, Efron B. Testing for independence in a two-way table. Annals of Statistics. 1985;13:845–874.
    5. Lindsay BG, Markatou M. Statistical Distances: A Global Framework to Inference. Springer Series in Statistics. In press.
