A statistical methodology for analyzing co-occurrence data from a large sample
- PMID: 17197246
- PMCID: PMC2041889
- DOI: 10.1016/j.jbi.2006.11.003
Abstract
Determining important associations among items in a large database is challenging due to multiple simultaneous hypotheses and the ability to select weak associations that are statistically but not clinically significant. The simple application of the χ² test among all possible pairs of items results in mostly inappropriate associations surpassing the traditional (α = .05, χ² = 3.84) threshold. One can choose a stricter threshold to find stronger associations, but the choice may be arbitrary. We combined the volume test of Diaconis and Efron with a p-value plot to select a more rigorous and less arbitrary threshold. The volume test adjusts the p-value of the χ² statistic. A plot of adjusted p-values (1 - p versus N(p)), where N(p) is the number of test statistics with a p-value greater than p, should be linear if there are no true associations. The point where the plot deviates from a line can be used as a threshold. We used linear regression to select the threshold in a reproducible fashion. In one experiment, we found that the method selected a threshold similar to that previously obtained by manually reviewing associations.
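The p-value plot step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the volume-test adjustment is not reproduced (the p-values are taken as already computed), and the function names, the `fit_from` and `n_sd` parameters, and the rule "deviation means a residual more than `n_sd` standard deviations above the fitted line" are all assumptions made for illustration.

```python
import numpy as np


def count_exceeding(p_values, grid):
    """N(p) for each p in grid: how many p-values are strictly greater than p."""
    sorted_p = np.sort(np.asarray(p_values, dtype=float))
    # searchsorted(side="right") gives the index of the first element > p.
    return sorted_p.size - np.searchsorted(sorted_p, grid, side="right")


def select_threshold(p_values, grid, fit_from=0.5, n_sd=3.0):
    """Return the largest grid p at which N(p) rises above the fitted line.

    Assumed procedure (hypothetical, for illustration only):
      1. Compute N(p) on the grid and plot it against x = 1 - p.
      2. Fit a straight line by least squares where p >= fit_from,
         a region presumed to contain only null (non-associated) pairs.
      3. Outside that region, flag points whose residual exceeds
         n_sd times the residual spread (floored at one count).
    Returns None when no point deviates.
    """
    grid = np.sort(np.asarray(grid, dtype=float))[::-1]  # descending p
    x = 1.0 - grid
    n_p = count_exceeding(p_values, grid)

    in_fit = grid >= fit_from
    slope, intercept = np.polyfit(x[in_fit], n_p[in_fit], deg=1)

    residual = n_p - (slope * x + intercept)
    limit = n_sd * max(residual[in_fit].std(), 1.0)

    deviating = (~in_fit) & (residual > limit)
    if not deviating.any():
        return None
    return float(grid[deviating].max())
```

With uniformly distributed p-values, N(p) is close to m(1 - p) for m tests, so the fitted line tracks the whole plot and the function returns `None`. When a group of very small p-values is present, N(p) jumps above the line as p approaches zero, and the returned p marks where that jump begins on the chosen grid. A logarithmically spaced grid near zero (for example from `np.logspace`) is a reasonable choice because the thresholds of interest are small.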
References
- Efron B. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J Amer Statist Assoc. 2004;99:96–104.
- Yates F. Contingency tables involving small numbers and the χ2 test. Journal of the Royal Statistical Society (Supplement). 1934;1:217–235.
- Diaconis P, Efron B. Testing for independence in a two-way table. Annals of Statistics. 1985;13:845–874.
- Lindsay BG, Markatou M. Statistical Distances: A Global Framework to Inference. Springer Series in Statistics. In press.
