BMC Genomics. 2013 Mar 11;14:161.
doi: 10.1186/1471-2164-14-161.

Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions

Sheila J Barton et al. BMC Genomics.

Abstract

Background: Statistical analysis of genome-wide microarrays can result in many thousands of identical statistical tests being performed as each probe is tested for an association with a phenotype of interest. If there were no association between any of the probes and the phenotype, the distribution of P values obtained from statistical tests would resemble a Uniform distribution. If a selection of probes were significantly associated with the phenotype we would expect to observe P values for these probes of less than the designated significance level, alpha, resulting in more P values of less than alpha than expected by chance.
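The expectation described in the Background can be illustrated with a small simulation. This is a minimal sketch, not the authors' analysis: it assumes a simple two-sided z-test with known unit variance as a stand-in for the per-probe regression test, and the function names, probe counts, and effect size are all hypothetical.

```python
import random
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_p_values(n_probes, n_samples, effect_probes=0, effect=0.0, seed=0):
    """One z-test per simulated probe; the first `effect_probes` probes
    have a true mean of `effect`, all others have a true mean of zero."""
    rng = random.Random(seed)
    p_values = []
    for probe in range(n_probes):
        shift = effect if probe < effect_probes else 0.0
        values = [rng.gauss(shift, 1.0) for _ in range(n_samples)]
        mean = sum(values) / n_samples
        z = mean * n_samples ** 0.5  # variance is known to be 1 in this toy setup
        p_values.append(two_sided_p(z))
    return p_values


def fraction_below(p_values, alpha=0.05):
    """Share of P values at or below the significance level alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)


# No probe associated: P values are Uniform, so about alpha of them fall below alpha.
null_p = simulate_p_values(5000, 20)
# 10% of probes associated: an excess of small P values appears.
mixed_p = simulate_p_values(5000, 20, effect_probes=500, effect=1.0)

print("null  fraction <= 0.05:", fraction_below(null_p))
print("mixed fraction <= 0.05:", fraction_below(mixed_p))
```

With no true associations the fraction of P values at or below 0.05 should sit close to 0.05, while the mixed case should show a clear excess.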

Results: In data from a whole genome methylation promoter array we unexpectedly observed P value distributions where there were fewer P values less than alpha than would be expected by chance. Our data suggest that a possible reason for this is a violation of the statistical assumptions required for these tests arising from heteroskedasticity. A simple but statistically sound remedy (a heteroskedasticity-consistent covariance matrix estimator to calculate standard errors of regression coefficients that are robust to heteroskedasticity) rectified this violation and resulted in meaningful P value distributions.
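To make the remedy concrete, the following sketch compares the classical standard error of a regression slope with a heteroskedasticity-consistent one. It is an illustration under stated assumptions: the abstract does not say which variant of the estimator was used, so White's HC0 form is assumed here; the model is reduced to a single predictor plus intercept (the paper's regressions also controlled for gender); and the simulated data and function name are hypothetical.

```python
import random


def slope_standard_errors(x, y):
    """Fit y = a + b*x by least squares and return
    (slope, classical standard error, HC0 robust standard error)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Classical estimate: one common error variance for every observation.
    classical_var = sum(e * e for e in resid) / (n - 2) / sxx
    # HC0 sandwich estimate: each observation contributes its own squared residual.
    hc0_var = sum((d * e) ** 2 for d, e in zip(dx, resid)) / sxx ** 2
    return slope, classical_var ** 0.5, hc0_var ** 0.5


# Hypothetical data whose error spread grows with x (a fan shape).
rng = random.Random(42)
x = [rng.uniform(0.0, 10.0) for _ in range(2000)]
y = [1.0 + 0.5 * xi + rng.gauss(0.0, 0.1 * xi * xi) for xi in x]

slope, classical_se, hc0_se = slope_standard_errors(x, y)
print("slope:", slope)
print("classical SE:", classical_se)
print("HC0 robust SE:", hc0_se)
```

The test statistic is the slope divided by its standard error, so swapping in the robust standard error changes the resulting P value. The two estimates can differ in either direction depending on how the error variance relates to the predictor; in this simulated fan shape the classical value is the smaller one.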

Conclusions: The statistical analysis of 'omics data requires careful handling, especially in the choice of statistical test. To obtain meaningful results it is essential that the assumptions behind these tests are carefully examined and any violations rectified where possible, or a more appropriate statistical test chosen.
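One way to examine the constant-variance assumption numerically, alongside a residual plot, is to compare residual spread at low and high fitted values. The sketch below is a rough check in the spirit of a Goldfeld-Quandt comparison, not a procedure taken from the paper; the function names, the split into thirds, and the simulated data are assumptions made for illustration.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def residual_spread_ratio(x, y):
    """Mean squared residual in the top third of fitted values divided by
    the mean squared residual in the bottom third."""
    intercept, slope = fit_line(x, y)
    fitted_and_resid = sorted(
        (intercept + slope * xi, yi - (intercept + slope * xi))
        for xi, yi in zip(x, y)
    )
    third = len(fitted_and_resid) // 3

    def mean_square(part):
        return sum(e * e for _, e in part) / len(part)

    return mean_square(fitted_and_resid[-third:]) / mean_square(fitted_and_resid[:third])


rng = random.Random(7)
x = [rng.uniform(1.0, 10.0) for _ in range(600)]
# Error spread grows with x, so residuals fan out as the fitted value increases.
fan_y = [2.0 + 0.5 * xi + rng.gauss(0.0, 0.3 * xi) for xi in x]
# Error spread is constant, as classical regression assumes.
flat_y = [2.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

fan_ratio = residual_spread_ratio(x, fan_y)
flat_ratio = residual_spread_ratio(x, flat_y)
print("fan-shaped data ratio:", fan_ratio)
print("constant-variance data ratio:", flat_ratio)
```

A ratio near 1 is consistent with constant variance, while a ratio well above 1 suggests residuals that widen with the predicted value.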

Figures

Figure 1
The distribution of P values for regression coefficients with a neuro-cognitive outcome. Associations of 237,152 probe log ratio values with a continuous neuro-cognitive outcome were estimated using linear regression controlling for gender. Figure 1 shows that there are more P values ≤ 0.05 than would be expected by chance, which suggests that some probes are significantly associated with the neuro-cognitive outcome.
Figure 2
The distribution of P values after a simulation exercise using permutation of outcome values. Figure 2 shows there are no more P values ≤ 0.05 than would be expected by chance, resulting in a distribution similar to a Uniform distribution.
Figure 3
The distribution of P values for regression coefficients with a body composition outcome. Figure 3 shows that fewer P values ≤ 0.05 were obtained than would be expected by chance.
Figure 4
Regression residuals for a body composition outcome plotted against values predicted by the regression. Figure 4 shows the regression residuals increasing with predicted value (the points on the plot spread out in a fan shape), indicating heteroskedasticity.
Figure 5
The distribution of P values with a body composition outcome using robust standard errors. Figure 5 indicates that a number of probes are associated with the outcome of interest as there are more P values ≤ 0.05 than would be expected by chance.
Figure 6
Comparison of P values ≤ 0.01 from classical linear regression with the same P values recalculated using robust standard errors. Figure 6 shows that the majority of these P values are still ≤ 0.01 after re-estimation.
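The permutation exercise behind Figure 2 can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' code: the data are simulated, only a single permutation is drawn, the regression has one predictor with no gender covariate, and the P value uses a large-sample normal approximation to the t statistic. All names and sizes are hypothetical.

```python
import random
from statistics import NormalDist


def slope_p_value(x, y):
    """Approximate two-sided P value for the slope of y regressed on x,
    using classical standard errors and a normal approximation."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (rss / (n - 2) / sxx) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(slope / se)))


def fraction_below(p_values, alpha=0.05):
    """Share of P values at or below the significance level alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)


rng = random.Random(1)
n_samples, n_probes = 100, 1000
outcome = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

# The first 100 simulated probes carry a signal related to the outcome.
probes = []
for j in range(n_probes):
    signal = 0.5 if j < 100 else 0.0
    probes.append([signal * o + rng.gauss(0.0, 1.0) for o in outcome])

observed = [slope_p_value(probe, outcome) for probe in probes]

# Shuffling the outcome breaks any real probe-outcome association.
permuted_outcome = outcome[:]
rng.shuffle(permuted_outcome)
permuted = [slope_p_value(probe, permuted_outcome) for probe in probes]

print("observed fraction <= 0.05:", fraction_below(observed))
print("permuted fraction <= 0.05:", fraction_below(permuted))
```

After permutation the fraction of P values at or below 0.05 should fall back to roughly the nominal level, while the unpermuted data show an excess of small P values.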

