BMC Genomics. 2013 Mar 11:14:161.

doi: 10.1186/1471-2164-14-161.

Correction of unexpected distributions of P values from analysis of whole genome arrays by rectifying violation of statistical assumptions

Sheila J Barton¹, Sarah R Crozier, Karen A Lillycrop, Keith M Godfrey, Hazel M Inskip

PMID: 23496791
PMCID: PMC3610227
DOI: 10.1186/1471-2164-14-161

Sheila J Barton et al. BMC Genomics. 2013.

Affiliation

¹ MRC Lifecourse Epidemiology Unit, University of Southampton, Southampton, UK. S.J.Barton@soton.ac.uk

Abstract

Background: Statistical analysis of genome-wide microarrays can involve many thousands of identical statistical tests, because each probe is tested for an association with a phenotype of interest. If no probe were associated with the phenotype, the P values from these tests would follow an approximately Uniform distribution. If some probes were truly associated with the phenotype, we would expect their P values to fall below the designated significance level, alpha, giving more P values below alpha than expected by chance.
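To make "expected by chance" concrete, the following minimal sketch computes how many P values would fall at or below alpha if every null hypothesis held, and compares that with a simulated set of Uniform P values. The probe count is the one quoted in the Figure 1 legend; the seed, the function name `count_at_or_below`, and the assumption of independent tests are illustrative choices, not taken from the paper.

```python
import math
import random

def count_at_or_below(p_values, alpha=0.05):
    """Count the P values that are <= alpha."""
    return sum(1 for p in p_values if p <= alpha)

# Under the global null, P values behave like draws from Uniform(0, 1),
# so about alpha * n_tests of them fall at or below alpha by chance alone.
n_tests = 237_152          # probe count quoted in the Figure 1 legend
alpha = 0.05
expected = alpha * n_tests
# Binomial standard deviation of that count, assuming independent tests.
chance_sd = math.sqrt(n_tests * alpha * (1 - alpha))

rng = random.Random(0)     # arbitrary seed, for reproducibility only
null_p_values = [rng.random() for _ in range(n_tests)]
observed = count_at_or_below(null_p_values, alpha)

print(f"expected {expected:.0f} +/- {chance_sd:.0f}, simulated {observed}")
```

A count well above this expectation points to genuine associations; a count well below it, as reported in the Results, suggests that something is wrong with the test itself.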

Results: In data from a whole genome methylation promoter array we unexpectedly observed P value distributions with fewer P values below alpha than would be expected by chance. Our data suggest that a possible reason is heteroskedasticity, which violates the statistical assumptions required for these tests. A simple but statistically sound remedy, using a heteroskedasticity-consistent covariance matrix estimator to calculate standard errors of regression coefficients that are robust to heteroskedasticity, rectified this violation and produced meaningful P value distributions.
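As a rough illustration of what a heteroskedasticity-consistent estimator changes, the sketch below fits a single-predictor regression and computes the slope's standard error two ways: the classical formula, and the HC0 (White) sandwich formula. The abstract does not say which HC variant the authors used, so HC0 is an assumption here, as are the function names, the toy data, and the normal approximation used for the P value.

```python
import math

def slope_with_standard_errors(x, y):
    """Fit y = a + b*x by ordinary least squares.

    Returns (slope, classical_se, robust_se), where robust_se uses the
    HC0 (White) sandwich formula for the slope.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Classical: one common error variance shared by all observations.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    classical_se = math.sqrt(sigma2 / sxx)

    # HC0: each observation contributes its own squared residual.
    robust_se = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, classical_se, robust_se

def two_sided_p(estimate, se):
    """Large-sample two-sided P value from a standard normal reference."""
    return math.erfc(abs(estimate / se) / math.sqrt(2))

# Toy data (hypothetical), used only to show the two standard errors side by side.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 1.9, 3.2, 3.7, 5.6, 5.1, 8.4, 6.3]
slope, classical_se, robust_se = slope_with_standard_errors(x, y)
print(slope, two_sided_p(slope, classical_se), two_sided_p(slope, robust_se))
```

The point estimate of the slope is identical in both cases; only the standard error, and therefore the P value, changes. With several covariates the same idea is written in matrix form, and small-sample variants such as HC3 rescale the squared residuals.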

Conclusions: The statistical analysis of 'omics data requires careful handling, especially in the choice of statistical test. To obtain meaningful results it is essential that the assumptions behind these tests are carefully examined and any violations rectified where possible, or a more appropriate statistical test chosen.

Figures

**Figure 1**
**The distribution of P values for regression coefficients with a neuro-cognitive outcome.** Associations of 237,152 probe log ratio values with a continuous neuro-cognitive outcome were estimated using linear regression, controlling for gender. Figure 1 shows that there are more P values ≤ 0.05 than would be expected by chance and thus suggests that some probes are significantly associated with the neuro-cognitive outcome.

**Figure 2**
**The distribution of P values after a simulation exercise using permutation of outcome values.** Figure 2 shows there are no more P values ≤ 0.05 than would be expected by chance, resulting in a distribution similar to a Uniform distribution.
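A permutation exercise of this kind can be sketched as follows. The data are simulated, the sample and probe counts are hypothetical and far smaller than a real array, the outcome is treated as the response with each probe as the single predictor (an assumption), and P values use a normal approximation rather than the t distribution. None of this is the authors' code.

```python
import math
import random

def slope_p_value(x, y):
    """Two-sided P value for the OLS slope of y on x (normal approximation)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    rss = sum((yi - y_bar - slope * d) ** 2 for d, yi in zip(dx, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return math.erfc(abs(slope / se) / math.sqrt(2))

rng = random.Random(1)                        # arbitrary seed
n_samples, n_probes, n_linked = 100, 2000, 200  # hypothetical sizes

# Simulated data: the first n_linked probes track the outcome, the rest do not.
outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
probes = []
for j in range(n_probes):
    effect = 0.5 if j < n_linked else 0.0
    probes.append([effect * o + rng.gauss(0, 1) for o in outcome])

def fraction_significant(response, alpha=0.05):
    """Share of probes whose slope P value is <= alpha for this response."""
    p_values = [slope_p_value(probe, response) for probe in probes]
    return sum(p <= alpha for p in p_values) / len(p_values)

# Shuffling the outcome breaks any real probe-outcome link while keeping
# the marginal distributions of both unchanged.
permuted = outcome[:]
rng.shuffle(permuted)

before_fraction = fraction_significant(outcome)
after_fraction = fraction_significant(permuted)
print(f"<= 0.05 before permutation: {before_fraction:.3f}, after: {after_fraction:.3f}")
```

Before permutation the share of small P values exceeds 0.05 because some probes are truly linked to the outcome; after permutation it should fall back to roughly 0.05, the Uniform pattern described in the legend.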

**Figure 3**
**The distribution of P values for regression coefficients with a body composition outcome.** Figure 3 shows that fewer P values ≤ 0.05 were obtained than would be expected by chance.

**Figure 4**
**Regression residuals for a body composition outcome plotted against values predicted by the regression.** Figure 4 shows the regression residuals increasing with predicted value (the points on the plot spread out in a fan shape), indicating heteroskedasticity.
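The fan shape is usually judged by eye, but a crude numeric stand-in is the correlation between absolute residuals and predicted values. The sketch below is an illustrative check on simulated data, not the diagnostic used in the paper; the name `fan_shape_score`, the seed, and the noise models are all assumptions.

```python
import math
import random

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def fan_shape_score(x, y):
    """Correlation between |residual| and fitted value from an OLS fit of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    fitted = [y_bar + slope * (xi - x_bar) for xi in x]
    abs_resid = [abs(yi - fi) for yi, fi in zip(y, fitted)]
    return pearson(abs_resid, fitted)

rng = random.Random(2)  # arbitrary seed
x = [rng.uniform(1, 10) for _ in range(1000)]
# Same mean trend, two noise models: constant spread versus spread growing with x.
constant_noise = [2 * xi + rng.gauss(0, 3) for xi in x]
growing_noise = [2 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

score_constant = fan_shape_score(x, constant_noise)
score_growing = fan_shape_score(x, growing_noise)
print(f"constant spread: {score_constant:.2f}, growing spread: {score_growing:.2f}")
```

A score near zero is consistent with constant residual spread, while a clearly positive score matches residuals that widen as the predicted value increases. Formal alternatives such as the Breusch-Pagan test address the same question.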

**Figure 5**
**The distribution of P values with a body composition outcome using robust standard errors.** Figure 5 indicates that a number of probes are associated with the outcome of interest as there are more P values ≤ 0.05 than would be expected by chance.

**Figure 6**
**Comparison of P values ≤ 0.01 from classical linear regression with the same P values recalculated using robust standard errors.** Figure 6 shows that the majority of these P values are still ≤ 0.01 after re-estimation.
