BMC Genomics. 2012;13 Suppl 8(Suppl 8):S23.
doi: 10.1186/1471-2164-13-S8-S23. Epub 2012 Dec 17.

Literature aided determination of data quality and statistical significance threshold for gene expression studies

Lijing Xu et al. BMC Genomics. 2012.

Abstract

Background: Gene expression data are noisy because of technical and biological variability, which complicates their analysis. Different statistical methods produce distinct sets of genes, and the selection of an expression p-value (EPv) threshold is somewhat arbitrary. In this study, we aimed to develop novel literature-based approaches that integrate functional information into the analysis of gene expression data.

Methods: Functional relationships between genes were derived by Latent Semantic Indexing (LSI) of Medline abstracts and used to calculate the functional cohesion of gene sets. Literature cohesion was applied in two ways. First, the Literature-Based Functional Significance (LBFS) method was developed to calculate a p-value for the cohesion of differentially expressed genes (DEGs), in order to objectively evaluate the overall biological significance of a gene expression experiment. Second, the Literature Aided Statistical Significance Threshold (LASST) method was developed to determine the appropriate expression p-value threshold for a given experiment.

Results: We tested our methods on three publicly available datasets. LBFS analysis demonstrated that only two experiments were significantly cohesive. For each experiment, we also compared the LBFS values of DEGs generated by four different statistical methods. Some statistical tests produced more functionally cohesive gene sets than others, but no test was consistently better across all experiments. This reemphasizes that a statistical test must be carefully selected for each expression study. Moreover, LASST analysis demonstrated that the expression p-value thresholds for some experiments were considerably lower (p < 0.02 and 0.01), suggesting that the arbitrary p-value and false discovery rate thresholds commonly used in expression studies may not be biologically sound.

Conclusions: We have developed robust and objective literature-based methods to evaluate the biological support for gene expression experiments and to determine the appropriate statistical significance threshold. These methods will help investigators extract biologically meaningful insights from high-throughput gene expression experiments more efficiently.

Figures

Figure 1
Overview of the LBFS algorithm. A statistical test was applied to identify differentially expressed genes (DEGs) from the original labeled (OL) and permuted labeled (PL) samples. Subsets of 50 genes were randomly selected 1000 times from each pool of DEGs, and a literature p-value (LPv) was calculated for each 50-gene set. Fisher's exact test was used to determine whether the proportion of subsets with LPv < 0.05 (called the LCI) in the OL group differed significantly from that in the PL group.
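
The resampling-and-comparison step in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function `literature_pvalue` is a hypothetical placeholder for the LSI-based cohesion calculation, which is not specified on this page. The LPv cutoff is left as a parameter, and its default of 0.05 is an assumption taken from the Figure 2 caption. The choice of a two-sided Fisher's exact test is also an assumption. Each DEG pool must contain at least `subset_size` genes.

```python
import math
import random


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def count_cohesive_subsets(degs, literature_pvalue, n_subsets, subset_size,
                           lpv_cutoff, rng):
    """Count random subsets whose literature p-value falls below the cutoff."""
    hits = 0
    for _ in range(n_subsets):
        subset = rng.sample(degs, subset_size)
        if literature_pvalue(subset) < lpv_cutoff:
            hits += 1
    return hits


def lbfs_pvalue(ol_degs, pl_degs, literature_pvalue, n_subsets=1000,
                subset_size=50, lpv_cutoff=0.05, seed=0):
    """Compare the share of cohesive subsets in the OL and PL DEG pools."""
    rng = random.Random(seed)
    ol_hits = count_cohesive_subsets(ol_degs, literature_pvalue, n_subsets,
                                     subset_size, lpv_cutoff, rng)
    pl_hits = count_cohesive_subsets(pl_degs, literature_pvalue, n_subsets,
                                     subset_size, lpv_cutoff, rng)
    # Rows: OL vs PL pool; columns: cohesive vs non-cohesive subsets.
    return fisher_exact_two_sided(ol_hits, n_subsets - ol_hits,
                                  pl_hits, n_subsets - pl_hits)
```

Dividing each hit count by `n_subsets` gives the proportion that the caption calls the LCI. The Fisher test compares the two counts directly, so the sketch never needs the proportions themselves.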
Figure 2
Relationship between EPv and LCI. The fraction of gene sets with LPv < 0.05 (the LCI, y-axis) was plotted at various expression p-value (EPv) thresholds (x-axis) for three different datasets. The inset shows a magnified view for EPv < 0.10.
Figure 3
Relationship between EPv and LCI at various LPv thresholds. The LCI at LPv thresholds ranging from 0.01 to 0.1 (y-axis) was plotted against various EPv thresholds (x-axis) for the PGC-1beta dataset. The inset shows a magnified view for EPv < 0.10. The shapes of the curves are similar across LPv thresholds.
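
The curves described in Figures 2 and 3 plot the LCI against the EPv threshold. A minimal Python sketch of how such a curve might be computed follows. It is not the authors' code. The function `literature_pvalue` is a hypothetical placeholder for the LSI-based cohesion calculation, and the input layout (a dict mapping gene to EPv) is an assumption. Subset size and count follow the Figure 1 caption. The rule LASST uses to pick a threshold from the curve is not stated on this page, so the sketch stops at the curve.

```python
import random


def lci_curve(gene_epvs, literature_pvalue, epv_thresholds, n_subsets=1000,
              subset_size=50, lpv_cutoff=0.05, seed=0):
    """Return {EPv threshold: LCI}, or None where too few genes pass.

    gene_epvs maps gene identifier -> expression p-value (assumed layout).
    """
    rng = random.Random(seed)
    curve = {}
    for threshold in epv_thresholds:
        # Genes called differentially expressed at this EPv threshold.
        degs = sorted(g for g, p in gene_epvs.items() if p < threshold)
        if len(degs) < subset_size:
            curve[threshold] = None  # not enough DEGs to draw a subset
            continue
        hits = sum(
            literature_pvalue(rng.sample(degs, subset_size)) < lpv_cutoff
            for _ in range(n_subsets)
        )
        # LCI: fraction of random subsets that are cohesive in the literature.
        curve[threshold] = hits / n_subsets
    return curve
```

Calling `lci_curve` repeatedly with different `lpv_cutoff` values (for example 0.01 through 0.1) would give one curve per LPv threshold, in the manner of Figure 3.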
