BMC Med Genomics. 2010 May 6;3:17.

doi: 10.1186/1755-8794-3-17.

Predicting environmental chemical factors associated with disease-related gene expression data

Chirag J Patel¹, Atul J Butte

Chirag J Patel et al. BMC Med Genomics. 2010.

Affiliation

¹ Department of Pediatrics, Stanford University School of Medicine, Stanford, CA 94305, USA.

PMID: 20459635
PMCID: PMC2880288
DOI: 10.1186/1755-8794-3-17

Abstract

Background: Many common diseases arise from an interaction between environmental and genetic factors. Our knowledge of gene-environment interactions is growing, but frameworks for associating these interactions with disease using preexisting, publicly available data have been lacking. Integrating freely available environment-gene interaction data with disease phenotype data would allow hypothesis generation for potential environmental associations with disease.

Methods: We integrated publicly available disease-specific gene expression microarray data and curated chemical-gene interaction data to systematically predict environmental chemicals associated with disease. We derived chemical-gene signatures for 1,338 environmental chemicals from the Comparative Toxicogenomics Database (CTD). We associated these chemical-gene signatures with differentially expressed genes from datasets in the Gene Expression Omnibus (GEO) through an enrichment test.

Results: We verified our analytic method by accurately identifying chemicals applied to samples and cell lines. We also predicted known and novel environmental associations with prostate, lung, and breast cancers, such as estradiol and bisphenol A.

Conclusions: We have developed a scalable and statistical method to identify possible environmental associations with disease using publicly available data and have validated some of the associations in the literature.

Figures

**Figure 1**
**Prediction database creation based on the Comparative Toxicogenomics Database (CTD)**. A.) The CTD contained 85,937 total unique chemical-gene relations over 4,078 chemicals and 15,461 genes. Each relation had one or more citations of support. An example hypothetical relation, "*TCDD* leads to *higher expression of CYP1A1* mRNA in *H. sapiens* as shown in *Anwar-Mohamed et al*", is seen on the right panel. B.) Creation of chemical-gene set relations. Each chemical-gene relation had a number of citations of support, x_i. For each chemical, we constructed a gene set, or "signature", from the individual chemical-gene relations. We retained only signatures that had at least 5 genes in the set, leaving a total of 1,338 chemical-gene sets. An example of one chemical-gene set is seen on the right panel of B: the genes *CYP1A1*, *AHR*, and *AHR2* are shown to have multiple citations for the relation (60, 40, and 9, respectively).
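The signature construction in panel B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the input format (tuples of chemical, gene, citation count), the function name `build_signatures`, and the chemical and gene names in the toy data other than *CYP1A1*, *AHR*, and *AHR2* are hypothetical. Only the size rule (at least 5 genes per signature) comes from the caption.

```python
from collections import defaultdict

MIN_GENES = 5  # minimum signature size stated in the Figure 1 caption


def build_signatures(relations, min_genes=MIN_GENES):
    """Group (chemical, gene, n_citations) relations into per-chemical gene sets.

    Returns {chemical: {gene: citation_count}} and keeps only chemicals
    whose gene set has at least `min_genes` distinct genes.
    """
    by_chemical = defaultdict(dict)
    for chemical, gene, n_citations in relations:
        # Assumption: repeated chemical-gene pairs have their citations summed.
        previous = by_chemical[chemical].get(gene, 0)
        by_chemical[chemical][gene] = previous + n_citations
    return {
        chemical: genes
        for chemical, genes in by_chemical.items()
        if len(genes) >= min_genes
    }


# Toy data (hypothetical): one chemical with 5 genes, one with only 2.
toy_relations = [
    ("ExampleChemical", "CYP1A1", 60),
    ("ExampleChemical", "AHR", 40),
    ("ExampleChemical", "AHR2", 9),
    ("ExampleChemical", "GENE_A", 1),
    ("ExampleChemical", "GENE_B", 2),
    ("SparseChemical", "GENE_A", 1),
    ("SparseChemical", "GENE_C", 3),
]

signatures = build_signatures(toy_relations)
```

Here `SparseChemical` is dropped because its set has fewer than 5 genes, mirroring the filter that reduced 4,078 chemicals to 1,338 chemical-gene sets.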

**Figure 2**
**Predicting environmental chemical association to gene expression datasets**. A.) A representation of the 1,338 chemical-gene sets in our prediction database. B.) For the validation step, we conducted SAM to find genes whose expression was altered in each of our datasets. We then mapped the differentially expressed genes to the corresponding genes from other species in our database by using Homologene. For each chemical-gene set signature, we conducted a hypergeometric test for enrichment and ranked each result by p-value. C.) We applied the approach used in B to predict chemical associations with prostate, breast, and lung cancer data and validated these results with the curated disease-chemical annotations from the CTD. D.) Representation of the curated disease-chemical associations in the CTD.
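The enrichment and ranking step in panel B can be illustrated with a small standard-library sketch. It assumes the differentially expressed genes are already identified and mapped to a common gene identifier space (the SAM and Homologene steps are not shown), and that the test is the one-sided upper tail of the hypergeometric distribution. The function names and the choice of gene universe are assumptions made for illustration.

```python
from math import comb


def hypergeom_upper_tail(overlap, set_size, n_selected, n_universe):
    """P(X >= overlap) for X ~ Hypergeometric(n_universe, set_size, n_selected)."""
    max_overlap = min(set_size, n_selected)
    tail = sum(
        comb(set_size, k) * comb(n_universe - set_size, n_selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / comb(n_universe, n_selected)


def rank_chemicals(signatures, de_genes, universe):
    """Test each chemical-gene set for enrichment and sort by p-value."""
    universe = set(universe)
    de = set(de_genes) & universe
    results = []
    for chemical, genes in signatures.items():
        gene_set = set(genes) & universe
        overlap = len(gene_set & de)
        p_value = hypergeom_upper_tail(
            overlap, len(gene_set), len(de), len(universe)
        )
        results.append((chemical, overlap, p_value))
    # Smallest p-value first, as in the ranked prediction list.
    return sorted(results, key=lambda row: row[2])


# Toy example (hypothetical genes g0..g19 and chemicals A and B).
universe = [f"g{i}" for i in range(20)]
de_genes = ["g0", "g1", "g2", "g3", "g4"]
toy_signatures = {
    "A": ["g0", "g1", "g2", "g3", "g4"],       # fully overlaps the DE genes
    "B": ["g10", "g11", "g12", "g13", "g14"],  # no overlap
}
ranked = rank_chemicals(toy_signatures, de_genes, universe)
```

In this toy case chemical A ranks first because all five of its genes are differentially expressed, while chemical B has no overlap and receives a p-value of 1.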

**Figure 3**
**Clustering chemical prediction lists by biological activity archived in PubChem**. A.) A representation of the CTD and chemical-gene sets as shown in detail in Figure 2. B.) Prediction of the chemicals associated with each cancer dataset using chemical-gene sets from the CTD. We selected highly significant chemical predictions for each cancer and clustered these chemicals by their "Bioactivity" similarity as defined and computed in PubChem. C.) Within PubChem, each of these chemicals has a vector of standardized BioAssay scores. PubChem had 790 BioAssay scores for 66 of our significant predictions. The PubChem BioActivity similarity tool uses these vectors of scores to compute the biological activity similarity for each pair of chemicals, and the similarities are represented as a matrix.

**Figure 4**
**Curated disease-chemical enrichment versus prediction lists for prostate, lung, and breast cancer datasets**. For a prediction list, we selected chemicals that ranked within α = 10^-4, 10^-3, 10^-2, and 0.05. The -log10(threshold), along with the total number of chemicals found (in parentheses) for each threshold, is shown on the x-axis of each panel. We tested whether the highly ranked chemicals found under each threshold were enriched for chemicals that had a known curated association with the cancer in question. The -log10(p-value) for this enrichment is shown on the y-axis. The solid round red marker represents the enrichment test for the actual disease on which the predictions were based; the number underneath represents the total number of chemicals found in the prediction list that had a curated association with the disease and the percent found among all curated relations for that disease. We estimated accuracy and precision by computing disease-chemical enrichment for all other diseases; false positives are offset in black and true negatives are in yellow. The false positive rate is bracketed and in italics. Examples of false positives are annotated in blue italics along with the number of chemicals found in the prediction list corresponding to that disease and the percent found among all curated relations for that disease. We computed this validation enrichment for A.) prostate cancer, B.) lung cancer from nonsmokers, and C.) non-tumorigenic breast cancers.
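The threshold sweep behind each panel can be sketched as follows. This is an illustration under assumptions rather than the published procedure: it treats "ranked within α" as a prediction p-value at or below α, uses a one-sided hypergeometric test for the enrichment of curated chemicals, and takes the set of all tested chemicals as the background. The function names, dictionary keys, and toy values are hypothetical.

```python
from math import comb, log10

THRESHOLDS = (1e-4, 1e-3, 1e-2, 0.05)  # alpha values listed in the caption


def upper_tail_p(hits, n_curated, n_selected, n_total):
    """P(X >= hits) when drawing n_selected chemicals from n_total."""
    tail = sum(
        comb(n_curated, k) * comb(n_total - n_curated, n_selected - k)
        for k in range(hits, min(n_curated, n_selected) + 1)
    )
    return tail / comb(n_total, n_selected)


def validation_curve(prediction_p, curated, thresholds=THRESHOLDS):
    """For each alpha, test whether selected chemicals are enriched for curated ones."""
    all_chemicals = set(prediction_p)
    curated = set(curated) & all_chemicals
    rows = []
    for alpha in thresholds:
        selected = {c for c, p in prediction_p.items() if p <= alpha}
        hits = len(selected & curated)
        p_enrich = upper_tail_p(
            hits, len(curated), len(selected), len(all_chemicals)
        )
        rows.append({
            "neg_log10_alpha": -log10(alpha),   # x-axis value
            "n_selected": len(selected),        # count shown in parentheses
            "n_curated_hits": hits,             # number under the marker
            "neg_log10_p": -log10(p_enrich),    # y-axis value
        })
    return rows


# Toy prediction p-values for six hypothetical chemicals; c1 and c2 are "curated".
toy_predictions = {
    "c1": 1e-5, "c2": 5e-4, "c3": 0.02, "c4": 0.5, "c5": 0.9, "c6": 0.03,
}
rows = validation_curve(toy_predictions, curated={"c1", "c2"})
```

Repeating the same call with the curated chemical set of every other disease would give the comparison points from which false positives are counted.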

**Figure 5**
**Chemical predictions for Prostate, Lung, and Breast Cancer datasets clustered by PubChem BioActivity**. Highly significant chemical predictions (p = 0.001, 0.001, and 0.01 for the prostate, lung, and breast cancer datasets, respectively) are reordered by their BioActivity similarity computed by PubChem. A column represents the cancer analyzed and each cell corresponds to the chemical-gene set association -log10(p-value). Examples of correlation between BioActivity similarity and chemical-gene set significance include the sodium arsenite, sodium arsenate, and Doxorubicin cluster (labeled in orange), and the Genistein, Estradiol, Ethinyl Estradiol, and Diethylstilbestrol cluster and the Progesterone, Tretinoin, and Corticosterone cluster (both labeled in purple). Other examples of BioActivity similarity and chemical-gene set association include the chemicals vinclozolin, tert-Butylhydroperoxide, and Carbon Tetrachloride (outlined in blue).
