Genome Biol. 2002 Sep 13;3(10):RESEARCH0055.

doi: 10.1186/gb-2002-3-10-research0055. Epub 2002 Sep 13.

Mining microarray expression data by literature profiling

Damien Chaussabel¹, Alan Sher

Affiliation

¹ Immunobiology Section, Laboratory of Parasitic Diseases, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA. dchaussabel@niaid.nih.gov

PMID: 12372143
PMCID: PMC134484
DOI: 10.1186/gb-2002-3-10-research0055

Damien Chaussabel et al. Genome Biol. 2002.

Abstract

Background: The rapidly expanding fields of genomics and proteomics have prompted the development of computational methods for managing, analyzing and visualizing expression data derived from microarray screening. Nevertheless, the lack of efficient techniques for assessing the biological implications of gene-expression data remains an important obstacle in exploiting this information.

Results: To address this need, we have developed a mining technique based on the analysis of literature profiles generated by extracting the frequencies of certain terms from thousands of abstracts stored in the Medline literature database. Terms are then filtered on the basis of both repetitive occurrence and co-occurrence among multiple gene entries. Finally, clustering analysis is performed on the retained frequency values, shaping a coherent picture of the functional relationship among large and heterogeneous lists of genes. Such data treatment also provides information on the nature and pertinence of the associations that were formed.
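The first two steps described above (extracting term frequencies, then filtering on repetitive occurrence and co-occurrence among genes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization, the 0.25 cutoff, the two-gene minimum and the toy abstracts are all assumptions made for the example.

```python
from collections import defaultdict

def term_occurrence(abstracts, vocabulary):
    """Fraction of abstracts that mention each vocabulary term at least once."""
    if not abstracts:
        return {term: 0.0 for term in vocabulary}
    counts = defaultdict(int)
    for text in abstracts:
        words = set(text.lower().split())  # naive tokenization (assumption)
        for term in vocabulary:
            if term in words:
                counts[term] += 1
    return {term: counts[term] / len(abstracts) for term in vocabulary}

def shared_terms(profiles, min_occurrence=0.25, min_genes=2):
    """Keep terms that are frequent for at least `min_genes` genes."""
    support = defaultdict(int)
    for profile in profiles.values():
        for term, value in profile.items():
            if value >= min_occurrence:
                support[term] += 1
    return sorted(term for term, n in support.items() if n >= min_genes)

# Toy, made-up abstracts; real input would be Medline abstracts per gene.
abstracts_by_gene = {
    "GENE_A": ["chemokine receptor binding", "chemokine secretion assay"],
    "GENE_B": ["chemokine gradient", "nuclear import"],
    "GENE_C": ["nuclear factor binding", "nuclear translocation"],
}
vocabulary = ["chemokine", "nuclear", "binding", "assay"]

profiles = {
    gene: term_occurrence(texts, vocabulary)
    for gene, texts in abstracts_by_gene.items()
}
retained = shared_terms(profiles)
print(retained)
```

In this toy run, "assay" is dropped because it is frequent for only one gene, while the other three terms are shared by two genes each and are retained for clustering.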

Conclusions: The analysis of patterns of term occurrence in abstracts constitutes a means of exploring the biological significance of large and heterogeneous lists of genes. This approach should contribute to optimizing the exploitation of microarray technologies by providing investigators with an interface between complex expression data and large literature resources.

Figures

**Figure 1**
Gene-specific and baseline term occurrences in the literature. The literature-mining technique we describe compares term occurrence in a collection of abstracts relating to a specific gene to their occurrence in an unbiased set of abstracts (baseline occurrence in the literature). In the example illustrated here, the occurrence values for terms present in more than 25% of the abstracts relating to the gene RANTES are plotted on the y-axis. To determine baseline occurrence, occurrence values found in the literature concerning this gene are then averaged with values found for an increasing number of genes chosen randomly from all known human genes indexed in the LocusLink database (x-axis). Terms with high occurrence values in the collection of abstracts relating to RANTES and a low baseline occurrence in the literature are plotted in green.
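The baseline comparison in this caption can be illustrated with a short sketch. It assumes that the baseline for a term is a running mean that starts with the gene's own occurrence value and adds one randomly chosen gene at a time, and that "higher than the baseline" means an absolute difference in occurrence. Both readings, along with the function names and the numbers, are assumptions for illustration only.

```python
def running_baseline(gene_value, random_values):
    """Running mean of a term's occurrence as random genes are added one by one."""
    total = gene_value
    averages = [gene_value]
    for n, value in enumerate(random_values, start=2):
        total += value
        averages.append(total / n)
    return averages

def gene_specific_terms(gene_profile, baseline, margin=0.25):
    """Terms whose occurrence for the gene exceeds the baseline by `margin`."""
    return sorted(
        term for term, value in gene_profile.items()
        if value - baseline.get(term, 0.0) >= margin
    )

# Made-up occurrence values for one term across three random genes.
curve = running_baseline(0.75, [0.25, 0.0, 0.0])

# Made-up profile for one gene and a made-up baseline for the same terms.
gene_profile = {"chemokine": 0.75, "cell": 0.9, "protein": 0.8}
baseline = {"chemokine": 0.05, "cell": 0.85, "protein": 0.7}
specific = gene_specific_terms(gene_profile, baseline)
print(curve[-1], specific)
```

Common words such as "cell" and "protein" have a high occurrence for the gene but also a high baseline, so only "chemokine" survives the comparison here.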

**Figure 2**
Analysis of patterns of term occurrence in abstracts. After filters have been applied to the original list, selected term-occurrence values relating to each gene are analyzed. Terms (columns) and genes (rows) were grouped on the basis of similarities between patterns of term occurrence in abstracts by hierarchical clustering. Some of the areas of the clustergram are shown in detail. Clusters are further referenced by color codes: blue, 'nuclear factors'; orange, 'receptor-ligand pair'; green, 'interferon-related'; red, 'chemokines'; violet, 'MHC class I antigen-presentation pathway'. Shades of yellow indicate different levels of term occurrence in abstracts.
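As a rough illustration of the grouping step, the sketch below clusters genes (rows) by their term-occurrence vectors. The caption only states that hierarchical clustering was used; the Euclidean distance, the average linkage, the function names and the toy values are assumptions chosen to keep the example self-contained.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length occurrence vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(vectors):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = {name: [name] for name in vectors}  # cluster id -> member genes
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
                d = sum(euclidean(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
    return merges

# Made-up occurrence values for two terms (columns) and three genes (rows).
vectors = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [0.9, 0.1],
    "GENE_C": [0.0, 1.0],
}
merges = average_linkage(vectors)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The merge order gives the tree: genes with similar term profiles join first at a small distance, and unrelated genes join last. The same procedure applied to the transposed matrix would group terms instead of genes.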

**Figure 3**
Annotated dendrogram obtained by clustering term-occurrence values relative to each gene. The corresponding clustergram is shown in Figure 2. Genes are arranged according to patterns of term occurrence. Distances between nodes of the tree diagram indicate the degree of association between genes or groups of genes. A subset of representative terms used in the analysis was chosen to annotate this list of genes. Shades of yellow indicate different levels of term occurrence in abstracts. Table 1 lists the gene abbreviations used.

**Figure 4**
The degree of association found among groups of genes by literature profiling correlates with their likelihood of being related. **(a)** The clustergram resulting from the analysis of the list of co-induced genes used to illustrate the mining technique is given for comparison. **(b)** A group of 50 genes was picked at random from all known human genes listed in the LocusLink database and their literature content was analyzed. **(c)** A group of 50 genes was picked at random from the list of known interleukins, chemokines and chemokine receptors and subjected to a similar analysis. The number of positive gene-term associations retained after filtering (term occurrence for a given gene higher than the baseline by 25%) is shown for each group. The numbers of shared terms for (a), (b) and (c) were 101, 49 and 116, respectively.

**Figure 5**
Conditions for the emergence of groups of related genes. **(a)** Groups of related genes found by clustering term-occurrence values. The color code is similar to the one used in Figure 2. **(b)** Grouping is conserved after gene names or terms making up gene names are removed from the analysis (for example, NFkappaB, RANTES, interferon, vascular, MIG). **(c)** Associations shown in (a) disappear when occurrence values are permuted for each of the genes, suggesting that associations made through the analysis of patterns of term occurrence do not arise by chance from a sufficiently high number of co-occurring terms.
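The permutation control in panel (c) can be sketched as follows, assuming that "permuted for each of the genes" means shuffling each gene's occurrence values across terms independently. The function name, the seed handling and the toy profiles are illustrative assumptions.

```python
import random

def permute_profiles(profiles, seed=0):
    """Shuffle each gene's occurrence values across terms, one gene at a time."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    permuted = {}
    for gene, profile in profiles.items():
        terms = sorted(profile)
        values = [profile[term] for term in terms]
        rng.shuffle(values)
        permuted[gene] = dict(zip(terms, values))
    return permuted

# Made-up term-occurrence profiles.
original = {
    "GENE_A": {"chemokine": 0.9, "nuclear": 0.0, "binding": 0.4},
    "GENE_B": {"chemokine": 0.8, "nuclear": 0.1, "binding": 0.3},
}
control = permute_profiles(original)
print(control)
```

Each gene keeps the same set of values, so the number of high-occurrence terms per gene is unchanged, but the link between a value and its term is broken. Clustering the permuted profiles then shows whether groups form from term counts alone.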

**Figure 6**
Profiling the bacteria-induced macrophage activation program. Literature profiles were generated for a list of nearly 200 genes constituting the 'common transcriptional program', induced in human macrophages upon bacterial infection ([12], see also Additional data files). The clustergram generated for the analysis of patterns of term occurrence is shown at top left. **(a-g)** Detailed views for groups of genes (columns) sharing a common vocabulary (rows). Groups of terms were selected on the basis of clustering hierarchy whereas the number of genes shown in the inserts is arbitrary. For gene abbreviations see Additional data files.

**Figure 7**
Profiling classic medulloblastomas. Literature profiles were generated for a list of 200 genes found to be differentially expressed by classic versus desmoplastic medulloblastomas in a study of central nervous system embryonal tumors recently published by Pomeroy *et al*. ([19] and see Additional data files). The clustergram generated for the analysis of patterns of term occurrence is shown at top left. **(a-i)** Detailed views for groups of genes (columns) found to share a common vocabulary (rows). Groups of terms were selected on the basis of clustering hierarchy, whereas the number of genes shown in the inserts is arbitrary. For gene abbreviations see Additional data files.

Cited by

The computational analysis of scientific literature to define and recognize gene expression clusters.
Raychaudhuri S, Chang JT, Imam F, Altman RB. Nucleic Acids Res. 2003 Aug 1;31(15):4553-60. doi: 10.1093/nar/gkg636. PMID: 12888516.
Gene expression in cortex and hippocampus during acute pneumococcal meningitis.
Coimbra RS, Voisin V, de Saizieu AB, Lindberg RL, Wittwer M, Leppert D, Leib SL. BMC Biol. 2006 Jun 2;4:15. doi: 10.1186/1741-7007-4-15. PMID: 16749930.
GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies.
Zhang B, Schmoyer D, Kirov S, Snoddy J. BMC Bioinformatics. 2004 Feb 18;5:16. doi: 10.1186/1471-2105-5-16. PMID: 14975175.
Martini: using literature keywords to compare gene sets.
Soldatos TG, O'Donoghue SI, Satagopam VP, Jensen LJ, Brown NP, Barbosa-Silva A, Schneider R. Nucleic Acids Res. 2010 Jan;38(1):26-38. doi: 10.1093/nar/gkp876. Epub 2009 Oct 25. PMID: 19858102.
A sentence sliding window approach to extract protein annotations from biomedical articles.
Krallinger M, Padron M, Valencia A. BMC Bioinformatics. 2005;6 Suppl 1:S19. doi: 10.1186/1471-2105-6-S1-S19. Epub 2005 May 24. PMID: 15960831.

References

1. Schulze A, Downward J. Navigating gene expression using microarrays - a technology review. Nat Cell Biol. 2001;3:E190–E195.
2. Schulze A, Downward J. Analysis of gene expression by microarrays: cell biologist's gold mine or minefield? J Cell Sci. 2000;113:4151–4156.
3. Masys DR, Welsh JB, Lynn Fink J, Gribskov M, Klacansky I, Corbeil J. Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics. 2001;17:319–326.
4. Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN. MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques. 1999;27:1210–1214.
5. Jenssen TK, Laegreid A, Komorowski J, Hovig E. A literature network of human genes for high-throughput analysis of gene expression. Nat Genet. 2001;28:21–28.
