Nucleic Acids Res. 2003 Aug 1;31(15):4553-60.
doi: 10.1093/nar/gkg636.

The computational analysis of scientific literature to define and recognize gene expression clusters

Soumya Raychaudhuri et al. Nucleic Acids Res. 2003.

Abstract

A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present a computational method that leverages the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in the analysis of gene expression data offers an opportunity to incorporate functional information about the genes when defining expression clusters. We have created a method that associates gene expression profiles with known biological functions. Our method has two steps. First, we apply hierarchical clustering to the given gene expression data set. Secondly, we use text from abstracts about genes to (i) resolve hierarchical cluster boundaries to optimize the functional coherence of the clusters and (ii) recognize those clusters that are most functionally coherent. In the case where a gene has not been investigated and therefore lacks primary literature, articles about well-studied homologous genes are added as references. We apply our method to two large gene expression data sets with different properties. The first contains measurements for a subset of well-studied Saccharomyces cerevisiae genes with multiple literature references, and the second contains newly discovered genes in Drosophila melanogaster; many have no literature references at all. In both cases, we are able to rapidly define and identify the biologically relevant gene expression profiles without manual intervention. In both cases, we identified novel clusters that were not noted by the original investigators.
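As a rough illustration of the second step, the sketch below picks disjoint clusters from a set of candidate tree nodes by repeatedly taking the highest-scoring node and skipping any node that shares genes with one already chosen. The greedy rule, the name `select_clusters`, the `min_score` cutoff, and the toy scores are all assumptions made for illustration; the abstract does not say how the boundaries are optimized, and the numbers merely stand in for a literature-based coherence score.

```python
def select_clusters(candidates, scores, min_score=0.0):
    """Greedily choose non-overlapping candidate clusters by descending score.

    candidates: dict mapping node id -> frozenset of gene names
    scores:     dict mapping node id -> coherence score (higher is better)
    """
    chosen = []
    used = set()  # genes already assigned to a chosen cluster
    for node in sorted(candidates, key=lambda n: scores[n], reverse=True):
        if scores[node] < min_score:
            break  # every remaining node scores lower still
        if candidates[node] & used:
            continue  # overlaps a chosen cluster (an ancestor or descendant)
        chosen.append(node)
        used |= candidates[node]
    return chosen


# Hypothetical nodes and scores, not taken from the paper.
candidates = {
    "n1": frozenset({"g1", "g2"}),
    "n2": frozenset({"g3", "g4"}),
    "n3": frozenset({"g1", "g2", "g3", "g4"}),
    "n4": frozenset({"g5", "g6"}),
    "n5": frozenset({"g1", "g2", "g3", "g4", "g5", "g6"}),
}
scores = {"n1": 9.0, "n2": 1.0, "n3": 4.0, "n4": 6.0, "n5": 2.0}

print(select_clusters(candidates, scores, min_score=3.0))  # ['n1', 'n4']
```

In a tree, two nodes share genes only when one is an ancestor of the other, so the overlap check is enough to keep the chosen clusters disjoint.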

Figures

Figure 1
Hierarchical clustering and cluster boundary definition. A schematic of hierarchically clustered expression data with subsequent cluster boundary definition. On the right are gene expression data represented as a colored grid. Each row in the grid represents the expression of a single gene across multiple conditions; each column represents the expression of each of the genes in a specific condition. Red squares indicate gene induction, while green squares indicate repression. On the left is a tree generated by a hierarchical clustering algorithm. The tree consists of nodes (dark boxes) that organize the genes according to expression similarity. All of the genes that descend from one node are the genes in the candidate cluster defined by that node. In this schematic, we illustrate a pruning of the tree into four disjoint biologically relevant gene clusters. Pruning the tree defines concrete clusters and their boundaries. After clustering the data, one must identify the biologically significant candidate clusters. Typically, careful expert examination of the genes in the clusters is required to identify the critical clusters in which the genes share function and to draw cluster boundaries that respect biological function. We assert that scientific literature can be mined automatically instead to identify biologically consistent clusters, and to draw cluster boundaries that respect biological function.
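The idea that every tree node defines a candidate cluster can be made concrete with a short sketch. It assumes the tree is given as nested tuples with gene names at the leaves; that representation, the function name `candidate_clusters`, and the gene labels are hypothetical and are not the format used by the authors.

```python
def candidate_clusters(tree):
    """Return the gene set under every internal node of a nested-tuple tree.

    A leaf is a gene name (str); an internal node is a tuple of subtrees.
    Clusters are returned in post-order, so children come before parents.
    """
    clusters = []

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        # The genes of a node are the union of the genes of its children.
        genes = frozenset().union(*(leaves(child) for child in node))
        clusters.append(genes)
        return genes

    leaves(tree)
    return clusters


# Hypothetical five-gene tree: ((g1, g2), g3) joined with (g4, g5).
tree = ((("g1", "g2"), "g3"), ("g4", "g5"))
for genes in candidate_clusters(tree):
    print(sorted(genes))
```

A binary tree over five genes has four internal nodes, so this prints four candidate clusters, ending with the root cluster that contains every gene.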
Figure 2
NDPG score correlates with cluster functional coherence. (A) After clustering the yeast gene expression data into 2466 nodes, we have plotted the literature-based NDPG score of the 1150 nodes containing 6–200 genes on the x-axis and the highest percentage concordance with a GO functional group on the y-axis. Black circles indicate the nodes selected by the computational method. (B) Similar to (A), except we have plotted the NDPG score and the highest percentage concordance with a GO functional group for the clusters containing threonine endopeptidase genes. The cluster selected by the algorithm is the black circle; other points represent nodes that are the ancestors and descendants of the selected node containing subsets or supersets of the genes in the selected node. (C) Similar plot for nodes containing heat shock genes. (D) Similar plot for nodes containing cytoplasmic ribosome genes.
Figure 3
Top 20 yeast gene clusters in order of literature-based functional coherence. To check if these clusters correspond to groups of genes with shared function, we correlate the clusters with yeast GO codes. On the left of the graphic, we list the literature-based NDPG score of each cluster and the number of genes within the cluster. On the right, we list the GO code that best corresponds to the cluster. The length of the green bar in the graphic is proportional to the number of genes in the cluster that are also assigned the GO function listed on the right. The length of the yellow bar is proportional to the number of genes in the cluster not assigned the corresponding function by GO. The length of the blue bar is proportional to the number of additional genes assigned the GO function that are not in the cluster. The longer the green bar, the better the cluster represents that specific function.
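The three bar lengths described in this caption reduce to simple set operations, sketched below. The function names, the gene and group labels, and the rule for choosing the best-matching group (shared genes divided by all genes in either set) are assumptions for illustration; the caption does not state how the best-corresponding GO code is selected.

```python
def overlap_bars(cluster, go_group):
    """Return (green, yellow, blue) counts for a cluster and a GO gene group."""
    cluster, go_group = set(cluster), set(go_group)
    green = len(cluster & go_group)   # in the cluster and assigned the function
    yellow = len(cluster - go_group)  # in the cluster, not assigned the function
    blue = len(go_group - cluster)    # assigned the function, not in the cluster
    return green, yellow, blue


def best_go_match(cluster, go_groups):
    """Pick the group with the largest shared fraction (assumed criterion)."""
    def shared_fraction(name):
        green, yellow, blue = overlap_bars(cluster, go_groups[name])
        total = green + yellow + blue
        return green / total if total else 0.0
    return max(go_groups, key=shared_fraction)


# Hypothetical cluster and GO groups, not data from the paper.
cluster = {"a", "b", "c", "d"}
go_groups = {
    "group_x": {"a", "b", "c", "e"},
    "group_y": {"d", "f", "g"},
}

print(overlap_bars(cluster, go_groups["group_x"]))  # (3, 1, 1)
print(best_go_match(cluster, go_groups))            # group_x
```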
Figure 4
Four examples of gene expression clusters from a fly development time course whose boundaries were defined with scientific literature. The gene expression conditions are annotated at the top with E (embryo), L (larvae), P (pupae), M (adult male) and F (adult female). On the right, genes are listed by FlyBase ID and name if available. On the far right, we have listed the appropriate GO code annotation for that gene if available. (A) Nucleolar maternal genes. This cluster had not been identified in the original publication. (B) Photoreceptor genes. We found two separate photoreceptor clusters, as did the authors of the original publication. (C) Citric acid cycle genes. Most of these genes have not yet been studied. Using sequence homology to obtain additional references made it feasible to identify this cluster of genes. A related but broader cluster was identified in the original publication. (D) Muscle-specific genes. A similar but broader cluster was identified in the original publication containing more unknown genes.
