Genome Res. 2002 Oct;12(10):1582-90.
doi: 10.1101/gr.116402.

Using text analysis to identify functionally coherent gene groups

Soumya Raychaudhuri et al. Genome Res. 2002 Oct.

Abstract

The analysis of large-scale genomic information (such as sequence data or expression patterns) frequently involves grouping genes on the basis of common experimental features. Often, as with gene expression clustering, there are too many groups to easily identify the functionally relevant ones. One valuable source of information about gene function is the published literature. We present a method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature. The method uses statistical natural language processing techniques to interpret biological text. It requires only a corpus of documents relevant to the genes being studied (e.g., all genes in an organism) and an index connecting the documents to appropriate genes. Given a group of genes, neighbor divergence assigns a numerical score indicating how "functionally coherent" the gene group is from the perspective of the published literature. We evaluate our method by testing its ability to distinguish 19 known functional gene groups from 1900 randomly assembled groups. Neighbor divergence achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods. We also apply neighbor divergence to previously published gene expression clusters to assess its ability to recognize gene groups that had been manually identified as representative of a common function.
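
The headline figure, sensitivity at 100% specificity, can be read as the fraction of known functional groups that score above every random group. With 19 functional groups, 79% would correspond to 15 of 19 (15/19 ≈ 0.789). The sketch below illustrates that calculation only; the function name and the scores are made up for illustration and are not taken from the paper.

```python
def sensitivity_at_full_specificity(functional_scores, random_scores):
    """Fraction of functional groups scoring strictly above every random group."""
    # The highest random score is the lowest cutoff that rejects all random
    # groups, i.e. the cutoff at which specificity is 100%.
    cutoff = max(random_scores)
    hits = sum(1 for s in functional_scores if s > cutoff)
    return hits / len(functional_scores)


# Hypothetical scores, for illustration only (not data from the paper).
functional = [0.90, 0.75, 0.40, 0.22, 0.05]
random_groups = [0.01, 0.03, 0.08, 0.16]

# Four of the five functional scores exceed the highest random score (0.16).
print(sensitivity_at_full_specificity(functional, random_groups))  # 0.8
```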

Figures

Figure 1
Scoring articles relative to gene groups. We graphically depict a small gene group of three autophagy genes (boxes with dotted boundaries). The genes are connected to their respective article references (boxes with solid boundaries). Articles about autophagy are dark boxes with white lettering. Notice that, for all genes, only a few of the referenced articles are about autophagy, the critical function that unites these genes in the group. The arrows are used to indicate the semantic neighbors of article B.2, an autophagy article. The significance of this article to the group's unifying function becomes apparent when we notice that many of its neighbors, also autophagy articles, are references for other genes in the same group.
Figure 2
Precision-recall plot for each of the functional coherence scoring methods. We used each method to score the functional coherence of the 19 functional gene groups and the 1900 random gene groups. We calculated and plotted precision and recall at cutoff scores of different stringency. There is a trade-off between precision and recall. More stringent cutoff values select fewer true functional groups, and recall (or sensitivity) is compromised; however, less stringent cutoff values cause many random groups to be selected inappropriately and precision is compromised. An ideal precision-recall plot achieves 100% precision for every value of recall. The neighbor divergence method is closest to the optimal curve.
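
The trade-off described in this caption can be sketched by sweeping a cutoff over two sets of scores. This is a minimal illustration, not the authors' evaluation code: the scores are invented, and treating precision as 1.0 when nothing is selected is a convention assumed here.

```python
def precision_recall(functional_scores, random_scores, cutoff):
    """Precision and recall when every group scoring >= cutoff is selected."""
    tp = sum(1 for s in functional_scores if s >= cutoff)  # true functional groups selected
    fp = sum(1 for s in random_scores if s >= cutoff)      # random groups selected
    selected = tp + fp
    # Assumed convention: precision is 1.0 when no group is selected.
    precision = tp / selected if selected else 1.0
    recall = tp / len(functional_scores)
    return precision, recall


# Hypothetical scores, for illustration only (not data from the paper).
functional = [0.9, 0.6, 0.3, 0.1]
random_groups = [0.05, 0.1, 0.2, 0.02, 0.01, 0.03]

# Lowering the cutoff raises recall and eventually lowers precision.
for cutoff in (0.5, 0.25, 0.1):
    print(cutoff, precision_recall(functional, random_groups, cutoff))
```

In this toy data the strictest cutoff keeps precision at 1.0 but recovers only half of the functional groups, while the loosest cutoff recovers all of them and admits two random groups.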
Figure 3
Histogram of neighbor divergence scores. Each open square represents the count of random gene group scores in the range indicated on the horizontal axis; each closed diamond represents the count of functional gene group scores in the range on the horizontal axis. There is little overlap between the two histograms. None of the random gene groups score above 0.16; most of the functional gene groups score well above 0.16.
Figure 4
Observed and expected distribution of article scores. (A) The bar graph in the figure represents the observed empirical distribution of article scores for the “signal transduction” gene group. The line on the figure is the Poisson distribution; it is the expected distribution of scores for a random gene group of the same size. (B) The ratio in log scale of observed (bars in Fig. 4A) to expected (line in Fig. 4A) distribution of article scores. The X-axis is drawn at a ratio of one, where observed is equal to expected. Because the gene group represents a well-defined biological function, the distributions are very different. High-scoring articles that discuss signal transduction and low-scoring articles that discuss functions besides signal transduction are overrepresented.
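
The comparison in panel B can be sketched as a log ratio of an empirical score distribution to a Poisson distribution. Several things here are assumptions rather than details from the caption: that article scores are non-negative integers, the Poisson mean `lam`, the base-10 logarithm, and the toy scores themselves.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Poisson probability of observing the integer count k with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def log_ratio_observed_expected(article_scores, lam):
    """log10(observed fraction / Poisson expectation) for each observed score."""
    counts = Counter(article_scores)
    n = len(article_scores)
    ratios = {}
    for k in sorted(counts):
        observed = counts[k] / n
        expected = poisson_pmf(k, lam)
        ratios[k] = math.log10(observed / expected)
    return ratios


# Hypothetical integer article scores, for illustration only.
scores = [0] * 70 + [1] * 10 + [2] * 5 + [5] * 10 + [8] * 5

# Positive values mean a score is overrepresented relative to the
# Poisson expectation; negative values mean it is underrepresented.
print(log_ratio_observed_expected(scores, lam=1.0))
```

A ratio of one (log ratio of zero) corresponds to the X-axis in panel B, where observed equals expected.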
Figure 5
Replacing functional genes with random genes reduces neighbor divergence scores gracefully. We replaced genes in two functional gene groups (“autophagy” and “ion homeostasis”) with random genes, and scores were recalculated for the corrupted groups. Each point represents 10 scores; error bars indicate the 95% confidence interval of the scores for that number of replaced genes. Neighbor divergence scores above 0.1 are very significant (see Fig. 3). Neighbor divergence scores remain significant despite replacement of about 38% (6 of 16 genes) of the “autophagy” genes and 60% (26 of 43 genes) of the “ion homeostasis” genes.
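
The corruption step in this experiment can be sketched as swapping a fixed number of group members for genes drawn from outside the group. This is a minimal sketch under stated assumptions: `corrupt_group`, the gene names, and the pool size are hypothetical, and the scoring of the corrupted group is not shown because the caption does not specify it.

```python
import random


def corrupt_group(group, gene_pool, n_replace, rng):
    """Return a copy of group with n_replace members swapped for outside genes."""
    members = set(group)
    outsiders = [g for g in gene_pool if g not in members]
    kept = rng.sample(list(group), len(group) - n_replace)  # members that survive
    added = rng.sample(outsiders, n_replace)                # random replacements
    return kept + added


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical 16-gene group and a pool of 100 unrelated genes.
group = [f"geneA{i}" for i in range(16)]
pool = group + [f"geneX{i}" for i in range(100)]

corrupted = corrupt_group(group, pool, 6, rng)
# Group size is preserved; 10 of the original 16 genes remain.
print(len(corrupted), sum(g in group for g in corrupted))  # 16 10

# The replaced fractions quoted in the caption.
print(6 / 16, round(26 / 43, 3))  # 0.375 0.605
```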
