Comparative Study
PLoS Biol. 2005 May;3(5):e134.
doi: 10.1371/journal.pbio.0030134. Epub 2005 Apr 5.

Systematic association of genes to phenotypes by genome and literature mining

Jan O Korbel et al. PLoS Biol. 2005 May.

Abstract

One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far no global analysis has attempted to explore those connections in the light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes we retrieve 323 clusters containing a total of 2,700 significant gene-phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases.

Figures

Figure 1. A Systematic and Unbiased Approach Combines Literature Mining and Comparative Genome Analysis to Associate Genes and Phenotypes
Words likely to describe phenotypic characteristics, that is, those preferentially co-occurring with certain species, are retrieved from MEDLINE abstracts. Phyletic distributions of genes are obtained using OGs from STRING [8]. As an example, phyletic distributions across bacteria for selected words and genes are shown in (I): we show species–word association scores for the words “flagellum”, “flagellin”, and “sewage”, as well as presence/absence patterns of the selected genes fliR (COG1684) and fliQ (COG1987). Species–word association scores greater than 0 indicate that a word is likely to describe a trait of the species (colours indicate that green = true positive, i.e., the flagellar phenotype was correctly inferred; yellow = false negative; red = false positive). (I′) Black and grey bars indicate OG presence in a species (tree shown in [I′′]), while grey bars indicate presumably inactive genes [34,35]. To identify informative phyletic distributions of traits and OGs, both species–word association and species–OG occurrence vectors are transformed using PCA. The similarity of the resulting transformed and normalized word and OG vectors (i.e., the word–OG association score) is computed from their inner vector products. A “heat map” (II) shows the distribution of word–OG association scores for the more than 300 words (y-axis) and over 500 OGs (x-axis) that reveal at least one significant, high-confidence association. Dendrograms are constructed by means linkage analysis, independently applying the inner products of transformed and normalized word and OG vectors as similarities. Clusters of associated words and OGs include many previously known trait–gene relationships. For example, terms mainly related to flagellar motility form a cluster with 29 OGs known to be involved in movement; see (III). Abbreviations: Flagellum, flagellar function; Ct, involved in chemotaxis.
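The scoring step described in this caption (PCA transformation of the species–word and species–OG vectors, normalization, then inner products) can be illustrated with a minimal sketch. The caption does not give the number of components or say how the principal axes are fitted, so the choices below are assumptions: the axes are fitted on the stacked word and OG vectors so that both share one space, and two components are kept. The function name `association_scores` and the toy data are hypothetical.

```python
import numpy as np

def association_scores(word_species, og_species, n_components=2):
    """Word-OG association scores as inner products of PCA-transformed,
    unit-normalized vectors (rows = words or OGs, columns = species).

    Assumption: principal axes are fitted on the stacked word and OG rows;
    the source does not specify this detail.
    """
    word_species = np.asarray(word_species, dtype=float)
    og_species = np.asarray(og_species, dtype=float)

    stacked = np.vstack([word_species, og_species])
    mean = stacked.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes in species space.
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    axes = vt[:n_components].T  # shape: (n_species, n_components)

    def transform(matrix):
        projected = (matrix - mean) @ axes
        norms = np.linalg.norm(projected, axis=1, keepdims=True)
        # Avoid division by zero for vectors that project onto the origin.
        return projected / np.where(norms == 0, 1.0, norms)

    return transform(word_species) @ transform(og_species).T

# Toy data for six hypothetical species (values invented for illustration).
words = [
    [1.2, 0.8, -0.5, -0.9, 1.0, -0.7],   # word tracking a motility-like trait
    [0.1, -0.2, 0.3, -0.1, 0.2, -0.3],   # word with no clear pattern
]
ogs = [
    [1, 1, 0, 0, 1, 0],   # OG present where the first word scores high
    [0, 0, 1, 1, 0, 1],   # OG with the opposite presence pattern
]
scores = association_scores(words, ogs)
print(scores.round(2))
```

Because both sets of vectors are normalized after projection, each score is a cosine similarity in the reduced space and lies between -1 and 1. In the toy data the first word scores higher with the first OG than with the second.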
Figure 2. Assessment of Prediction Quality
The figure demonstrates cumulative fractions of predicted OG–word associations that agree with previously known word–gene relationships (as extracted from MEDLINE). Independently confirmed predictions are enriched for high word–OG association scores.
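One way to read the assessment in Figure 2 is as a ranked evaluation: predictions are sorted by association score, and the fraction agreeing with known relationships is tracked cumulatively. The sketch below assumes that reading. The function name `cumulative_agreement` and the example values are hypothetical and are not taken from the paper.

```python
def cumulative_agreement(scores, confirmed):
    """Return, for each rank k, the fraction of the top-k predictions
    (sorted by descending score) that are independently confirmed."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fractions = []
    hits = 0
    for rank, index in enumerate(order, start=1):
        hits += 1 if confirmed[index] else 0
        fractions.append(hits / rank)
    return fractions

# Invented example: the two highest-scoring predictions are confirmed.
example_scores = [0.9, 0.3, 0.8, 0.5]
example_confirmed = [True, False, True, False]
fractions = cumulative_agreement(example_scores, example_confirmed)
print(fractions)
```

If confirmed predictions are enriched among high scores, as the caption reports, the curve starts high and declines as lower-scoring predictions are included.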
Figure 3. Associations between Trait-Descriptive Words and OGs for Two Illustrative Clusters
“Heat maps” display word–OG association scores (scores greater than 0 are indicated; negative values are set to 0). We considered all words and OGs contributing to the respective cluster with at least one high-confidence association. Protein interaction networks, shown below, were derived from genomic context analysis (see Materials and Methods). (A) Traits and genes related to plant constituent degradation. Functional descriptions are: Plant-degr., involved in plant constituent degradation; Ox, putative oxidoreductases; Arg, Arginine degradation protein/predicted deacylase; UV, UV damage repair endonuclease; those with no description are uncharacterized. Terms related to sporulation reflect a domination of exo- and endospore-forming species from different genera (e.g., Streptomyces, Bacillus, and Clostridium) in these degradation processes. (B) Traits and genes related to food spoilage and poisoning. Some proteins have previously been implicated in virulence of food pathogens such as ManR (“T”), a transcriptional antiterminator involved in resistance to natural food preservatives, and some propanediol degradation proteins (“Prop-diol”). We suggest the involvement of additional proteins in pathogenicity: for example, ethanolamine degradation proteins (“Eth.-amine-usage”; the phospholipid phosphatidyl-ethanolamine, cleaved to ethanolamine by phospholipase, is abundant in the gut [14]); the cobalt chelatase CbiK (“C”; cobalt is an essential factor for propanediol and ethanolamine utilization [14]); a phosphotransferase system (“PTS”) involved in sorbitol transport [36] (sorbitol is an artificial food sweetener naturally found in fruits and may act as an additional carbon source; we suggest that alternatively the chemically similar inositol, cleavage product of another abundant phospholipid, may be utilized). 
Other proteins that may also be involved are a presumably anaerobically used butyrate kinase (“B”), gamma-glutamylcysteine synthetase (“G”), an electron transport complex protein (“O”), a predicted metal-binding enzyme (“E”), and several uncharacterized proteins (no description).
Figure 4. Phyletic Distributions across Bacteria of Genes and Associated Representative Trait-Descriptive Words Related to Food, Food Spoilage, and Food Poisoning (Cluster 1)
The complete figure, including phylogenetic distributions of all trait-descriptive words and OGs in Cluster 1, is available online as Figure S1. Black squares indicate gene occurrences across species for the respective OGs. Blue squares indicate predicted associations between trait-descriptive words and species (species–word association scores greater than 0). Function descriptions (grey bar) are the same as in Figure 3.

References

    1. Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, et al. Predicting function: From genes to genomes and back. J Mol Biol. 1998;283:707–725.
    2. Huynen M, Dandekar T, Bork P. Differential genome analysis applied to the species-specific features of Helicobacter pylori. FEBS Lett. 1998;426:1–5.
    3. Makarova KS, Wolf YI, Koonin EV. Potential genomic determinants of hyperthermophily. Trends Genet. 2003;19:172–176.
    4. Jim K, Parmar K, Singh M, Tavazoie S. A cross-genomic approach for systematic mapping of phenotypic traits to genes. Genome Res. 2004;14:109–115.
    5. Levesque M, Shasha D, Kim W, Surette MG, Benfey PN. Trait-to-gene: A computational method for predicting the function of uncharacterized genes. Curr Biol. 2003;13:129–133.