G2D: a tool for mining genes associated with disease

Carolina Perez-Iratxeta¹, Matthias Wjst, Peer Bork, Miguel A Andrade

PMID: 16115313
PMCID: PMC1208881
DOI: 10.1186/1471-2156-6-45

BMC Genet. 2005 Aug 22:6:45.

Affiliation

¹ Ontario Genomics Innovation Centre, Ottawa Health Research Institute, ON K1H 8L6, Ottawa, Canada. cperez-iratxeta@ohri.ca

Abstract

Background: Human inherited diseases can be associated by genetic linkage with one or more genomic regions. The availability of the complete sequence of the human genome allows examining those locations for an associated gene. We previously developed an algorithm to prioritize genes on a chromosomal region according to their possible relation to an inherited disease using a combination of data mining on biomedical databases and gene sequence analysis.

Results: We have implemented this method as a web application on our site G2D (Genes to Diseases). It allows users to inspect any region of the human genome to find candidate genes related to a genetic disease of interest. In addition, the G2D server includes pre-computed analyses of candidate genes for 552 linked monogenic diseases without an associated gene, and the analysis of 18 asthma loci.

Conclusion: G2D can be publicly accessed at http://www.ogic.ca/projects/g2d_2/.

Figures

**Figure 1**
**The G2D algorithm**. The cylinders represent public databases. **MEDLINE** contains references to scientific literature annotated at the National Library of Medicine with terms from the MeSH ontology. For each disease being studied, we take as its keywords the MeSH C terms ('Diseases Category') from the publications linked from its **OMIM** entry [3]. For each gene, we take as its keywords the Gene Ontology (GO) terms [8] associated with its product in the **RefSeq** protein database [34]. MEDLINE does not contain enough clinical literature to allow us to directly relate every symptom, represented by a MeSH C term, to every gene feature, represented by a GO term. Because genes relate to phenotypes by means of molecules, we can increase the robustness of the gene/phenotype relations by adding an intermediate association step through the MeSH D category of 'Chemicals & Drugs' (top). Accordingly, we first compute associations between MeSH C terms ('Diseases') and MeSH D terms ('Chemicals & Drugs') from their co-annotation on the same record, looking specifically for dependencies of MeSH D terms on MeSH C terms. For example, we would deduce a relation between "Alzheimer's disease" (MeSH C) and "Amyloid protein" (MeSH D) if the presence of the C term in a MEDLINE entry always implies the presence of the D term. Records in the RefSeq database contain GO annotations that describe the protein function, and will often include a link to MEDLINE, mostly to papers dealing with the experimental characterization of the protein. We use these links to relate MeSH D terms from the MEDLINE reference to GO terms from the sequence, again looking for the dependence of a GO term on a MeSH D term. In this case we could deduce an association between the MeSH D term "Amyloid Protein" and the GO term "Amyloid Protein". Finally, we combine both sets of relations to obtain associations between MeSH C terms and GO terms (for example, the relation of Alzheimer's disease to the amyloid protein).
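
The association chain described above (MeSH C to MeSH D, MeSH D to GO, then the combined MeSH C to GO relation) can be illustrated with a minimal sketch. This is not the G2D implementation: it assumes that "dependence" can be approximated by the fraction of records containing the source term that also contain the target term, and that the two relations are combined by keeping the strongest product over intermediate MeSH D terms. The function names, the record layout, and the toy records are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def dependence(records):
    """Map (a, b) to the fraction of records containing source term a
    that also contain target term b.

    records: iterable of (source_terms, target_terms) pairs.
    Assumption: this conditional fraction stands in for the paper's
    notion of dependence, which the caption does not define precisely.
    """
    source_count = defaultdict(int)
    pair_count = defaultdict(int)
    for source_terms, target_terms in records:
        for a in set(source_terms):
            source_count[a] += 1
        for a, b in product(set(source_terms), set(target_terms)):
            pair_count[(a, b)] += 1
    return {(a, b): n / source_count[a] for (a, b), n in pair_count.items()}


def compose(c_to_d, d_to_go):
    """Chain C->D and D->GO scores through shared MeSH D terms.

    Assumption: the combined score is the maximum, over intermediate
    D terms, of the product of the two scores.
    """
    c_to_go = defaultdict(float)
    for (c, d), s1 in c_to_d.items():
        for (d2, go), s2 in d_to_go.items():
            if d == d2:
                c_to_go[(c, go)] = max(c_to_go[(c, go)], s1 * s2)
    return dict(c_to_go)


# Invented toy records: (MeSH C terms, MeSH D terms) per MEDLINE entry.
medline = [
    ({"Alzheimer's disease"}, {"Amyloid protein"}),
    ({"Alzheimer's disease"}, {"Amyloid protein", "Peptides"}),
    ({"Asthma"}, {"Peptides"}),
]
# Invented toy records: (MeSH D terms of the linked reference, GO terms of the protein).
refseq_links = [
    ({"Amyloid protein"}, {"amyloid protein"}),
    ({"Peptides"}, {"amyloid protein"}),
    ({"Peptides"}, {"peptide binding"}),
]

c_to_d = dependence(medline)
d_to_go = dependence(refseq_links)
c_to_go = compose(c_to_d, d_to_go)
```

In this toy data every "Alzheimer's disease" record also carries "Amyloid protein", so that C to D dependence is 1.0, which matches the "always implies" case in the caption's example.
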
To evaluate the genes associated with a particular disease we follow two directions. First, we deduce the gene functions (GO terms) related to the disease using the associations from the phenotypes (MeSH C terms) that describe the disease. For this, we collect the MeSH C terms found in the MEDLINE references from its corresponding OMIM entry (left), score all GO terms according to their relation to the terms in the MeSH C list (top), and finally score all the proteins in RefSeq with the average of the scores of their GO terms (right). For example, the analysis of late-onset familial Alzheimer disease (LOFAD) [9] would start by characterizing the disease with the MeSH C term "Alzheimer's Disease", among others. This would point to a series of GO terms, including "Amyloid Protein", as likely related functions. One of the most related sequences in RefSeq (according to its GO annotations) would be the human amyloid beta A4 precursor protein-binding, which is annotated with the GO term "amyloid protein". The other component of the analysis is a BLAST homology search [35] of the human genome region where the disease is mapped against the sequences stored in the RefSeq database (bottom). All hits in the region (red block) with an E-value below a cut-off of 10e-10 are registered and sorted according to the score of the RefSeq protein they hit. Following our example, the analysis of the region where LOFAD was mapped would show a gene similar to the human amyloid beta A4 precursor protein-binding annotated with the GO term "amyloid protein": the APBA3 gene, which interacts with the Alzheimer's beta-amyloid precursor protein [12]. The analysis of LOFAD is extensively described in the Results section. Further details of the method are given in [2] and on the G2D web site.
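
The scoring and ranking steps in this part of the caption can be sketched as follows. This is a hedged illustration, not the G2D code. The caption does not say how a GO term's score is derived from several MeSH C terms, so the sketch assumes the maximum association is used. It also reads the caption's "10e-10" cut-off as 10^-10, which is an assumption. The protein identifiers, the hit dictionaries, and all numeric values are invented.

```python
from statistics import mean


def score_go_terms(disease_terms, c_to_go):
    """Score each GO term by its strongest association with any of the
    disease's MeSH C terms (assumption: max is the aggregation rule)."""
    scores = {}
    for (c, go), s in c_to_go.items():
        if c in disease_terms:
            scores[go] = max(scores.get(go, 0.0), s)
    return scores


def score_proteins(protein_go, go_scores):
    """Score each protein with the average score of its GO terms.
    GO terms without an association count as 0.0 (assumption)."""
    return {
        protein: mean(go_scores.get(go, 0.0) for go in terms)
        for protein, terms in protein_go.items()
        if terms
    }


def rank_hits(hits, protein_scores, e_value_cutoff=1e-10):
    """Keep hits whose E-value is below the cut-off and sort them by the
    score of the protein they hit, best first."""
    kept = [h for h in hits if h["e_value"] < e_value_cutoff]
    return sorted(
        kept,
        key=lambda h: protein_scores.get(h["protein"], 0.0),
        reverse=True,
    )


# Invented toy inputs.
c_to_go = {
    ("Alzheimer's disease", "amyloid protein"): 1.0,
    ("Alzheimer's disease", "peptide binding"): 0.25,
}
protein_go = {
    "P_A": ["amyloid protein"],
    "P_B": ["amyloid protein", "peptide binding"],
    "P_C": ["peptide binding"],
}
hits = [
    {"protein": "P_C", "start": 100, "e_value": 1e-30},
    {"protein": "P_A", "start": 500, "e_value": 1e-20},
    {"protein": "P_B", "start": 900, "e_value": 1e-5},  # fails the cut-off
]

go_scores = score_go_terms({"Alzheimer's disease"}, c_to_go)
protein_scores = score_proteins(protein_go, go_scores)
ranked = rank_hits(hits, protein_scores)
```

Keeping the cut-off as a parameter makes it easy to rerun the ranking under the other possible reading of the caption's threshold.
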

**Figure 2**
**Example of analysis of a monogenic disease**. (a) The data defining the phenotype of the disease (in this case the OMIM identifier of an equivalent disease) and the region where it was mapped are given in the COMBO box. (b) The results window displays the MeSH C terms derived from the links to MEDLINE found in the OMIM entry, and the resulting scores for the GO terms. The green arrows allow navigating the MeSH C/MeSH D/GO network of connections back and forth. (c) Further down in the results window, the list of candidates displays the position of the BLASTx hits [35] in the chromosomal region (dark green bar over the light green bar) and of the hits in the matching protein sequence (dark red bars over the light red bar). Each hit in the genome is linked to the UCSC Genome Browser ("U" link). (d) The UCSC Genome Browser allows examining the known or predicted genes that overlap with the match, and links to related databases and resources.

References

1. Dean M. Approaches to identify genes for complex human diseases: lessons from Mendelian disorders. Hum Mutat. 2003;22:261–274. doi: 10.1002/humu.10259.
2. Perez-Iratxeta C, Bork P, Andrade MA. Association of genes to genetically inherited diseases using data mining. Nat Genet. 2002;31:316–319.
3. Wheeler DL, Church DM, Edgar R, Federhen S, Helmberg W, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Suzek TO, Tatusova TA, Wagner L. Database resources of the National Center for Biotechnology Information: update. Nucleic Acids Res. 2004;32:D35–D40. doi: 10.1093/nar/gkh073.
4. MeSH. http://www.nlm.nih.gov/mesh/
5. van Driel MA, Cuelenaere K, Kemmeren PP, Leunissen JA, Brunner HG. A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur J Hum Genet. 2003;11:57–63. doi: 10.1038/sj.ejhg.5200918.
