BMC Bioinformatics. 2014 Sep 24;15(1):315.
doi: 10.1186/1471-2105-15-315.

A random set scoring model for prioritization of disease candidate genes using protein complexes and data-mining of GeneRIF, OMIM and PubMed records

Li Jiang et al. BMC Bioinformatics.

Abstract

Background: Prioritizing genetic variants is a challenge because disease susceptibility loci are often located in genes of unknown function or the relationship with the corresponding phenotype is unclear. A global data-mining exercise on the biomedical literature can establish the phenotypic profile of genes with respect to their connection to disease phenotypes. The importance of protein-protein interaction networks in the genetic heterogeneity of common diseases or complex traits is becoming increasingly recognized. Thus, the development of a network-based approach combined with phenotypic profiling would be useful for disease gene prioritization.

Results: We developed a random-set scoring model and implemented it to quantify phenotype relevance in a network-based disease gene-prioritization approach. We validated our approach based on different gene phenotypic profiles, which were generated from PubMed abstracts, OMIM, and GeneRIF records. We also investigated the validity of several vocabulary filters and different likelihood thresholds for predicted protein-protein interactions in terms of their effect on the network-based gene-prioritization approach, which relies on text-mining of the phenotype data. Our method demonstrated good precision and sensitivity compared with those of two alternative complex-based prioritization approaches. We then conducted a global ranking of all human genes according to their relevance to a range of human diseases. The resulting accurate ranking of known causal genes supported the reliability of our approach. Moreover, these data suggest many promising novel candidate genes for human disorders that have a complex mode of inheritance.
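
As a rough illustration of the random-set idea, the sketch below scores a gene set (for example, the members of one protein complex) by comparing its mean per-gene phenotype-relevance score with the mean expected for a set of the same size drawn at random, without replacement, from all scored genes. The abstract does not give the model's formula, so this standardized-mean form, the function name `random_set_z`, and the toy scores are assumptions for illustration, not the authors' implementation.

```python
import math


def random_set_z(scores, members):
    """Z-score of a gene set's mean score against a random same-size set.

    scores  : dict mapping every gene to a numeric relevance score (hypothetical)
    members : iterable of genes forming the set of interest
    """
    members = list(members)
    G = len(scores)   # number of scored genes
    m = len(members)  # size of the set being tested
    if not 0 < m < G:
        raise ValueError("set size must be between 1 and G - 1")

    values = list(scores.values())
    mu = sum(values) / G                               # population mean
    var_pop = sum((v - mu) ** 2 for v in values) / G   # population variance

    # Variance of the mean of m draws without replacement
    # (finite-population correction).
    var_mean = (var_pop / m) * (G - m) / (G - 1)
    if var_mean == 0:
        raise ValueError("all scores are identical; z-score is undefined")

    set_mean = sum(scores[g] for g in members) / m
    return (set_mean - mu) / math.sqrt(var_mean)


# Toy data: one gene with a strong phenotype signal among four genes.
toy_scores = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}
z_pair = random_set_z(toy_scores, ["A", "B"])
```

Under these assumptions, a larger z-score means the set's genes are more relevant to the phenotype than a random set of equal size would be, which is one plausible way to rank complexes and the candidate genes they contain.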

Conclusion: We have implemented and validated a network-based approach to prioritize genes for human diseases based on their phenotypic profile. We have devised a powerful and transparent tool to identify and rank candidate genes. Our global gene prioritization provides a unique resource for the biological interpretation of data from genome-wide association studies, and will help in the understanding of how the associated genetic variants influence disease or quantitative phenotypes.

Figures

Figure 1
Performance of the approach using different protein-protein interaction (PPI) confidence score thresholds. The influence of different PPI thresholds on the precision (red) and recall (black) is shown. Precision and recall (both on the y-axis) were determined for each PPI threshold (x-axis) at the point of maximal Matthews correlation coefficient (MCC).
Figure 2
Influence of protein-protein interaction (PPI) thresholds on the prioritization of causal genes in the test sets. The proportion (y-axis) of prioritized test sets in which the causal gene was ranked within the top five (black) or ranked first (red) is shown for different PPI confidence score thresholds (x-axis).
Figure 3
Receiver operating characteristic (ROC) curves of prioritizations using different phenotype sources and vocabulary filters. Each ROC curve represents the prioritization performance when combining a specific gene-associated phenotype with a vocabulary filter. The phenotype sources were OMIM (brown), PubMed (green), and GeneRIF (purple). The vocabulary filters were STY, MeSH, ICD9CM, and GO (colored from dark to light accordingly).
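
The evaluation quantities named in these captions can be sketched as follows: precision and recall read off at the score cutoff that maximizes the MCC, and the fraction of test sets whose causal gene ranks within the top k. The function names, the rule that a score at or above the cutoff counts as a positive call, and the toy inputs are assumptions for illustration; the captions do not specify how the authors computed these values.

```python
import math


def confusion(scores, labels, cutoff):
    """Count TP, FP, TN, FN when scores >= cutoff are called positive."""
    tp = fp = tn = fn = 0
    for score, is_causal in zip(scores, labels):
        predicted = score >= cutoff
        if predicted and is_causal:
            tp += 1
        elif predicted:
            fp += 1
        elif is_causal:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_operating_point(scores, labels):
    """Return (mcc, cutoff, precision, recall) at the MCC-maximizing cutoff."""
    best = None
    for cutoff in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, cutoff)
        value = mcc(tp, fp, tn, fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if best is None or value > best[0]:
            best = (value, cutoff, precision, recall)
    return best


def top_k_fraction(causal_ranks, k):
    """Fraction of test sets whose causal gene is ranked k or better."""
    return sum(rank <= k for rank in causal_ranks) / len(causal_ranks)


# Toy data: four candidate genes, the first two are causal.
toy_point = best_operating_point([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0])
# Toy ranks of the causal gene in four hypothetical test sets.
toy_top5 = top_k_fraction([1, 3, 7, 2], 5)
```

Repeating `best_operating_point` and `top_k_fraction` for each PPI confidence threshold would give one point per threshold, which is the kind of series Figures 1 and 2 plot.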
