PLoS One. 2013 May 1;8(5):e58977.

doi: 10.1371/journal.pone.0058977. Print 2013.

Prediction and validation of gene-disease associations using methods inspired by social network analyses

U Martin Singh-Blom¹, Nagarajan Natarajan, Ambuj Tewari, John O Woods, Inderjit S Dhillon, Edward M Marcotte

PMID: 23650495
PMCID: PMC3641094
DOI: 10.1371/journal.pone.0058977

U Martin Singh-Blom et al. PLoS One. 2013.

Affiliation

¹ Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas, Austin, Texas, United States of America.

Erratum in

PLoS One. 2013;8(9). doi:10.1371/annotation/5aeb88a0-1630-4a07-bb49-32cb5d617af1

Abstract

Correctly identifying associations of genes with diseases has long been a goal in biology. With the emergence of large-scale gene-phenotype association datasets, we can leverage statistical and machine learning methods to help achieve this goal. In this paper, we present two methods for predicting gene-disease associations based on functional gene associations and gene-phenotype associations in model organisms. The first method, the Katz measure, is motivated by its success in social network link prediction, and is closely related to some of the recent methods proposed for gene-disease association inference. The second method, called Catapult (Combining dATa Across species using Positive-Unlabeled Learning Techniques), is a supervised machine learning method that uses a biased support vector machine whose features are derived from walks in a heterogeneous gene-trait network. We study the performance of the proposed methods and related state-of-the-art methods using two different evaluation strategies, on two distinct data sets, namely OMIM phenotypes and drug-target interactions. Finally, by comparing the methods under these two evaluation strategies, we show that although both methods perform very well, the Katz measure is better at identifying associations between traits and poorly studied genes, whereas Catapult is better suited to correctly identifying gene-trait associations overall [corrected].
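For orientation, the Katz measure as it is usually defined in the link-prediction literature can be written as below. The abstract itself does not state the formula, so the symbols are assumptions made here: $A$ is the adjacency matrix of the network, $\beta$ is a damping parameter that down-weights longer walks, and $l$ is the walk length. In practice the sum is often truncated at a small maximum length.

```latex
\mathrm{Katz}(i, j) \;=\; \sum_{l=1}^{\infty} \beta^{l} \, (A^{l})_{ij},
\qquad
\sum_{l=1}^{\infty} \beta^{l} A^{l} \;=\; (I - \beta A)^{-1} - I
\quad \text{for } 0 < \beta < 1 / \lVert A \rVert_{2} .
```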

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. The combined network in the neighborhood of a human disease.**
The local network around the human disease diabetes insipidus and two genes highly ranked by Catapult, *AQP1* (top-ranked candidate) and *MYBL2* (ranked number 40). *AQP1* is ranked higher than *MYBL2* because there are more paths from diabetes insipidus to *AQP1* than to *MYBL2*, both through model organism phenotypes and through the gene-gene network. Only genes and phenotypes that are associated with both diabetes insipidus and the predicted genes *AQP1* and *MYBL2* are shown.

**Figure 2. Katz features are derived by constructing walks of different kinds on the graph.**
In the figure, the disease node is connected to the gene node by one walk of length 2 (solid red line) and three walks of length 3 (dotted, dashed and dash-dotted red lines). These counts can be calculated quickly from the adjacency matrix $A$ of the graph: if $A_{ij} = 1$ when there is a link between nodes $i$ and $j$, and $A_{ij} = 0$ otherwise, the number of walks of length $n$ between nodes $i$ and $j$ is $(A^n)_{ij}$. In the example, with $i$ the disease node and $j$ the gene node, $(A^2)_{ij} = 1$ and $(A^3)_{ij} = 3$.
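The walk counts in this caption can be checked with a small sketch. The graph below is a hypothetical toy network, not the one drawn in Figure 2; it is built only so that the disease and gene nodes are joined by one walk of length 2 and three walks of length 3. The node numbering and the helper names `mat_mul` and `walk_counts` are assumptions made for illustration.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def walk_counts(adj, length):
    """Return adj raised to `length`; entry [i][j] counts walks of that length."""
    result = adj
    for _ in range(length - 1):
        result = mat_mul(result, adj)
    return result


# Hypothetical toy graph: node 0 is the disease, node 1 is the gene,
# nodes 2-6 stand in for intermediate genes or phenotypes.
DISEASE, GENE = 0, 1
edges = [(0, 2), (1, 2),          # one shared neighbour -> one walk of length 2
         (0, 3), (0, 4),          # other neighbours of the disease
         (1, 5), (1, 6),          # other neighbours of the gene
         (3, 5), (3, 6), (4, 6)]  # three bridging links -> three walks of length 3

n_nodes = 7
adj = [[0] * n_nodes for _ in range(n_nodes)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1     # undirected graph, symmetric adjacency matrix

walks2 = walk_counts(adj, 2)[DISEASE][GENE]
walks3 = walk_counts(adj, 3)[DISEASE][GENE]
print(walks2, walks3)  # 1 3
```

Plain nested lists keep the sketch dependency-free; for networks of realistic size a sparse matrix library would be the more practical choice.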

**Figure 3. Empirical cumulative distribution function for the rank of the withheld gene under cross-validation.**
The left panel corresponds to the evaluation on OMIM phenotypes, and the right panel to the drug data. The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use information from all species and the **HumanNet** gene network. PRINCE and RWRH are implemented as proposed in their respective original publications, using the **HPRD** gene network. ProDiGe is implemented as discussed in the Methods section. Catapult (solid red) performs much better across the data sets under this evaluation scheme. In general, the methods achieve high precision on the drug data. PRINCE, which does not allow walks through species phenotypes, and through OMIM phenotypes in particular, performs much worse than the other random-walk based methods. ProDiGe shares information between phenotypes using the similarities between OMIM phenotypes and performs reasonably well on that data set, whereas no such sharing is possible for the drug data because drug similarities are not available. The simple degree-based method performs poorly in general. ProDiGe and PRINCE essentially use only the gene network information in the case of the drug data.
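The quantity on the vertical axis can be sketched as follows: given the rank that each withheld gene received, the empirical CDF at $k$ is the fraction of withheld genes ranked within the top $k$. The function name `rank_cdf` and the ranks below are made up for illustration and are not results from the paper.

```python
def rank_cdf(ranks, max_k):
    """Empirical CDF of ranks: fraction of withheld genes with rank <= k, for k = 1..max_k."""
    total = len(ranks)
    return [sum(1 for r in ranks if r <= k) / total for k in range(1, max_k + 1)]


# Hypothetical ranks assigned to five withheld genes (1 = top prediction).
ranks = [1, 3, 3, 10, 250]
cdf = rank_cdf(ranks, max_k=10)

print(cdf[0])  # fraction recovered at k = 1
print(cdf[9])  # fraction recovered at k = 10
```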

**Figure 4. Comparison only using HumanNet.**
Empirical cumulative distribution function for the rank of the withheld gene under cross-validation. The left panel corresponds to the evaluation on OMIM phenotypes, and the right panel to the drug data. The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use information from all species, and all the methods use the **HumanNet** gene network. PRINCE and RWRH are implemented as proposed in their respective original publications, but using **HumanNet**. ProDiGe is implemented as discussed in the Methods section. As in Figure 3, Catapult (solid red) performs best. Notably, PRINCE and RWRH perform much better here than in Figure 3, where the HPRD network was used. (ProDiGe, Katz and Catapult are unchanged; their settings are identical to those in Figure 3.)

**Figure 5. Precision-Recall curves for three-fold cross validation.**
The left panel corresponds to the evaluation on OMIM phenotypes, and the right panel to the drug data. The vertical axis shows the precision, i.e. the fraction of true positives in the top-$k$ predictions. The horizontal axis shows the recall, i.e. the ratio of true positives recovered in the top-$k$ predictions to the total number of positives for a phenotype (or a drug) in the hidden set. The plots show precision-recall values at various thresholds $k$, and the value at a given $k$ is averaged over all the phenotypes (drugs). The plots use the same experimental setup as in Figure 4, and the comparisons given by the precision-recall measure are consistent with those given by the rank CDF measure in Figure 4.
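The precision and recall definitions in this caption can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the phenotype labels and the gene identifiers are hypothetical, and each phenotype is assumed to come with a ranked list of predicted genes and a set of hidden true genes.

```python
def precision_recall_at_k(ranked_genes, hidden_positives, k):
    """Precision and recall of the top-k predictions for one phenotype (or drug)."""
    hits = sum(1 for gene in ranked_genes[:k] if gene in hidden_positives)
    precision = hits / k                    # fraction of the top k that are true positives
    recall = hits / len(hidden_positives)   # fraction of hidden positives recovered
    return precision, recall


def averaged_curve(predictions, hidden, thresholds):
    """Average precision and recall over all phenotypes at each threshold k."""
    curve = []
    for k in thresholds:
        pairs = [precision_recall_at_k(predictions[p], hidden[p], k)
                 for p in predictions]
        avg_precision = sum(p for p, _ in pairs) / len(pairs)
        avg_recall = sum(r for _, r in pairs) / len(pairs)
        curve.append((k, avg_precision, avg_recall))
    return curve


# Hypothetical ranked predictions and hidden gene sets for two phenotypes.
predictions = {"phenoA": ["g1", "g2", "g3", "g4"],
               "phenoB": ["g5", "g6", "g7", "g8"]}
hidden = {"phenoA": {"g1", "g4"},
          "phenoB": {"g6"}}

curve = averaged_curve(predictions, hidden, thresholds=[2, 4])
for k, avg_precision, avg_recall in curve:
    print(k, avg_precision, avg_recall)
```

Averaging at a fixed $k$ across phenotypes, as the caption describes, gives one precision-recall point per threshold rather than one curve per phenotype.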

**Figure 6. Empirical cumulative distribution function for the rank of withheld singleton genes.**
The left panel corresponds to the evaluation on OMIM phenotypes, and the right panel to the drug data. The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use information from all species, and all the methods use the **HumanNet** gene network. PRINCE and RWRH are implemented as proposed in their respective original publications, but using the **HumanNet** gene network. ProDiGe is implemented as discussed in the Methods section. The degree-based list from Figure 4 is not included, since all the singleton genes are always given degree 0 during cross-validation. Catapult (solid red) performs much better than ProDiGe (the only other supervised method), but fares worse relative to the walk-based methods than it does in Figure 4 (which uses the same settings for all the methods). PRINCE and ProDiGe perform consistently with (and sometimes slightly better than) their results in the full cross-validation evaluation. RWRH and the Katz measure perform better than the supervised learning methods ProDiGe and Catapult in this evaluation scheme. That PRINCE performs so well on singletons in the drug data is surprising, given that the only information it uses is the HumanNet gene network.

**Figure 7. Empirical cumulative distribution function for the rank of withheld genes from OMIM phenotypes, restricted to genes in a small linkage neighborhood of the withheld genes.**
The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use information from all species, and all the methods use the **HumanNet** gene network. PRINCE and RWRH are implemented as proposed in their respective original publications, but using the **HumanNet** gene network. ProDiGe is implemented as discussed in the Methods section. Catapult performs best, and the RWRH and Katz methods are competitive as well. The results are consistent with our observations from Figure 4.

**Figure 8. Empirical cumulative distribution function for the rank of withheld genes from OMIM phenotypes with single known gene (left panel) and more than one known gene (right panel).**
The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use information from all species. PRINCE and RWRH are implemented as proposed in their respective original publications, using the **HPRD** network. ProDiGe is implemented as discussed in the Methods section. For phenotypes with only one known gene (left panel), the only information available is the phenotype-phenotype similarity, and all network-based methods perform poorly. Nonetheless, there is a gradation in the performance of the different methods, with Catapult doing slightly better. All the methods do substantially better on phenotypes with more than one known gene (right panel).

**Figure 9. Distribution of the number of known genes in OMIM diseases (left) and drugs (right).**
The bar corresponding to the genes on which we did the singleton validation is shown in yellow.

References

1. Goh K, Cusick M, Valle D, Childs B, Vidal M, et al. (2007) The human disease network. Proceedings of the National Academy of Sciences 104: 8685.
1. Tian W, Zhang LV, Taan M, Gibbons FD, King OD, et al. (2008) Combining guilt-by-association and guilt-by-profiling to predict Saccharomyces cerevisiae gene function. Genome Biology 9 Suppl 1: S7.
1. Ulitsky I, Shamir R (2007) Identification of functional modules using network topology and high-throughput data. BMC Systems Biology 1: 8.
1. Human Protein Reference Database, HPRD. Available: http://www.hprd.org. Accessed: 2012 Aug.
1. Wu X, Jiang R, Zhang MQ, Li S (2008) Network-based global inference of human disease genes. Mol Syst Biol 4: 189.
