PLoS One. 2013 May 1;8(5):e58977.
doi: 10.1371/journal.pone.0058977. Print 2013.

Prediction and validation of gene-disease associations using methods inspired by social network analyses

U Martin Singh-Blom et al. PLoS One. 2013.

Erratum in

  • PLoS One. 2013;8(9). doi:10.1371/annotation/5aeb88a0-1630-4a07-bb49-32cb5d617af1

Abstract

Correctly identifying associations of genes with diseases has long been a goal in biology. With the emergence of large-scale gene-phenotype association datasets, we can leverage statistical and machine learning methods to help achieve this goal. In this paper, we present two methods for predicting gene-disease associations based on functional gene associations and gene-phenotype associations in model organisms. The first method, the Katz measure, is motivated by its success in social network link prediction and is closely related to some recently proposed methods for gene-disease association inference. The second method, called Catapult (Combining dATa Across species using Positive-Unlabeled Learning Techniques), is a supervised machine learning method that uses a biased support vector machine whose features are derived from walks in a heterogeneous gene-trait network. We study the performance of the proposed methods and related state-of-the-art methods using two different evaluation strategies on two distinct data sets, namely OMIM phenotypes and drug-target interactions. We show that, although both methods perform very well, the Katz measure is better at identifying associations between traits and poorly studied genes, whereas Catapult is better suited to correctly identifying gene-trait associations overall [corrected].

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The combined network in the neighborhood of a human disease.
The local network around the human disease diabetes insipidus and two genes highly ranked by Catapult, AQP1 (top-ranked candidate) and MYBL2 (ranked number 40). AQP1 is ranked higher than MYBL2 because there are more paths from diabetes insipidus to AQP1 than to MYBL2, both through model organism phenotypes and through the gene-gene network. Only genes and phenotypes that are associated with both diabetes insipidus and the predicted genes AQP1 and MYBL2 are shown.
Figure 2. Katz features are derived by constructing walks of different kinds on the graph.
In the figure above, the disease node $d$ is connected to the gene node $g$ by one walk of length 2 (solid red line) and three walks of length 3 (dotted, dashed and dash-dotted red lines). This can be quickly calculated from the adjacency matrix $A$ of the graph: if $A_{ij} = 1$ when there is a link between nodes $i$ and $j$, and $A_{ij} = 0$ otherwise, the number of walks of length $l$ between nodes $i$ and $j$ is $(A^l)_{ij}$. In the example above, $(A^2)_{dg} = 1$ and $(A^3)_{dg} = 3$.
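The walk counting described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list below is a hypothetical toy graph chosen to reproduce the counts in the caption (one walk of length 2 and three of length 3), not the graph drawn in the figure. The `truncated_katz` helper uses the standard Katz definition (a sum of walk counts damped by a factor per step); the damping value `beta` and the cutoff `max_len` are assumed for illustration and are not taken from the paper.

```python
# Hypothetical toy graph, not the one drawn in the figure.
# Node indices: 0 = disease d, 1..4 = intermediate nodes, 5 = gene g.
EDGES = [(0, 1), (0, 2), (1, 5), (3, 5), (4, 5), (2, 3), (2, 4), (1, 3)]
N_NODES = 6


def adjacency(edges, n):
    """Build a symmetric 0/1 adjacency matrix as a list of lists."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def walk_counts(adj, max_len):
    """Return {l: A^l} for l = 1..max_len; (A^l)[i][j] counts walks of length l."""
    powers = {1: adj}
    for length in range(2, max_len + 1):
        powers[length] = mat_mul(powers[length - 1], adj)
    return powers


def truncated_katz(adj, beta, max_len):
    """Katz-style score matrix: sum over l of beta**l * A^l, cut off at max_len."""
    powers = walk_counts(adj, max_len)
    n = len(adj)
    return [[sum(beta ** length * powers[length][i][j] for length in powers)
             for j in range(n)] for i in range(n)]


A = adjacency(EDGES, N_NODES)
counts = walk_counts(A, 3)
print(counts[2][0][5])  # number of length-2 walks from d to g
print(counts[3][0][5])  # number of length-3 walks from d to g
print(truncated_katz(A, beta=0.1, max_len=3)[0][5])  # damped sum of both
```

Shorter walks contribute more to the score because each extra step multiplies the contribution by `beta`, which is the intuition behind ranking a gene higher when it is reachable from the disease by many short walks.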
Figure 3. Empirical cumulative distribution function for the rank of the withheld gene under cross-validation.
The left panel corresponds to evaluation on OMIM phenotypes, and the right to drug data. The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use all species information and the HumanNet gene network. PRINCE and RWRH are implemented as originally proposed, using the HPRD gene network. ProDiGe is implemented as discussed in the Methods section. Catapult (solid red) does much better across the data sets under this evaluation scheme. In general, the methods achieve high precision on the drug data. PRINCE, which does not allow walks through species phenotypes, and OMIM phenotypes in particular, performs much worse than the other random-walk based methods. ProDiGe shares information between phenotypes using the similarities between OMIM phenotypes and performs reasonably well there, whereas no such sharing is possible for the drug data because drug similarities are absent. The simple degree-based method performs poorly in general. ProDiGe and PRINCE essentially use only the gene network information in the case of the drug data.
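The quantity on the vertical axis can be sketched as follows: for each withheld gene, record its rank among the candidates for its disease, then report the fraction of withheld genes whose rank falls within the top-k cutoff. The function names, the tie-handling rule and the example scores and ranks below are illustrative assumptions, not values or code from the paper.

```python
def rank_of(scores, gene):
    """1-based rank of `gene`: one plus the number of candidates scoring strictly higher.

    Ties share the best rank; this tie rule is an assumption for illustration.
    """
    target = scores[gene]
    return 1 + sum(1 for s in scores.values() if s > target)


def rank_cdf(ranks, max_k):
    """Fraction of withheld genes with rank <= k, for k = 1..max_k."""
    n = len(ranks)
    return [sum(1 for r in ranks if r <= k) / n for k in range(1, max_k + 1)]


# Made-up scores for one disease and made-up ranks across five withheld genes.
example_scores = {"g1": 0.9, "g2": 0.5, "g3": 0.7}
example_ranks = [1, 3, 3, 7, 120]

print(rank_of(example_scores, "g3"))
print(rank_cdf(example_ranks, 5))
```

A curve that rises faster means more withheld genes are recovered near the top of the ranked list.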
Figure 4. Comparison only using HumanNet.
Empirical cumulative distribution function for the rank of the withheld gene under cross-validation. The left panel corresponds to evaluation on OMIM phenotypes, and the right to drug data. The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use all species information, and all the methods use the HumanNet gene network. PRINCE and RWRH are implemented as originally proposed, but using HumanNet. ProDiGe is implemented as discussed in the Methods section. As in Figure 3, Catapult (solid red) does the best. An important observation from the plots is that PRINCE and RWRH perform considerably better than in Figure 3, where the HPRD network was used. (Note that there is no change to the ProDiGe, Katz and Catapult methods; they have the same settings as in Figure 3.)
Figure 5. Precision-Recall curves for three-fold cross validation.
The left panel corresponds to evaluation on OMIM phenotypes, and the right to drug data. The vertical axis shows the precision, i.e. the fraction of true positives in the top-$k$ predictions. The horizontal axis shows the recall, i.e. the ratio of true positives recovered in the top-$k$ predictions to the total number of positives for a phenotype (or a drug) in the hidden set. The plots show precision-recall values at various thresholds $k$, and the value at a given $k$ is averaged over all the phenotypes (drugs). The plots use the same experimental setup as in Figure 4, and the comparisons given by the precision-recall measure are consistent with the rank cdf measure in Figure 4.
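The precision and recall definitions in this caption can be written out as a short sketch. The phenotype names, gene identifiers and ranked lists below are made up for illustration, and the helper names are hypothetical; only the two ratios and the per-phenotype averaging follow the caption.

```python
def precision_recall_at_k(ranked_genes, hidden_genes, k):
    """Precision and recall of the top-k predictions against the hidden set."""
    hits = sum(1 for gene in ranked_genes[:k] if gene in hidden_genes)
    return hits / k, hits / len(hidden_genes)


def averaged_curve(predictions, hidden_sets, thresholds):
    """Average precision and recall over all phenotypes at each threshold k."""
    curve = []
    for k in thresholds:
        pairs = [precision_recall_at_k(predictions[name], hidden_sets[name], k)
                 for name in predictions]
        mean_precision = sum(p for p, _ in pairs) / len(pairs)
        mean_recall = sum(r for _, r in pairs) / len(pairs)
        curve.append((k, mean_precision, mean_recall))
    return curve


# Made-up ranked predictions and hidden (withheld) associations.
example_predictions = {
    "pheno_a": ["g1", "g2", "g3", "g4"],
    "pheno_b": ["g5", "g6", "g7", "g8"],
}
example_hidden = {"pheno_a": {"g1", "g3"}, "pheno_b": {"g6"}}

for k, precision, recall in averaged_curve(example_predictions, example_hidden, [1, 2]):
    print(k, precision, recall)
```

Sweeping `k` traces out one point of the curve per threshold, so a method whose curve lies above another's recovers more hidden associations at the same precision.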
Figure 6. Empirical cumulative distribution function for the rank of withheld singleton genes.
The left panel corresponds to evaluation on OMIM phenotypes, and the right to drug data. The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use all species information, and all the methods use the HumanNet gene network. PRINCE and RWRH are implemented as originally proposed, but using the HumanNet gene network. ProDiGe is implemented as discussed in the Methods section. We have not included the degree-based list from Figure 4, since all the singleton genes are always given degree 0 during cross-validation. Catapult (solid red) does much better than ProDiGe (the only other supervised method), but relative to the walk-based methods it does worse than in Figure 4 (which uses the same setting for all the methods). PRINCE and ProDiGe are consistent with (and sometimes perform slightly better than) the full cross-validation evaluation. RWRH and the Katz measure perform better than the supervised learning methods ProDiGe and Catapult in this evaluation scheme. The fact that PRINCE performs so well on singletons in the drug data case is surprising, given that the only information it uses is the HumanNet gene network.
Figure 7. Empirical cumulative distribution function for the rank of withheld genes from OMIM phenotypes, restricted to genes in a small linkage neighborhood of the withheld genes.
The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use all species information, and all the methods use the HumanNet gene network. PRINCE and RWRH are implemented as originally proposed, but using the HumanNet gene network. ProDiGe is implemented as discussed in the Methods section. Catapult performs the best, and RWRH and Katz are competitive as well. The results are consistent with our observations from Figure 4.
Figure 8. Empirical cumulative distribution function for the rank of withheld genes from OMIM phenotypes with a single known gene (left panel) and more than one known gene (right panel).
The vertical axis shows the probability that a true gene association is retrieved in the top-$k$ predictions for a disease. The Katz and Catapult methods use all species information. PRINCE and RWRH are implemented as originally proposed, using the HPRD network. ProDiGe is implemented as discussed in the Methods section. For phenotypes with only one known gene (left panel), the only information is the phenotype-phenotype similarity, and all network-based methods perform poorly. Nonetheless, there is a gradation in the performance of the different methods, and Catapult does slightly better. All the methods do substantially better on phenotypes with more than one known gene (right panel).
Figure 9. Distribution of the number of known genes in OMIM diseases (left) and drugs (right).
The bar corresponding to the genes on which we did the singleton validation is shown in yellow.

References

    1. Goh K, Cusick M, Valle D, Childs B, Vidal M, et al. (2007) The human disease network. Proceedings of the National Academy of Sciences 104: 8685.
    2. Tian W, Zhang LV, Taşan M, Gibbons FD, King OD, et al. (2008) Combining guilt-by-association and guilt-by-profiling to predict Saccharomyces cerevisiae gene function. Genome Biology 9 Suppl 1: S7.
    3. Ulitsky I, Shamir R (2007) Identification of functional modules using network topology and high-throughput data. BMC Systems Biology 1: 8.
    4. Human Protein Reference Database, HPRD. Available: http://www.hprd.org. Accessed: 2012 Aug.
    5. Wu X, Jiang R, Zhang MQ, Li S (2008) Network-based global inference of human disease genes. Mol Syst Biol 4: 189.
