Bioinformatics. 2018 Jul 1;34(13):i447-i456.

doi: 10.1093/bioinformatics/bty289.

Gene prioritization using Bayesian matrix factorization with genomic and phenotypic side information

Pooya Zakeri¹, Jaak Simm¹, Adam Arany¹, Sarah ElShal¹, Yves Moreau¹

Affiliation

¹ Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven and imec, Kapeldreef Leuven, Belgium.

PMID: 29949967
PMCID: PMC6022676
DOI: 10.1093/bioinformatics/bty289

Pooya Zakeri et al. Bioinformatics. 2018.

Abstract

Motivation: Most gene prioritization methods model each disease or phenotype individually, but this fails to capture patterns common to several diseases or phenotypes. To overcome this limitation, we formulate the gene prioritization task as the factorization of a sparsely filled gene-phenotype matrix, where the objective is to predict the unknown matrix entries. To deliver more accurate gene-phenotype matrix completion, we extend classical Bayesian matrix factorization to work with multiple side information sources. The availability of side information allows us to make non-trivial predictions for genes for which no previous disease association is known.

Results: A novel aspect of our gene prioritization method is that it integrates not only data sources describing genes but also data sources describing Human Phenotype Ontology terms. Experimental results on our benchmarks show that the proposed model improves accuracy over the well-established gene prioritization method Endeavour. In particular, compared to Endeavour, it offers promising results on diseases of the nervous system; diseases of the eye and adnexa; endocrine, nutritional and metabolic diseases; and congenital malformations, deformations and chromosomal abnormalities.

Availability and implementation: The Bayesian data fusion method is implemented as a Python/C++ package: https://github.com/jaak-s/macau. It is also available as a Julia package: https://github.com/jaak-s/BayesianDataFusion.jl. All data and benchmarks generated or analyzed during this study can be downloaded at https://owncloud.esat.kuleuven.be/index.php/s/UGb89WfkZwMYoTn.

Supplementary information: Supplementary data are available at Bioinformatics online.
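The formulation in the abstract (completing a sparsely observed gene-phenotype matrix while using side information on both genes and phenotypes) can be illustrated with a minimal sketch. This is not the authors' Bayesian implementation and does not use the macau package: it is a simplified point-estimate variant fitted by gradient descent, in which each latent factor is a free term plus a linear map (a "link matrix") of the side-information features. The function name, hyperparameters and the real-valued synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_mf_with_side_info(Y, mask, X_gene, X_phen, rank=3,
                          lr=0.05, reg=0.01, n_iter=2000, seed=0):
    """Point-estimate matrix factorization with side information.

    Y      : (n_genes, n_phen) matrix, only entries where mask is True are used
    X_gene : (n_genes, d_gene) gene features; X_phen : (n_phen, d_phen) phenotype features
    Returns gene factors G, phenotype factors P and the two link matrices.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_phen = Y.shape
    U = 0.1 * rng.standard_normal((n_genes, rank))   # free part of gene factors
    V = 0.1 * rng.standard_normal((n_phen, rank))    # free part of phenotype factors
    B_gene = np.zeros((X_gene.shape[1], rank))       # link matrix: gene features -> factors
    B_phen = np.zeros((X_phen.shape[1], rank))       # link matrix: phenotype features -> factors
    for _ in range(n_iter):
        G = U + X_gene @ B_gene
        P = V + X_phen @ B_phen
        E = mask * (G @ P.T - Y)                     # residual on observed entries only
        grad_G, grad_P = E @ P, E.T @ G
        U -= lr * (grad_G + reg * U)
        V -= lr * (grad_P + reg * V)
        B_gene -= lr * (X_gene.T @ grad_G + reg * B_gene)
        B_phen -= lr * (X_phen.T @ grad_P + reg * B_phen)
    return U + X_gene @ B_gene, V + X_phen @ B_phen, B_gene, B_phen

# Synthetic toy data (assumption: real-valued and low-rank, unlike real OMIM associations).
rng = np.random.default_rng(1)
X_gene = rng.standard_normal((30, 5))
X_gene /= np.linalg.norm(X_gene, 2)                  # scale features for stable step sizes
X_phen = rng.standard_normal((8, 3))
X_phen /= np.linalg.norm(X_phen, 2)
Y = (X_gene @ rng.standard_normal((5, 3))) @ (X_phen @ rng.standard_normal((3, 3))).T
Y /= np.linalg.norm(Y, 2)
mask = rng.random(Y.shape) < 0.6                     # roughly 60% of entries observed

G, P, B_gene, B_phen = fit_mf_with_side_info(Y, mask, X_gene, X_phen)
scores = G @ P.T                                     # completed gene-phenotype matrix
ranking = np.argsort(-scores, axis=0)                # column j: gene indices, best first

# A gene with no observed associations can still be scored from its features alone.
x_new = X_gene[0]                                    # stand-in for an unseen gene's features
cold_scores = (x_new @ B_gene) @ P.T                 # one score per phenotype
```

The last two lines show why side information matters: without the link matrix, a gene with an empty row would have no informative latent vector, so its predictions would be trivial.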

Figures

**Fig. 1.**
The graphical representation of our proposed model. The left panel illustrates the OMIM database as a partially observed matrix where each row is a gene and each column is a disease phenotype. The goal of our proposed model is to express the OMIM matrix as the product of two matrices, $G^T$ and $P$. The right panel shows a graphical representation of our proposed model for Bayesian matrix factorization with side information on both genes and phenotypes.

**Fig. 2.**
Concept of gene prioritization using matrix factorization. In the first step, a gene-disease association database (OMIM in our case) is represented as a partially observed matrix. In the second step, extra information available about genes and phenotypes is prepared for incorporation into the matrix factorization procedure. Both literature-based phenotypic ($\mathrm{Phen}_{\mathrm{Text}}$) and literature-based genomic information are extracted from PubMed. A raw fusion approach is employed to integrate multiple genomic data sources. In the third step, our Bayesian data fusion model jointly learns two thin matrices (gene and disease factors) and two link matrices (namely, $\beta_{gene}$ and $\beta_{phen}$). This step illustrates the architecture of our matrix factorization model (GeneHound) for gene prioritization. In the fourth step, we complete the OMIM matrix using the learned gene and disease factors. Finally, in the fifth step, GeneHound ranks all genes in each phenotype column of the fully predicted OMIM matrix separately. For each disease, genes with the highest predicted values are colored in red.

**Fig. 3.**
BEDROC score results: GeneHound versus Endeavour. The performance of GeneHound with various latent dimensions, our final model ($\mathrm{GeneHound}_{\mathrm{GeoAgg}}$), and Endeavour is evaluated on our **OMIM2** benchmark. The label of each panel gives the value of α used to evaluate the models. The black solid lines in the box plots correspond to the median values.

**Fig. 4.**
Comparison of the BSV curves for our proposed models and Endeavour. The BSV curve is a plot of average BEDROC scores versus increasing values of α in the BEDROC Equation (9). The greater the value of α, the heavier the weight placed on early discovery. The performance of $\mathrm{GeneHound}_{\mathrm{GeoAgg}}$ and Endeavour is evaluated on the **OMIM2** benchmark.

**Fig. 5.**
The average BEDROC scores of ICD-10-based disease groups: $\mathrm{GeneHound}_{\mathrm{GeoAgg}}$ versus Endeavour. The average BEDROC scores are shown for nine ICD-10-based disease groups with at least three diseases in the **OMIM2** benchmark. α is set to 16.1 and 160.9.
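The figure captions rely on the BEDROC score and its α parameter, so a small sketch of how such a score can be computed may help. It uses the commonly used BEDROC definition (a normalized robust initial enhancement, as introduced by Truchon and Bayly). Whether this matches the paper's Equation (9) term by term is an assumption, since that equation is not reproduced on this page, and the function name and arguments are hypothetical.

```python
import math

def bedroc(ranks, n_total, alpha):
    """BEDROC score for the 1-based ranks of the true genes among n_total candidates.

    Returns a value in [0, 1]: 1 when all true genes are ranked first,
    0 when they are all ranked last. Larger alpha puts more weight on early ranks.
    """
    n = len(ranks)
    if n == 0 or n >= n_total:
        raise ValueError("need at least one true gene and at least one other candidate")
    ra = n / n_total                                   # fraction of true genes
    # Robust initial enhancement (RIE): exponentially weighted sum over true-gene ranks,
    # divided by its expectation under a uniformly random ranking.
    sum_exp = sum(math.exp(-alpha * r / n_total) for r in ranks)
    random_sum = (n / n_total) * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = sum_exp / random_sum
    # RIE values for the best and worst possible rankings, used to rescale to [0, 1].
    rie_max = (1 - math.exp(-alpha * ra)) / (ra * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ra)) / (ra * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)

# One held-out true gene ranked 10th among 100 candidates, at the two alpha values
# mentioned in the Fig. 5 caption.
score_low_alpha = bedroc([10], 100, 16.1)
score_high_alpha = bedroc([10], 100, 160.9)
```

For the same rank, the score at α = 160.9 is much lower than at α = 16.1, which reflects the heavier emphasis on early discovery at larger α.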


Cited by

Applications of machine learning to diagnosis and treatment of neurodegenerative diseases.
Myszczynska MA, Ojamies PN, Lacoste AMB, Neil D, Saffari A, Mead R, Hautbergue GM, Holbrook JD, Ferraiuolo L. Nat Rev Neurol. 2020 Aug;16(8):440-456. doi: 10.1038/s41582-020-0377-8. Epub 2020 Jul 15. PMID: 32669685. Review.
Potential Schizophrenia Disease-Related Genes Prediction Using Metagraph Representations Based on a Protein-Protein Interaction Keyword Network: Framework Development and Validation.
Yu S, Wang Z, Nan J, Li A, Yang X, Tang X. JMIR Form Res. 2023 Nov 15;7:e50998. doi: 10.2196/50998. PMID: 37966892. Free PMC article.
HetIG-PreDiG: A Heterogeneous Integrated Graph Model for Predicting Human Disease Genes based on gene expression.
Jagodnik KM, Shvili Y, Bartal A. PLoS One. 2023 Feb 15;18(2):e0280839. doi: 10.1371/journal.pone.0280839. eCollection 2023. PMID: 36791052. Free PMC article.
Knowledge graph-aided Bayesian active learning for top-K genetic interaction discovery.
Soper B, Lisicki M, Silva M, Cadena J, Zhu H, Sundaram S, Ray P, Drocco J. Sci Rep. 2025 Aug 25;15(1):31196. doi: 10.1038/s41598-025-13972-7. PMID: 40854903. Free PMC article.
Identifying potential association on gene-disease network via dual hypergraph regularized least squares.
Yang H, Ding Y, Tang J, Guo F. BMC Genomics. 2021 Aug 9;22(1):605. doi: 10.1186/s12864-021-07864-z. PMID: 34372777. Free PMC article.
