BMC Med Inform Decis Mak. 2023 Mar 6;22(Suppl 6):347.

doi: 10.1186/s12911-023-02112-8.

Decision tree learning in Neo4j on homogeneous and unconnected graph nodes from biological and clinical datasets

Rahul Mondal^#¹, Minh Dung Do^#², Nasim Uddin Ahmed^#², Daniel Walke³, Daniel Micheel⁴, David Broneske⁴ ⁵, Gunter Saake⁵, Robert Heyer⁵

Affiliations

¹ Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany. rahulmondal415@gmail.com.
² Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany.
³ Faculty of Process and Systems Engineering, Otto-von-Guericke-University Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany.
⁴ German Centre for Higher Education Research and Science Studies, Lange Laube 12, 30159, Hannover, Germany.
⁵ Research Group Databases and Software Engineering, Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany.

^# Contributed equally.

PMID: 36879243
PMCID: PMC9988195
DOI: 10.1186/s12911-023-02112-8

Rahul Mondal et al. BMC Med Inform Decis Mak. 2023.

Abstract

Background: Graph databases enable efficient storage of heterogeneous, highly interlinked data, such as clinical data. Researchers can then extract relevant features from these datasets and apply machine learning for diagnosis, biomarker discovery, or understanding pathogenesis.

Methods: To facilitate machine learning and avoid the time spent extracting data from the graph database, we developed and optimized the Decision Tree Plug-in (DTP), which contains 24 procedures to generate and evaluate decision trees directly in the graph database Neo4j on homogeneous and unconnected nodes.

Results: Creating the decision tree for three clinical datasets directly in the graph database from the nodes took between 0.059 and 0.099 s, whereas computing the decision tree with the same algorithm in Java from CSV files took 0.085-0.112 s. For these small datasets, our approach was also faster than the standard decision tree implementation in R (0.62 s) and on par with the one in Python (0.08 s), both of which used CSV files as input. In addition, we explored the strengths of DTP on a large dataset (approx. 250,000 instances) for predicting patients with diabetes and compared its performance against models generated by state-of-the-art packages in R and Python. Neo4j with DTP achieved competitive results in both prediction quality and time efficiency. Furthermore, we could show that high body-mass index and high blood pressure are the main risk factors for diabetes.

Conclusion: Overall, our work shows that integrating machine learning into graph databases saves the time and external memory otherwise spent on additional processing steps, and that it could be applied to a variety of use cases, including clinical applications. It also provides users with the advantages of high scalability, visualization, and complex querying.

Keywords: Cypher; Decision tree; Graph database; Java; Neo4j; Python; R.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Data flow in the decision tree plug-in (DTP)

**Fig. 3**
Box Plots—Accuracy and Matthews Correlation Coefficient of the algorithms: A, D for different tools including DTP, B, E for different splitting criteria in DTP, and C, F for the datasets 1-3 in DTP

**Fig. 4**
Box Plots—Generation Time of the Decision Tree Algorithms: A for different tools including DTP, B for different splitting criteria in DTP, and C for the datasets 1-3 in DTP

**Fig. 5**
Box Plots: evaluation of the diabetes dataset (Dataset 4) across different tools. A accuracy, B precision, C Matthews Correlation Coefficient, and D generation time

**Fig. 6**
Scatter Plots with Line of Regression—To interpolate the effect of instance size (rows/nodes) on generation time and accuracy of algorithms generated by DTP for all the 4 datasets (Dataset 1, 2, 3 and 4)

**Fig. 7**
Dataset 4 uploaded as homogeneous and unconnected nodes in Neo4j

**Fig. 8**
Decision Tree for Dataset 4 (split = gain ratio). The red nodes are leaf nodes indicating a diagnosis of diabetes (2), borderline (1), or no diabetes (0) in a patient, while the blue nodes are decision nodes. Note that this tree was generated on a subset of Dataset 4 after the class imbalance was handled. There were 13,893 instances (4,631 for each class label) and 22 variables. The tree has been pruned to max_depth = 2
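
The gain ratio splitting criterion named in the Fig. 8 caption can be illustrated with a minimal Python sketch. This is not the DTP implementation (the plug-in itself runs as procedures inside Neo4j); it only shows the textbook definition, information gain divided by split information, on a toy table. The function names, the feature names `high_bmi` and `smoker`, and the four example rows are all hypothetical and are not taken from Dataset 4.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, feature):
    """Gain ratio of splitting `rows` on a categorical `feature`.

    rows   : list of dicts mapping feature name -> value (hypothetical layout)
    labels : list of class labels, aligned with rows
    """
    total = len(labels)

    # Group the class labels by the value the feature takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)

    # Information gain: parent entropy minus weighted child entropy.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder

    # Split information: entropy of the feature's own value distribution.
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (made up for illustration): label 2 = diabetes, 0 = no diabetes.
rows = [
    {"high_bmi": 1, "smoker": 0},
    {"high_bmi": 1, "smoker": 1},
    {"high_bmi": 0, "smoker": 0},
    {"high_bmi": 0, "smoker": 1},
]
labels = [2, 2, 0, 0]

best = max(["high_bmi", "smoker"], key=lambda f: gain_ratio(rows, labels, f))
print(best, gain_ratio(rows, labels, best))
```

In this toy table `high_bmi` separates the two classes perfectly and `smoker` carries no information, so a tree builder using this criterion would choose `high_bmi` for the root split.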
