BMC Med Inform Decis Mak. 2023 Mar 6;22(Suppl 6):347.
doi: 10.1186/s12911-023-02112-8.

Decision tree learning in Neo4j on homogeneous and unconnected graph nodes from biological and clinical datasets

Rahul Mondal et al. BMC Med Inform Decis Mak.

Abstract

Background: Graph databases enable efficient storage of heterogeneous, highly interlinked data, such as clinical data. Researchers can then extract relevant features from these datasets and apply machine learning for diagnosis, biomarker discovery, or understanding pathogenesis.

Methods: To facilitate machine learning and save the time needed to extract data from the graph database, we developed and optimized the Decision Tree Plug-in (DTP), which contains 24 procedures to generate and evaluate decision trees directly in the graph database Neo4j on homogeneous and unconnected nodes.

Results: Creating the decision tree for three clinical datasets directly in the graph database from the nodes took between 0.059 and 0.099 s, whereas computing the decision tree with the same algorithm in Java from CSV files took 0.085-0.112 s. For these small datasets, our approach was also faster than the standard decision tree implementation in R (0.62 s) and on par with Python (0.08 s), both of which used CSV files as input. In addition, we explored the strengths of DTP on a large dataset (approx. 250,000 instances) to predict patients with diabetes and compared its performance against models generated by state-of-the-art packages in R and Python. Neo4j delivered competitive results in both prediction quality and time efficiency. Furthermore, we could show that high body-mass index and high blood pressure are the main risk factors for diabetes.
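Prediction quality in the figures of this paper is reported as accuracy and Matthews Correlation Coefficient (MCC). As a rough illustration of how the two differ, the following minimal sketch computes both from the counts of a binary confusion matrix. The function names and the choice to return 0.0 when the MCC denominator vanishes are assumptions made for this example; the abstract does not describe how DTP computes these metrics.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary MCC from confusion-matrix counts.

    Assumption for this sketch: return 0.0 when the denominator is zero
    (for example, when the classifier predicts only one class).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical, imbalanced example: 99 positives, 1 negative,
    # and a classifier that labels everything as positive.
    counts = dict(tp=99, tn=0, fp=1, fn=0)
    print("accuracy:", accuracy(**counts))
    print("MCC:", matthews_corrcoef(**counts))
```

On imbalanced data such as the hypothetical counts above, accuracy can look high while MCC stays at zero, which is one reason to report both.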

Conclusion: Overall, our work shows that integrating machine learning into graph databases saves time otherwise spent on additional processes, as well as external memory, and could be applied to a variety of use cases, including clinical applications. This provides users with the advantages of high scalability, visualization and complex querying.

Keywords: Cypher; Decision tree; Graph database; Java; Neo4j; Python; R.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1: Data flow in the decision tree plug-in (DTP)
Fig. 2: Available procedures in DTP
Fig. 3: Box plots of accuracy and Matthews Correlation Coefficient of the algorithms: A, D for different tools including DTP; B, E for different splitting criteria in DTP; and C, F for datasets 1-3 in DTP
Fig. 4: Box plots of the generation time of the decision tree algorithms: A for different tools including DTP; B for different splitting criteria in DTP; and C for datasets 1-3 in DTP
Fig. 5: Box plots evaluating the diabetes dataset (Dataset 4) across different tools: A accuracy, B precision, C Matthews Correlation Coefficient, and D generation time
Fig. 6: Scatter plots with regression lines, used to interpolate the effect of instance size (rows/nodes) on generation time and accuracy of the algorithms generated by DTP for all four datasets (Datasets 1, 2, 3 and 4)
Fig. 7: Dataset 4 uploaded as homogeneous and unconnected nodes in Neo4j
Fig. 8: Decision tree for Dataset 4 (split = gain ratio). The red nodes are the leaf nodes, indicating a diagnosis of diabetes (2), borderline (1) or no diabetes (0) in a patient, while the blue nodes are the decision nodes. Note that this tree was generated on a subset of Dataset 4 after the class imbalance was handled. There were 13893 instances (4631 for each class label) and 22 variables. The tree has been pruned to max_depth = 2
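The Fig. 8 caption names gain ratio as the splitting criterion. As a hedged illustration only, the sketch below computes the textbook (C4.5-style) gain ratio of a single categorical attribute against class labels. It is not taken from DTP, whose implementation is not described here; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(items) -> float:
    """Shannon entropy (in bits) of the value distribution in `items`."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels) -> float:
    """Textbook gain ratio of a categorical attribute.

    gain ratio = information gain / split information, where split
    information is the entropy of the attribute's own value distribution.
    Assumption for this sketch: return 0.0 when the attribute has a single
    value and the split information is therefore zero.
    """
    n = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the labels after splitting on the attribute.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: a blood-pressure flag and a diagnosis label.
    high_bp = ["yes", "yes", "no", "no"]
    diabetes = [1, 1, 0, 0]
    print("gain ratio:", gain_ratio(high_bp, diabetes))
```

A decision tree learner using this criterion would evaluate every candidate attribute at a node and split on the one with the highest gain ratio.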
