Use of Graph Database for the Integration of Heterogeneous Biological Data

Byoung-Ha Yoon¹,², Seon-Kyu Kim¹, Seon-Young Kim¹,²

Affiliations

¹ Personalized Genomic Medicine Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon 34141, Korea.
² Department of Functional Genomics, University of Science and Technology (UST), Daejeon 34113, Korea.

PMID: 28416946
PMCID: PMC5389944
DOI: 10.5808/GI.2017.15.1.19

Byoung-Ha Yoon et al. Genomics Inform. 2017 Mar;15(1):19-27. doi: 10.5808/GI.2017.15.1.19. Epub 2017 Mar 29.

Abstract

Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance of MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited slow or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.

Keywords: Neo4j; biological network; data mining; graph database; heterogeneous biological data; query performance.

Figures

Fig. 1. Diagram for optimization of the performance of the Neo4j graph database. Bottom layer: file open limit optimization; Neo4j often produces many small and random reads when querying data. Middle layer: page cache sizing; if all, or at least most, of the graph data files from a hard disk are cached into memory, it will reduce disk access and result in optimal performance. Top layer: heap sizing; it is beneficial to set a large heap space to support various query operations. OS, operating system; JVM, Java Virtual Machine.
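The three layers in Fig. 1 amount to dividing server memory between the operating system, the page cache, and the JVM heap. A minimal sketch of that split follows; the function name `plan_memory`, the 2 GB OS reserve, the 1 GB minimum heap, and the 20% cache headroom are all illustrative assumptions, not values reported by the authors or required by Neo4j.

```python
def plan_memory(total_ram_gb, store_size_gb,
                os_reserve_gb=2.0, min_heap_gb=1.0, cache_headroom=1.2):
    """Split RAM between OS, page cache, and JVM heap (hypothetical rule of thumb)."""
    available = total_ram_gb - os_reserve_gb
    if available <= min_heap_gb:
        raise ValueError("not enough RAM left after the OS reserve")
    # Page cache: aim to hold the whole graph store plus some growth headroom,
    # but always leave at least min_heap_gb for the heap.
    page_cache = min(store_size_gb * cache_headroom, available - min_heap_gb)
    # Heap: the remainder, used for query operations.
    heap = available - page_cache
    return {
        "os_gb": os_reserve_gb,
        "page_cache_gb": round(page_cache, 2),
        "heap_gb": round(heap, 2),
        "store_fully_cached": page_cache >= store_size_gb,
    }


if __name__ == "__main__":
    # Example: a 32 GB server holding a 10 GB graph store.
    print(plan_memory(32, 10))
```

When the store does not fit in memory, `store_fully_cached` is false, which is the case in which the caption predicts additional disk access.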

Fig. 2. Preprocessing for data structure modeling of graph database: (1) data set download using CSV or TSV format; (2) standardized representation of each node: gene, protein, disease, etc; (3) integration of node-node (e.g., gene-protein, gene-disease, drug-disease, etc.) associations from multiple data sources; and (4) filtering of unconnected and redundant entities. The final graph database contains 114,550 nodes and 82,674,321 relationships.
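The four preprocessing steps in Fig. 2 can be sketched roughly as below. This is not the authors' pipeline: the two-column TSV layout, the uppercase normalization of identifiers, and the names `load_edges` and `integrate` are assumptions made for illustration, and the sample identifiers are invented.

```python
import csv
import io


def load_edges(tsv_text, source_type, target_type):
    """Steps 1-2: parse a two-column TSV and normalize each node to (type, ID)."""
    edges = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        src, dst = row[0].strip().upper(), row[1].strip().upper()
        if src and dst:
            edges.append(((source_type, src), (target_type, dst)))
    return edges


def integrate(*edge_lists):
    """Steps 3-4: merge sources, drop redundant edges, keep only connected nodes."""
    edges = set()
    for edge_list in edge_lists:
        edges.update(edge_list)  # the set removes duplicate associations
    nodes = {node for edge in edges for node in edge}  # unconnected nodes never appear
    return nodes, edges


if __name__ == "__main__":
    source_a = "gene_a\tD001\nGENE_A\tD001\ngene_b\tD002\n"  # hypothetical gene-disease file
    source_b = "DRUG_1\td001\n"                               # hypothetical drug-disease file
    nodes, edges = integrate(
        load_edges(source_a, "Gene", "Disease"),
        load_edges(source_b, "Drug", "Disease"),
    )
    print(len(nodes), "nodes,", len(edges), "relationships")
```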

Fig. 3. Construction of graph model of biological relationships. Each node represents a biological element, and nodes are connected by various types of relationships. Each node can define various properties. Relationships can be defined by various types, and each relationship has various properties. This allows a detailed search through the property when retrieving nodes and relationships.
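The property graph model described in Fig. 3 can be illustrated with a small in-memory sketch. The class and method names here (`PropertyGraph`, `find_relationships`) and the sample labels and properties are hypothetical; a real deployment would rely on the graph database itself rather than on Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                      # e.g. "Gene", "Disease"
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: str
    end: str
    rel_type: str                   # e.g. "ASSOCIATED_WITH"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = Node(node_id, label, props)

    def add_relationship(self, start, end, rel_type, **props):
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.relationships.append(Relationship(start, end, rel_type, props))

    def find_relationships(self, rel_type=None, **props):
        """Filter relationships by type and by exact property values."""
        return [
            r for r in self.relationships
            if (rel_type is None or r.rel_type == rel_type)
            and all(r.properties.get(k) == v for k, v in props.items())
        ]


if __name__ == "__main__":
    g = PropertyGraph()
    g.add_node("g1", "Gene", symbol="GENE_A")
    g.add_node("d1", "Disease", name="DISEASE_X")
    g.add_relationship("g1", "d1", "ASSOCIATED_WITH", source="db_a", score=0.9)
    g.add_relationship("g1", "d1", "ASSOCIATED_WITH", source="db_b")
    print(g.find_relationships("ASSOCIATED_WITH", source="db_a"))
```

Keeping properties on the relationship, not only on the nodes, is what makes the property-level search mentioned in the caption possible.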

Fig. 4. Schematic of an integrated graph model, showing the node types and the relationship types used in the integrated biological dataset and how nodes interact with one another. GO, gene ontology; SNP, single nucleotide polymorphism; CNV, copy number variant.

Fig. 5. Procedure for importing integrated relationship data into a graph database. ‘DataManager.java’ defines the relationship between each raw data to be input and performs preprocessing steps, such as removing duplicates. ‘Parsers.java’ reads raw data from a text file and stores them in the graph database. ‘Mapping.java’ classifies nodes and relationships from the parsed raw data. ‘Filter.java’ removes duplicate or ambiguous nodes and relationships among created nodes and relationships. ‘BuildManager.java’ structures the filtered nodes and relationships information according to the previously defined graph database model structure. ‘DataStructure.java’ and ‘Integrate.java’ build a graph database by allocating nodes and relationships according to the modeled database structure.

Fig. 6. Comparison of the performance of query execution between optimized and non-optimized servers. Two servers were queried using the same search operation; the optimized server took 138 ms, whereas the non-optimized server took 316 ms.

Fig. 7. Comparison of the performance of query execution between relational and graph databases. MySQL and Neo4j were compared by searching relationships on 3 and 4 layers. The search for 3 layers is a search for gene-disease-drugs associated with a particular disease. The search for 4 layers is a search for gene-protein-drug-pathway associated with a particular protein.
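To make the 3-layer search concrete, the sketch below runs the same gene-disease-drug lookup two ways: as a join across tables and as a traversal from the disease node. It uses Python's built-in `sqlite3` as a stand-in for MySQL and plain dictionaries as a stand-in for Neo4j, so it shows only the shape of the two queries and says nothing about their relative speed. The table names and data are invented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical association data.
GENE_DISEASE = [("GENE_A", "DISEASE_X"), ("GENE_B", "DISEASE_X"), ("GENE_C", "DISEASE_Y")]
DRUG_DISEASE = [("DRUG_1", "DISEASE_X"), ("DRUG_2", "DISEASE_Y")]


def relational_search(disease):
    """Relational style: one table per association type, combined with a JOIN."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE gene_disease (gene TEXT, disease TEXT)")
    con.execute("CREATE TABLE drug_disease (drug TEXT, disease TEXT)")
    con.executemany("INSERT INTO gene_disease VALUES (?, ?)", GENE_DISEASE)
    con.executemany("INSERT INTO drug_disease VALUES (?, ?)", DRUG_DISEASE)
    rows = con.execute(
        "SELECT gd.gene, gd.disease, dd.drug "
        "FROM gene_disease AS gd "
        "JOIN drug_disease AS dd ON dd.disease = gd.disease "
        "WHERE gd.disease = ?",
        (disease,),
    ).fetchall()
    con.close()
    return sorted(rows)


def graph_search(disease):
    """Graph style: start at the disease node and follow its adjacent edges."""
    genes_of = defaultdict(set)
    drugs_of = defaultdict(set)
    for gene, dis in GENE_DISEASE:
        genes_of[dis].add(gene)
    for drug, dis in DRUG_DISEASE:
        drugs_of[dis].add(drug)
    return sorted(
        (gene, disease, drug)
        for gene in genes_of[disease]
        for drug in drugs_of[disease]
    )


if __name__ == "__main__":
    print(relational_search("DISEASE_X"))
    print(graph_search("DISEASE_X"))
```

Each additional layer adds another JOIN to the relational query, whereas the traversal only follows one more set of edges from the nodes already reached.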

Fig. 8. Examples of using a graph database to find biologically meaningful information. Comparison of the nodes in the shortest path and the nodes in the other path (A) and flexible extension of the existing graph database with a new type of information (B).
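The shortest-path comparison in Fig. 8A can be illustrated with a breadth-first search over an unweighted graph. This is a generic sketch, not the query the authors ran; the adjacency data and the function name `shortest_path` are hypothetical.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path (fewest hops) from start to goal, or None."""
    if start == goal:
        return [start]
    parents = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = current
            if neighbor == goal:
                # Walk back through the parents to rebuild the path.
                path = [neighbor]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


if __name__ == "__main__":
    # Hypothetical undirected graph with two routes from GENE_A to DRUG_1.
    adjacency = {
        "GENE_A": ["PROTEIN_A", "DISEASE_X"],
        "PROTEIN_A": ["GENE_A", "PATHWAY_1"],
        "PATHWAY_1": ["PROTEIN_A", "DRUG_1"],
        "DISEASE_X": ["GENE_A", "DRUG_1"],
        "DRUG_1": ["DISEASE_X", "PATHWAY_1"],
    }
    print(shortest_path(adjacency, "GENE_A", "DRUG_1"))
```

In this toy graph the route through the disease node is shorter than the route through the protein and pathway nodes, which is the kind of contrast the figure panel describes.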
