Genomics Inform. 2017 Mar;15(1):19-27.
doi: 10.5808/GI.2017.15.1.19. Epub 2017 Mar 29.

Use of Graph Database for the Integration of Heterogeneous Biological Data

Byoung-Ha Yoon et al. Genomics Inform. 2017 Mar.

Abstract

Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance between MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited latent or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.

Keywords: Neo4j; biological network; data mining; graph database; heterogeneous biological data; query performance.

Figures

Fig. 1. Diagram for optimization of the performance of the Neo4j graph database. Bottom layer: file open limit optimization; Neo4j often produces many small and random reads when querying data. Middle layer: page cache sizing; caching all, or at least most, of the graph data files from the hard disk in memory reduces disk access and results in optimal performance. Top layer: heap sizing; it is beneficial to set a large heap space to support various query operations. OS, operating system; JVM, Java Virtual Machine.
Fig. 2. Preprocessing for data structure modeling of graph database: (1) data set download using CSV or TSV format; (2) standardized representation of each node: gene, protein, disease, etc.; (3) integration of node-node associations (e.g., gene-protein, gene-disease, drug-disease) from multiple data sources; and (4) filtering of unconnected and redundant entities. The final graph database contains 114,550 nodes and 82,674,321 relationships.
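The filtering in step (4) can be sketched in Python as follows. This is only a minimal illustration and not the authors' pipeline; the function name `filter_graph`, the tuple layout, and the sample identifiers are hypothetical and chosen for the example.

```python
def filter_graph(nodes, edges):
    """Drop duplicate relationships and nodes that no relationship touches.

    nodes: iterable of node identifiers (hypothetical, e.g. "gene:G1")
    edges: iterable of (source, relationship_type, target) tuples
    """
    # A set removes verbatim duplicates, which arise when several sources
    # report the same association.
    unique_edges = sorted(set(edges))
    # Keep only nodes that take part in at least one relationship.
    connected = {n for s, _, t in unique_edges for n in (s, t)}
    kept_nodes = sorted(n for n in set(nodes) if n in connected)
    return kept_nodes, unique_edges


# Hypothetical sample data: one duplicated edge and one unconnected node.
nodes = ["gene:G1", "disease:D1", "drug:X1", "gene:ORPHAN"]
edges = [
    ("gene:G1", "ASSOCIATED_WITH", "disease:D1"),
    ("gene:G1", "ASSOCIATED_WITH", "disease:D1"),  # same edge from a second source
    ("drug:X1", "TREATS", "disease:D1"),
]
kept_nodes, kept_edges = filter_graph(nodes, edges)
```

Here `gene:ORPHAN` is dropped because nothing connects to it, and the repeated gene-disease edge is kept once.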
Fig. 3. Construction of graph model of biological relationships. Each node represents a biological element, and nodes are connected by various types of relationships. Each node can define various properties. Relationships can be defined by various types, and each relationship has various properties. This allows a detailed search through the property when retrieving nodes and relationships.
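The property-graph idea in this caption can be sketched with plain Python dataclasses. The class names, field names, and the `source_db` property below are assumptions made for illustration; they are not the schema used in the paper or Neo4j's own API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str  # node type, e.g. "Gene" or "Disease"
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    source: str  # node_id of the start node
    target: str  # node_id of the end node
    rel_type: str
    properties: dict = field(default_factory=dict)


def find_relationships(relationships, rel_type, **wanted):
    """Return relationships of one type whose properties match all of `wanted`."""
    return [
        r
        for r in relationships
        if r.rel_type == rel_type
        and all(r.properties.get(k) == v for k, v in wanted.items())
    ]


# Hypothetical relationships with a per-edge provenance property.
rels = [
    Relationship("g1", "d1", "ASSOCIATED_WITH", {"source_db": "A"}),
    Relationship("g2", "d1", "ASSOCIATED_WITH", {"source_db": "B"}),
    Relationship("x1", "d1", "TREATS", {"source_db": "A"}),
]
hits = find_relationships(rels, "ASSOCIATED_WITH", source_db="A")
```

Because both the relationship type and its properties are stored on the edge, a search can narrow results on either one without an extra lookup table.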
Fig. 4. Schematic of an integrated graph model, showing the node types and the relationship types used in the integrated biological dataset and how nodes interact with one another. GO, gene ontology; SNP, single nucleotide polymorphism; CNV, copy number variant.
Fig. 5. Procedure for importing integrated relationship data into a graph database. ‘DataManager.java’ defines the relationship between each raw data to be input and performs preprocessing steps, such as removing duplicates. ‘Parsers.java’ reads raw data from a text file and stores them in the graph database. ‘Mapping.java’ classifies nodes and relationships from the parsed raw data. ‘Filter.java’ removes duplicate or ambiguous nodes and relationships among created nodes and relationships. ‘BuildManager.java’ structures the filtered nodes and relationships information according to the previously defined graph database model structure. ‘DataStructure.java’ and ‘Integrate.java’ build a graph database by allocating nodes and relationships according to the modeled database structure.
Fig. 6. Comparison of the performance of query execution between optimized and non-optimized servers. Two servers were queried using the same search operation; the optimized server took 138 ms, whereas the non-optimized server took 316 ms.
Fig. 7. Comparison of the performance of query execution between relational and graph databases. MySQL and Neo4j were compared by searching relationships on 3 and 4 layers. The search for 3 layers is a search for gene-disease-drugs associated with a particular disease. The search for 4 layers is a search for gene-protein-drug-pathway associated with a particular protein.
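The difference between the two query styles for the 3-layer search can be sketched as follows. This is an illustration under stated assumptions, not the benchmark from the paper: Python's built-in `sqlite3` stands in for MySQL, plain dictionaries stand in for Neo4j, and the table names, column names, and identifiers are all hypothetical.

```python
import sqlite3

# Relational stand-in (sqlite3 instead of MySQL): one table per association type.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene_disease (gene TEXT, disease TEXT);
    CREATE TABLE disease_drug (disease TEXT, drug TEXT);
    INSERT INTO gene_disease VALUES ('G1', 'D1'), ('G2', 'D1'), ('G3', 'D2');
    INSERT INTO disease_drug VALUES ('D1', 'X1'), ('D2', 'X2');
    """
)

# Relational style: the gene-disease-drug chain is rebuilt by a join at query time.
sql_rows = conn.execute(
    """
    SELECT gd.gene, gd.disease, dd.drug
    FROM gene_disease AS gd
    JOIN disease_drug AS dd ON dd.disease = gd.disease
    WHERE gd.disease = ?
    ORDER BY gd.gene, dd.drug
    """,
    ("D1",),
).fetchall()

# Graph stand-in: each disease node already holds its neighbours, so the same
# chain is a traversal that starts at the disease and follows stored links.
genes_of = {"D1": ["G1", "G2"], "D2": ["G3"]}
drugs_of = {"D1": ["X1"], "D2": ["X2"]}


def traverse(disease):
    """Return (gene, disease, drug) triples reachable from one disease node."""
    return sorted(
        (gene, disease, drug)
        for gene in genes_of.get(disease, [])
        for drug in drugs_of.get(disease, [])
    )


graph_rows = traverse("D1")
```

Both approaches return the same triples; the sketch shows only the shape of each query and says nothing about the timings reported in the figure.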
Fig. 8. Examples of using a graph database to find biologically meaningful information. Comparison of the nodes in the shortest path and the nodes in the other path (A) and flexible extension of the existing graph database with a new type of information (B).
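A shortest-path search of the kind referred to in panel A can be sketched with a breadth-first search over an adjacency list. The caption does not say which algorithm was used, so breadth-first search is an assumption here, and the small graph and its node names are hypothetical.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Return one shortest path (fewest edges) from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was first reached from
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour in previous:
                continue  # already reached by a path at least as short
            previous[neighbour] = current
            if neighbour == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Hypothetical undirected graph with two routes from gene G1 to drug X1:
# a short one through disease D1 and a longer one through protein P1.
adjacency = {
    "G1": ["P1", "D1"],
    "P1": ["G1", "PW1"],
    "PW1": ["P1", "X1"],
    "D1": ["G1", "X1"],
    "X1": ["PW1", "D1"],
}
path = shortest_path(adjacency, "G1", "X1")
```

The nodes on the returned path can then be compared with those on the longer route through `P1` and `PW1`.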
