High dimensional biological data retrieval optimization with NoSQL technology

Shicai Wang, Ioannis Pandis, Chao Wu, Sijin He, David Johnson, Ibrahim Emam, Florian Guitton, Yike Guo

PMID: 25435347
PMCID: PMC4248814
DOI: 10.1186/1471-2164-15-S8-S3

High dimensional biological data retrieval optimization with NoSQL technology

Shicai Wang et al. BMC Genomics. 2014.

. 2014;15 Suppl 8(Suppl 8):S3.

doi: 10.1186/1471-2164-15-S8-S3. Epub 2014 Nov 13.

Authors

Shicai Wang, Ioannis Pandis, Chao Wu, Sijin He, David Johnson, Ibrahim Emam, Florian Guitton, Yike Guo

PMID: 25435347
PMCID: PMC4248814
DOI: 10.1186/1471-2164-15-S8-S3

Abstract

Background: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data.

Results: In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB.

Conclusions: The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.

PubMed Disclaimer

Figures

**Figure 1**
**JSON example**. Example of a JSON object that maps to the patient record illustrated in Table 2.

**Figure 2**
**Performance of key-value vs. relational data model**. The bar chart shows the query retrieval times for each of the test cases over varying numbers of patient record queries. The NoSQL model implementation on HBase performs the best with an approximately 3.06~7.42-fold increase in query performance than relational model on MySQL Cluster and 2.68~10.50-fold increase than the relational model on MongoDB.

**Figure 3**
**Performance of Random Read vs. Scan in key-value data model**. The bar chart shows the query retrieval times for each of the test cases over varying numbers of patient record queries using both Random Read and Scan methods. The numbers above scan bar show the patient numbers the scan read in that case. The error bar shows the deviation of each test.

**Figure 4**
**Performance of Random Read in different Families**. The bar chart shows the query retrieval times for each of the test cases over varying numbers of patient record queries of three Families. The error bar shows the deviation of each test.

**Figure 5**
**Performance of Scan in different Families**. The bar chart shows the query retrieval times for each of the test cases over varying numbers of patient record queries of three Families. The error bar shows the deviation of each test.

See this image and copyright information in PMC

References

1. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall K a, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013;41(Database):D991–5. - PMC - PubMed
1. Kapushesky M, Adamusiak T, Burdett T, Culhane A, Farne A, Filippov A, Holloway E, Klebanov A, Kryvych N, Kurbatova N, Kurnosov P, Malone J, Melnichuk O, Petryszak R, Pultsin N, Rustici G, Tikhonov A, Travillian RS, Williams E, Zorin A, Parkinson H, Brazma A. Gene Expression Atlas update -- a value-added database of microarray and sequencing-based functional genomics experiments. Gene Expr. 2012;40:1–5. - PMC - PubMed
1. Szalma S, Koka V, Khasanova T, Perakslis ED. Effective knowledge management in translational medicine. J Transl Med. 2010;8:68. doi: 10.1186/1479-5876-8-68. - DOI - PMC - PubMed
1. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP. GenePattern 2.0. Nat Genet. 2006. pp. 500–501. - PubMed
1. Cheadle C, Vawter MP, Freed WJ, Becker KG. Analysis of microarray data using Z score transformation. J Mol Diagn. 2003;5:73–81. doi: 10.1016/S1525-1578(10)60455-2. - DOI - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

High dimensional biological data retrieval optimization with NoSQL technology

High dimensional biological data retrieval optimization with NoSQL technology

Authors

Abstract

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources