A Data Colocation Grid Framework for Big Data Medical Image Processing: Backend Design

Affiliations

¹ Computer Science, Vanderbilt University, Nashville, TN, USA 37235.
² Electrical Engineering, Vanderbilt University, Nashville, TN, USA 37235.
³ Biomedical Engineering, Vanderbilt University, Nashville, TN, USA 37235.

PMID: 29887668
PMCID: PMC5991614
DOI: 10.1117/12.2293694

Shunxing Bao et al. Proc SPIE Int Soc Opt Eng. 2018 Mar:10597:105790A. doi: 10.1117/12.2293694.

Abstract

When processing large medical imaging studies, adopting high performance grid computing resources rapidly becomes important. We recently presented a "medical image processing-as-a-service" grid framework that shows promise in using the Apache Hadoop ecosystem and HBase for data colocation by moving computation close to medical image storage. However, the framework has not yet proven easy to use in a heterogeneous hardware environment. Furthermore, the system has not yet been validated against the variety of multi-level analyses in medical imaging. Our target design criteria are (1) improving the framework's performance in a heterogeneous cluster, (2) performing population-based summary statistics on large datasets, and (3) introducing a table design scheme for rapid NoSQL queries. In this paper, we present a heuristic backend application programming interface (API) design for Hadoop & HBase for Medical Image Processing (HadoopBase-MIP). The API includes Upload, Retrieve, Remove, Load balancer (for heterogeneous clusters), and MapReduce templates. A dataset summary statistic model is discussed and implemented with the MapReduce paradigm. We introduce an HBase table scheme for fast data queries to better utilize the MapReduce model. Briefly, 5153 T1 images were retrieved from a university secure, shared web database and used to empirically assess an in-house grid with 224 heterogeneous CPU cores. Results from three empirical experiments are presented and discussed: (1) the load balancer improves wall time 1.5-fold compared with a framework with a built-in data allocation strategy; (2) the summary statistic model is empirically verified on the grid framework and compared with the same cluster deployed with a standard Sun Grid Engine (SGE), yielding an 8-fold reduction in wall-clock time and a 14-fold reduction in resource time; and (3) the proposed HBase table scheme improves MapReduce computation, with a 7-fold reduction in wall time compared with a naïve scheme when datasets are relatively small. The source code and interfaces have been made publicly available.
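
To make the summary statistic idea concrete, the following is a minimal in-memory sketch of a MapReduce-style group average. It is not the HadoopBase-MIP implementation: the record fields (`sex`, `age`, `voxels`), the decade-wide age bins, and the function names are all assumptions chosen for illustration, and images are reduced to short lists of numbers.

```python
from collections import defaultdict
from functools import reduce


def map_image(record):
    """Map step: emit (group key, (voxel sums, count)) for one image."""
    # Hypothetical grouping: sex plus the decade the age falls in.
    key = (record["sex"], record["age"] // 10 * 10)
    return key, (list(record["voxels"]), 1)


def combine(a, b):
    """Reduce step: add voxel sums element-wise and add the counts."""
    sums_a, n_a = a
    sums_b, n_b = b
    return [x + y for x, y in zip(sums_a, sums_b)], n_a + n_b


def group_means(records):
    """Run map, group by key (the 'shuffle'), then reduce to a mean image."""
    grouped = defaultdict(list)
    for key, value in map(map_image, records):
        grouped[key].append(value)
    means = {}
    for key, values in grouped.items():
        sums, n = reduce(combine, values)
        means[key] = [s / n for s in sums]
    return means


# Toy records standing in for T1 images (values are made up).
records = [
    {"sex": "F", "age": 23, "voxels": [1.0, 2.0]},
    {"sex": "F", "age": 27, "voxels": [3.0, 4.0]},
    {"sex": "M", "age": 61, "voxels": [5.0, 6.0]},
]
means = group_means(records)
```

Because the reduce step only carries partial sums and counts, partial results from different nodes can be merged in any order, which is what makes an average a natural fit for this paradigm.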

Figures

**Figure 1**
Use cases for three main challenges. (A) If a traditional cluster model is used, average throughput would be seen (red dash), which would leave some machines starved (e.g., A, B) while others are overloaded (e.g., C, D, and E). Hence, a traditional approach will degrade the overall execution time due to those overloaded or starved cores. (B) The time to process a large dataset depends on the total number of jobs and the duration of the longest map job. The total number of jobs should be neither too large nor too small. (C) HBase is not designed for storing image data, given the variability in size and volume of medical imaging studies. If information such as age, sex, or genetics is stored in the same column as the image data, image traversal is unavoidable, which degrades search efficiency.
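
The cost described in panel (C) can be illustrated with a toy model that counts how many bytes a metadata filter has to read. This is a plain-dictionary sketch, not the HBase API and not the paper's actual table scheme; the `family:qualifier` column names (`meta:sex`, `meta:age`, `img:data`) and row keys are hypothetical.

```python
def scan(rows, columns=None):
    """Read the requested columns of every row; return (cells, bytes read).

    With columns=None every cell is read, which models a layout where the
    filter cannot be evaluated without touching the image payload.
    """
    bytes_read = 0
    result = {}
    for row_key, cells in rows.items():
        picked = {
            name: value
            for name, value in cells.items()
            if columns is None or name in columns
        }
        bytes_read += sum(len(value) for value in picked.values())
        result[row_key] = picked
    return result, bytes_read


def select_by_sex(scanned, sex):
    """Filter scanned rows on the (hypothetical) meta:sex column."""
    return sorted(k for k, c in scanned.items() if c.get("meta:sex") == sex)


# Three toy subjects; each image payload is 1000 zero bytes.
rows = {
    "subj001": {"meta:sex": b"F", "meta:age": b"23", "img:data": bytes(1000)},
    "subj002": {"meta:sex": b"M", "meta:age": b"61", "img:data": bytes(1000)},
    "subj003": {"meta:sex": b"F", "meta:age": b"27", "img:data": bytes(1000)},
}

# Naive: the filter reads every cell, including the images.
naive_rows, naive_bytes = scan(rows)
# Separated: the filter reads only the small metadata column.
meta_rows, meta_bytes = scan(rows, columns={"meta:sex"})

naive_hits = select_by_sex(naive_rows, b"F")
meta_hits = select_by_sex(meta_rows, b"F")
```

Both queries select the same subjects, but the separated layout reads only the metadata cells, so the gap grows with image size.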

**Figure 2**
HadoopBase-MIP system interface overview. Except for cluster monitoring, all operations are extended.

**Figure 3**
MapReduce model implementation for constructing population specific brain MRI templates.

**Figure 4**
Experiment cluster setup and data allocation for HadoopBase-MIP. Two different systems (eight machines with 12 slower cores and four machines with 32 fast cores) are used in the cluster. Before applying the load balancer, each machine contains a similar amount of image data. After using the load balancer, the data allocations match the ratio #CPU*MIPS.
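
The #CPU*MIPS ratio in this caption can be sketched as a proportional allocation. The machine counts, core counts, and the total of 5153 images come from the text; the relative MIPS values (1.0 and 2.0) are invented for illustration, and the largest-remainder rounding is an assumption, not necessarily how the HadoopBase-MIP load balancer rounds.

```python
def allocate(total_items, machines):
    """Split total_items across machines in proportion to cores * mips.

    Uses largest-remainder rounding so the integer shares sum exactly
    to total_items.
    """
    weights = [m["cores"] * m["mips"] for m in machines]
    total_weight = sum(weights)
    exact = [total_items * w / total_weight for w in weights]
    alloc = [int(x) for x in exact]  # floor of each exact share
    leftover = total_items - sum(alloc)
    # Hand the remaining items to the largest fractional parts first.
    order = sorted(
        range(len(machines)),
        key=lambda i: exact[i] - alloc[i],
        reverse=True,
    )
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Cluster shape from the caption; the mips values are hypothetical.
machines = (
    [{"cores": 12, "mips": 1.0} for _ in range(8)]
    + [{"cores": 32, "mips": 2.0} for _ in range(4)]
)
alloc = allocate(5153, machines)
```

Under these assumed speeds, each fast machine receives several times more images than each slow one, rather than the near-equal split a default allocation would give.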

**Figure 5**
Qualitative results for summary statistics analysis on large datasets and age / sex-specific image averaging analysis.

**Figure 6**
Proposed table scheme design vs. naïve scheme vs. SGE
