Distributed Tensor Decomposition for Large Scale Health Analytics

Huan He¹, Jette Henderson², Joyce C Ho¹

PMID: 31198910
PMCID: PMC6563812
DOI: 10.1145/3308558.3313548

Huan He et al. Proc Int World Wide Web Conf. 2019 May;2019:659-669.

Affiliations

¹ Emory University, Atlanta, Georgia.
² CognitiveScale, Austin, Texas.

Abstract

In the past few decades, there has been rapid growth in the quantity and variety of healthcare data. These large datasets are usually high dimensional (e.g., patients, their diagnoses, and the medications used to treat those diagnoses) and cannot be adequately represented as matrices. Thus, many existing algorithms cannot analyze them. To accommodate such high-dimensional data, tensor factorization, which can be viewed as a higher-order extension of methods like PCA, has attracted much attention and emerged as a promising solution. However, tensor factorization is computationally expensive, and existing methods developed to factor large tensors are not flexible enough for real-world situations. To address this scaling problem more efficiently, we introduce SGranite, a distributed, scalable, and sparse tensor factorization method fit through stochastic gradient descent. SGranite offers three contributions: (1) Scalability: it employs a block partitioning and parallel processing design and thus scales to large tensors, (2) Accuracy: we show that our method can achieve results faster without sacrificing the quality of the tensor decomposition, and (3) FlexibleConstraints: we show that our approach can encompass various kinds of constraints, including l2 norm, l1 norm, and logistic regularization. We demonstrate SGranite's capabilities in two real-world use cases. In the first, we use Google searches for flu-like symptoms to characterize and predict influenza patterns. In the second, we use SGranite to extract clinically interesting sets (i.e., phenotypes) of patients from electronic health records. Through these case studies, we show that SGranite has the potential to rapidly characterize, predict, and manage large multimodal datasets, thereby promising a novel, data-driven solution that can benefit very large segments of the population.

Keywords: Apache Spark; Distributed Algorithm; Health Analytics; Tensor Decomposition; User-Generated Content; Web Mining.

Figures

**Figure 1**
An example of CP decomposition for influenza search data. A tensor constructed from time series data is decomposed into a weighted sum of rank-one tensors by minimizing an objective function. Each rank-one tensor, formed by taking the outer product of factor vectors, constitutes a latent factor.
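
The weighted sum of rank-one tensors described in this caption can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `cp_reconstruct`, the list-of-lists layout for factor matrices, and the toy numbers are all hypothetical and do not come from the paper.

```python
from itertools import product


def cp_reconstruct(weights, factors):
    """Rebuild a dense tensor from CP factors (illustrative helper, not from the paper).

    weights: list of R floats, one weight per rank-one component.
    factors: list of N factor matrices; factors[n][i][r] is the entry for
             index i of mode n in component r.
    Returns a dict mapping each index tuple to its reconstructed value.
    """
    rank = len(weights)
    shape = [len(matrix) for matrix in factors]
    tensor = {}
    for idx in product(*(range(size) for size in shape)):
        total = 0.0
        for r in range(rank):
            # One rank-one term: weight times the product of one factor entry per mode.
            term = weights[r]
            for mode, i in enumerate(idx):
                term *= factors[mode][i][r]
            total += term
        tensor[idx] = total
    return tensor


# Toy rank-1 example with a 2 x 2 x 1 tensor (made-up numbers).
toy = cp_reconstruct(
    [2.0],
    [[[1.0], [2.0]], [[3.0], [4.0]], [[5.0]]],
)
```

Here `toy[(1, 1, 0)]` is the weight 2.0 multiplied by the factor entries 2.0, 4.0, and 5.0.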

**Figure 2**
A graphical example of SGranite. With 2 workers, partitioning yields 8 blocks and 4 strata. The process runs iteratively until convergence. In each epoch, starting from the first stratum, each worker runs SGD on its assigned block in parallel. Once all strata have been processed, convergence is checked, and the procedure is repeated if the stopping criterion is not satisfied. All intermediate results are saved as Resilient Distributed Dataset (RDD) collections and cached in memory.
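
The block and stratum counts in this caption (8 blocks and 4 strata for 2 workers) are consistent with splitting each mode of a 3-mode tensor into as many ranges as there are workers. The sketch below assumes a cyclic-shift construction, which is one common way to build strata whose blocks share no index range in any mode; the paper's actual partitioning scheme may differ, and `build_strata` is a hypothetical name.

```python
from itertools import product


def build_strata(num_workers, num_modes=3):
    """Group blocks into strata so that blocks in one stratum never overlap.

    A block is a tuple of partition ids, one per mode. Within a stratum, the
    block given to worker p uses partition p in mode 0 and a cyclically
    shifted partition in every other mode (an assumed construction).
    """
    strata = []
    for shifts in product(range(num_workers), repeat=num_modes - 1):
        stratum = [
            (p,) + tuple((p + s) % num_workers for s in shifts)
            for p in range(num_workers)
        ]
        strata.append(stratum)
    return strata


strata = build_strata(num_workers=2)
all_blocks = {block for stratum in strata for block in stratum}
```

With 2 workers and 3 modes this gives 4 strata of 2 blocks each, covering all 8 blocks. Because no two blocks in a stratum share a partition in any mode, the workers touch disjoint rows of every factor matrix and can run SGD without conflicting updates.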

**Figure 3**
A graphical example of one stratum training. Given one stratum of training data and factor matrices A⁽¹⁾, A⁽²⁾, A⁽³⁾, we run SGD on each block in parallel. The factor matrices A⁽¹⁾, A⁽²⁾, A⁽³⁾ are then updated and used as the initialization for the next stratum training.
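
To make the per-block SGD step concrete, here is a minimal sketch of a single stochastic update on one observed tensor entry. It assumes a plain squared-error loss and no constraints purely for simplicity; the objective and regularizers actually used by SGranite are not given on this page, and `sgd_step` is a hypothetical helper.

```python
def sgd_step(entry, A, B, C, lr=0.01):
    """Apply one SGD update for a single observed entry of a 3-mode tensor.

    entry: (i, j, k, x) with indices into the three modes and observed value x.
    A, B, C: factor matrices as lists of rows, each row holding R floats.
    Assumes the loss 0.5 * (prediction - x) ** 2 (an illustrative choice).
    Returns the loss measured before the update.
    """
    i, j, k, x = entry
    rank = len(A[i])
    pred = sum(A[i][r] * B[j][r] * C[k][r] for r in range(rank))
    err = pred - x
    for r in range(rank):
        # Read the old values first so all three gradients use the same point.
        a, b, c = A[i][r], B[j][r], C[k][r]
        A[i][r] = a - lr * err * b * c
        B[j][r] = b - lr * err * a * c
        C[k][r] = c - lr * err * a * b
    return 0.5 * err * err


# Toy run: a rank-1 model fitted to one entry with value 2.0 (made-up numbers).
A, B, C = [[1.0]], [[1.0]], [[1.0]]
first_loss = sgd_step((0, 0, 0, 2.0), A, B, C, lr=0.1)
second_loss = sgd_step((0, 0, 0, 2.0), A, B, C, lr=0.1)
```

In the stratum setting, each worker would loop this kind of update over the entries of its own block, and the updated rows would seed the next stratum.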

**Figure 4**
Comparison of two distributed and two non-distributed CP models using KL divergence. SGranite converges in fewer epochs than the other methods. The negative KL divergence arises from the fact that the observed values are not probability measurements.
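
The remark about negative values can be checked with a small numeric example. The sketch below assumes the plain sum of x·log(x/m) over entries; the exact formula behind the figure is not stated on this page, and `kl_sum` is a hypothetical helper.

```python
import math


def kl_sum(observed, model):
    """Sum x * log(x / m) over paired entries, skipping zero observations.

    This is the usual KL divergence only when both inputs are probability
    distributions; for raw counts it can go negative.
    """
    return sum(x * math.log(x / m) for x, m in zip(observed, model) if x > 0)


# Unnormalized values: the model overshoots the single observation.
raw = kl_sum([1.0], [2.0])

# Proper probability distributions: the divergence is non-negative.
normalized = kl_sum([0.5, 0.5], [0.25, 0.75])
```

In this toy case `raw` is negative while `normalized` is positive, which matches the caption's explanation that the sign comes from the inputs not being probabilities.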

**Figure 5**
The speed-up curve for both datasets. It shows that the analysis of large datasets gains a clear speed-up from using SGranite.

**Figure 6**
A comparison of the learned latent factors with and without constraints using R = 3. Years range from 2003 to 2015.

**Figure 7**
Latent factors obtained using Flexifact. Years range from 2003 to 2015.

**Figure 8**
This figure, downloaded from the CDC, shows the actual influenza-positive tests reported to the CDC in 2010–2011, week ending Oct 01, 2001.

Cited by

Creating High-Quality Synthetic Health Data: Framework for Model Development and Validation.
Karimian Sichani E, Smith A, El Emam K, Mosquera L. JMIR Form Res. 2024 Apr 22;8:e53241. doi: 10.2196/53241. PMID: 38648097 Free PMC article.
Deep representation learning of patient data from Electronic Health Records (EHR): A systematic review.
Si Y, Du J, Li Z, Jiang X, Miller T, Wang F, Jim Zheng W, Roberts K. J Biomed Inform. 2021 Mar;115:103671. doi: 10.1016/j.jbi.2020.103671. Epub 2020 Dec 31. PMID: 33387683 Free PMC article.
Communication Efficient Federated Generalized Tensor Factorization for Collaborative Health Data Analytics.
Ma J, Zhang Q, Lou J, Xiong L, Ho JC. Proc Int World Wide Web Conf. 2021 Apr;2021:171-182. doi: 10.1145/3442381.3449832. PMID: 34467367 Free PMC article.
Improving Diagnostics with Deep Forest Applied to Electronic Health Records.
Khodadadi A, Ghanbari Bousejin N, Molaei S, Kumar Chauhan V, Zhu T, Clifton DA. Sensors (Basel). 2023 Jul 21;23(14):6571. doi: 10.3390/s23146571. PMID: 37514865 Free PMC article.

Grants and funding

K01 LM012924/LM/NLM NIH HHS/United States
