Proc Int World Wide Web Conf. 2019 May:2019:659-669.
doi: 10.1145/3308558.3313548.

Distributed Tensor Decomposition for Large Scale Health Analytics

Huan He et al. Proc Int World Wide Web Conf. 2019 May.

Abstract

In the past few decades, there has been rapid growth in the quantity and variety of healthcare data. These large data sets are usually high-dimensional (e.g., patients, their diagnoses, and medications to treat their diagnoses) and cannot be adequately represented as matrices. Thus, many existing algorithms cannot analyze them. To accommodate these high-dimensional data, tensor factorization, which can be viewed as a higher-order extension of methods like PCA, has attracted much attention and emerged as a promising solution. However, tensor factorization is a computationally expensive task, and existing methods developed to factor large tensors are not flexible enough for real-world situations. To address this scaling problem more efficiently, we introduce SGranite, a distributed, scalable, and sparse tensor factorization method fit through stochastic gradient descent. SGranite offers three contributions: (1) Scalability: it employs a block partitioning and parallel processing design and thus scales to large tensors, (2) Accuracy: we show that our method can achieve results faster without sacrificing the quality of the tensor decomposition, and (3) FlexibleConstraints: we show our approach can encompass various kinds of constraints including l2 norm, l1 norm, and logistic regularization. We demonstrate SGranite's capabilities in two real-world use cases. In the first, we use Google searches for flu-like symptoms to characterize and predict influenza patterns. In the second, we use SGranite to extract clinically interesting sets (i.e., phenotypes) of patients from electronic health records. Through these case studies, we show SGranite has the potential to be used to rapidly characterize, predict, and manage large multimodal datasets, thereby promising a novel, data-driven solution that can benefit very large segments of the population.

Keywords: Apache Spark; Distributed Algorithm; Health Analytics; Tensor Decomposition; User-Generated Content; Web Mining.

Figures

Figure 1
An example of CP decomposition for influenza search data. A tensor constructed from time series data is decomposed into a weighted sum of rank-one tensors by minimizing an objective function. Each rank-one tensor, formed by taking the outer product of factor vectors, constitutes a latent factor.
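To make the "weighted sum of rank-one tensors" concrete, the following is a minimal sketch in plain Python. The function name `cp_reconstruct`, the toy weights, and the factor matrices are illustrative assumptions and do not come from the paper; the sketch only rebuilds a small 3-mode tensor from given factors and does not fit them.

```python
def cp_reconstruct(weights, A, B, C):
    """Rebuild a 3-mode tensor as a weighted sum of R rank-one tensors.

    A, B, C are factor matrices stored as lists of rows (I x R, J x R, K x R).
    Component r contributes weights[r] * outer(A[:, r], B[:, r], C[:, r]).
    """
    I, J, K = len(A), len(B), len(C)
    R = len(weights)
    X = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for r in range(R):
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    X[i][j][k] += weights[r] * A[i][r] * B[j][r] * C[k][r]
    return X


# Hypothetical toy factors with R = 2 latent components.
weights = [2.0, 0.5]
A = [[1.0, 0.0], [0.0, 1.0]]              # 2 x R
B = [[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]]  # 3 x R
C = [[1.0, 0.0], [1.0, 1.0]]              # 2 x R

X = cp_reconstruct(weights, A, B, C)
```

Here entry `X[0][0][0]` receives a contribution only from the first component, while `X[1][1][1]` receives one only from the second, which is how each rank-one term acts as a separate latent factor.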
Figure 2
A graphical example of SGranite. With 2 workers, partitioning yields 8 blocks and 4 strata. The process runs iteratively until convergence. In each epoch, starting from the first stratum, each worker runs SGD on its assigned block in parallel. Convergence is checked once all strata have been processed, and the procedure repeats if the stopping criterion is not satisfied. All intermediate results are saved as Resilient Distributed Dataset (RDD) collections and cached in memory.
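The block and stratum counts in this caption can be reproduced with a short sketch. It assumes a 3-mode tensor whose every mode is split into as many index ranges as there are workers, and it uses a cyclic-shift rule to group blocks so that no two blocks in a stratum share an index range in any mode. The function `make_strata` and the shift rule are assumptions for illustration; the caption does not state how SGranite assigns blocks to strata.

```python
from itertools import product


def make_strata(num_workers, num_modes=3):
    """Group the d**num_modes blocks into d**(num_modes - 1) strata.

    A block is a tuple of per-mode partition indices. Within a stratum,
    block p uses partition p in mode 0 and a cyclically shifted partition
    in every other mode, so the d blocks never overlap in any mode.
    """
    d = num_workers
    strata = []
    for shifts in product(range(d), repeat=num_modes - 1):
        stratum = [
            (p,) + tuple((p + s) % d for s in shifts)
            for p in range(d)
        ]
        strata.append(stratum)
    return strata


strata = make_strata(2)
for index, stratum in enumerate(strata):
    print(f"stratum {index}: blocks {stratum}")
```

With 2 workers this produces 4 strata of 2 blocks each, 8 blocks in total. Because the blocks of a stratum touch disjoint rows of every factor matrix, each worker can update its block without conflicting with the others.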
Figure 3
A graphical example of one stratum training: Given one stratum of training data and factor matrices A(1), A(2), A(3), we run SGD on each block in parallel. Then factor matrices A(1), A(2), A(3) are updated and used as the initialization for the next stratum training.
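As a rough illustration of what running SGD on a single block can look like, the sketch below updates three factor matrices from the nonzero entries of one block. It assumes a squared-error loss, a fixed learning rate, and no constraints; the names `sgd_block` and `block_loss` and the toy entries are hypothetical, and the actual loss and update rule used by SGranite may differ. Parallel execution across workers is not reproduced here.

```python
import random


def predict(A, B, C, i, j, k):
    """CP model value at index (i, j, k)."""
    return sum(A[i][r] * B[j][r] * C[k][r] for r in range(len(A[0])))


def block_loss(entries, A, B, C):
    """Sum of squared errors over the nonzero entries of a block."""
    return sum((predict(A, B, C, i, j, k) - x) ** 2 for i, j, k, x in entries)


def sgd_block(entries, A, B, C, lr=0.05):
    """One SGD pass over a block; updates the factor rows in place."""
    R = len(A[0])
    for i, j, k, x in entries:
        err = predict(A, B, C, i, j, k) - x
        for r in range(R):
            a, b, c = A[i][r], B[j][r], C[k][r]
            # Gradient of 0.5 * err**2 with respect to each factor entry.
            A[i][r] -= lr * err * b * c
            B[j][r] -= lr * err * a * c
            C[k][r] -= lr * err * a * b


random.seed(0)
R = 2
A = [[random.uniform(0.1, 1.0) for _ in range(R)] for _ in range(3)]
B = [[random.uniform(0.1, 1.0) for _ in range(R)] for _ in range(3)]
C = [[random.uniform(0.1, 1.0) for _ in range(R)] for _ in range(2)]

# Hypothetical nonzeros of one block as (i, j, k, value) tuples.
entries = [(0, 0, 0, 1.0), (1, 2, 1, 2.0), (2, 1, 0, 0.5)]

before = block_loss(entries, A, B, C)
for _ in range(200):
    sgd_block(entries, A, B, C)
after = block_loss(entries, A, B, C)
```

The updated `A`, `B`, and `C` would then serve as the starting point for the next stratum, matching the hand-off described in the caption.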
Figure 4
Comparison of two distributed and two non-distributed CP models using KL divergence. SGranite converges in fewer epochs than the other methods. The negative KL divergence arises from the fact that the observed values are not probability measurements.
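The remark about negative values can be checked numerically. The sketch below assumes the plotted quantity has the plain form sum of x * log(x / m), which is guaranteed non-negative only when both arguments are probability distributions; the caption does not state the exact formula, so this is an assumption. A generalized variant with the extra -x + m terms is shown for contrast. The function names and numbers are illustrative.

```python
import math


def kl_sum(observed, model):
    """Plain KL-style sum of x * log(x / m), skipping zero observations."""
    return sum(x * math.log(x / m) for x, m in zip(observed, model) if x > 0)


def generalized_kl(observed, model):
    """Generalized KL for unnormalized data: x * log(x / m) - x + m."""
    total = 0.0
    for x, m in zip(observed, model):
        if x > 0:
            total += x * math.log(x / m)
        total += m - x
    return total


# Both arguments sum to 1, so the plain sum is non-negative.
probs = kl_sum([0.5, 0.5], [0.9, 0.1])

# Unnormalized counts: the plain sum can drop below zero.
counts = kl_sum([1.0, 3.0], [2.0, 4.0])

# The generalized form stays non-negative on the same counts.
counts_generalized = generalized_kl([1.0, 3.0], [2.0, 4.0])
```

In this toy case `probs` is positive and `counts` is negative, consistent with the explanation that unnormalized observations can push the plain sum below zero.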
Figure 5
Speed-up curves for both datasets. They show that SGranite yields a clear speed-up when analyzing large datasets.
Figure 6
A comparison of the learned latent factors with and without constraints using R = 3. Years span 2003 to 2015.
Figure 7
Latent factors obtained using Flexifact. Years span 2003 to 2015.
Figure 8
This figure, downloaded from the CDC, shows the actual influenza-positive tests reported to the CDC in 2010–2011, week ending Oct 01, 2001.
