J Biomed Inform. 2014 Dec;52:199-211.

doi: 10.1016/j.jbi.2014.07.001. Epub 2014 Jul 16.

Limestone: high-throughput candidate phenotype generation via tensor factorization

Joyce C Ho¹, Joydeep Ghosh², Steve R Steinhubl³, Walter F Stewart⁴, Joshua C Denny⁵, Bradley A Malin⁶, Jimeng Sun⁷

Affiliations

¹ Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, United States. Electronic address: joyceho@utexas.edu.
² Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, United States.
³ Scripps Translational Science Institute, Scripps Health, La Jolla, CA 92037, United States.
⁴ Sutter Health Research, Development, and Dissemination Team, Sutter Health, Walnut Creek, CA 94598, United States.
⁵ Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232, United States; Department of Medicine, Vanderbilt University, Nashville, TN 37232, United States.
⁶ Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232, United States; Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37232, United States.
⁷ School of Computational Science and Engineering at College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, United States.

PMID: 25038555
PMCID: PMC6563906
DOI: 10.1016/j.jbi.2014.07.001

Joyce C Ho et al. J Biomed Inform. 2014 Dec.

Abstract

The rapidly increasing availability of electronic health records (EHRs) from multiple heterogeneous sources has spearheaded the adoption of data-driven approaches for improved clinical research, decision making, prognosis, and patient management. Unfortunately, EHR data do not always directly and reliably map to medical concepts that clinical researchers need or use. Some recent studies have focused on EHR-derived phenotyping, which aims at mapping the EHR data to specific medical concepts; however, most of these approaches require labor-intensive supervision from experienced clinical professionals. Furthermore, existing approaches are often disease-centric and specialized to the idiosyncrasies of the information technology and/or business practices of a single healthcare organization. In this paper, we propose Limestone, a nonnegative tensor factorization method to derive phenotype candidates with virtually no human supervision. Limestone represents the data source interactions naturally using tensors (a generalization of matrices). In particular, we investigate the interaction of diagnoses and medications among patients. The resulting tensor factors are reported as phenotype candidates that automatically reveal patient clusters on specific diagnoses and medications. Using the proposed method, multiple phenotypes can be identified simultaneously from data. We demonstrate the capability of Limestone on a cohort of 31,815 patient records from the Geisinger Health System. The dataset spans 7 years of longitudinal patient records and was initially constructed for a heart failure onset prediction study. Our experiments demonstrate the robustness, stability, and conciseness of Limestone-derived phenotypes. Our results show that using only 40 phenotypes, we can outperform the original 640 features (169 diagnosis categories and 471 medication types) to achieve an area under the receiver operating characteristic curve (AUC) of 0.720 (95% CI 0.715 to 0.725). Moreover, in consultation with a medical expert, we confirmed that 82% of the top 50 candidates automatically extracted by Limestone are clinically meaningful.
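
As a rough illustration of the idea described in the abstract, the sketch below fits a rank-R nonnegative CP model to a small patient × diagnosis × medication tensor, so that each rank-one component plays the role of one candidate phenotype. It is a generic least-squares factorization with multiplicative updates, not the Limestone algorithm itself; the toy tensor, the function names (`nonneg_cp`, `khatri_rao`, `unfold`, `reconstruct`), and all parameter values are assumptions made for illustration.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product; rows are indexed by (row of a, row of b)."""
    rank = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, rank)

def unfold(x, mode):
    """Matricize a 3-way tensor along `mode`, keeping the other modes in order."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def nonneg_cp(x, rank, n_iter=200, eps=1e-9, seed=0):
    """Nonnegative CP via multiplicative updates (illustrative stand-in only)."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) + 0.1 for dim in x.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(x, mode) @ kr
            denom = factors[mode] @ (kr.T @ kr) + eps
            factors[mode] *= numer / denom  # keeps every entry nonnegative
    return factors

def reconstruct(factors):
    """Sum of rank-one terms: patient x diagnosis x medication."""
    a, b, c = factors
    return np.einsum("ir,jr,kr->ijk", a, b, c)

def rel_error(x, factors):
    return np.linalg.norm(x - reconstruct(factors)) / np.linalg.norm(x)

# Toy tensor with two planted "phenotypes" (synthetic, not real EHR data).
rng = np.random.default_rng(1)
planted = [rng.random((6, 2)), rng.random((4, 2)), rng.random((5, 2))]
x = reconstruct(planted)

factors = nonneg_cp(x, rank=2)
print("relative reconstruction error:", rel_error(x, factors))
```

Column r of the three returned factor matrices can be read together as one candidate: the patients, diagnoses, and medications with the largest entries in that column.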

Keywords: Dimensionality reduction; EHR phenotyping; Nonnegative tensor factorization.

Figures

**Fig. 1**
EHR matrix representations and matrix factorization.

**Fig. 2**
Generating candidate phenotypes using CP tensor factorization.

**Fig. 3**
A high-level depiction of the Limestone process by which candidate phenotypes are generated and patients are projected onto the candidates.

**Fig. 4**
The observation window is defined as a fixed time window prior to the index date (e.g. diagnosis date) and is used to determine the data used for tensor construction. The medication orders in gray are excluded during feature construction because they are outside the observation window.
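
The windowing rule in this caption can be sketched as a simple date filter applied before counting. The record layout (patient, diagnosis, medication, order date), the 730-day window length, and the choice to include the window start but exclude the index date itself are all assumptions for illustration; the caption only states that the window is fixed and precedes the index date.

```python
from collections import Counter
from datetime import date, timedelta

def in_observation_window(event_date, index_date, window_days):
    """True if the event falls in the fixed window that ends at the index date."""
    start = index_date - timedelta(days=window_days)
    return start <= event_date < index_date

def build_counts(orders, index_dates, window_days=730):
    """Count (patient, diagnosis, medication) co-occurrences inside the window."""
    counts = Counter()
    for patient, diagnosis, medication, order_date in orders:
        if in_observation_window(order_date, index_dates[patient], window_days):
            counts[(patient, diagnosis, medication)] += 1
    return counts

# Hypothetical medication orders for one patient.
orders = [
    ("p1", "hypertension", "beta blocker", date(2009, 3, 1)),
    ("p1", "hypertension", "beta blocker", date(2009, 9, 1)),
    ("p1", "diabetes", "metformin", date(2005, 1, 15)),     # before the window
    ("p1", "hypertension", "diuretic", date(2010, 2, 1)),   # after the index date
]
index_dates = {"p1": date(2010, 1, 1)}

counts = build_counts(orders, index_dates, window_days=730)
print(counts)
```

The resulting counts are the non-zero entries of a sparse patient × diagnosis × medication tensor; orders outside the window never contribute.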

**Fig. 5**
An example of the kth candidate phenotype produced from the tensor factorization, and the interpretation of the tensor factorization result. The green, blue, and red text correspond to non-zero elements in the patient, diagnosis, and medication factors, respectively. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

**Fig. 6**
A new patient’s phenotype membership vector is computed by projecting the new patient’s data onto the R candidate phenotypes in the purple dashed line. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
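
One way to picture the projection in this caption: hold the diagnosis and medication factors fixed and solve only for the new patient's nonnegative weights over the R candidates. The sketch below does this with a least-squares objective and multiplicative updates, which is an assumption and may differ from the procedure used in the paper; `project_patient` and the toy factors are hypothetical.

```python
import numpy as np

def project_patient(m, diag_factor, med_factor, n_iter=500, eps=1e-12):
    """Nonnegative weights w so that sum_r w[r] * outer(diag[:, r], med[:, r]) ~ m."""
    rank = diag_factor.shape[1]
    # Column r of z is the flattened outer product of the r-th factor pair.
    z = np.einsum("jr,kr->jkr", diag_factor, med_factor).reshape(-1, rank)
    target = m.reshape(-1)
    ztz, ztm = z.T @ z, z.T @ target
    w = np.ones(rank)
    for _ in range(n_iter):
        w *= ztm / (ztz @ w + eps)  # multiplicative step keeps w nonnegative
    return w

# Two toy candidate phenotypes over 3 diagnoses and 3 medications.
diag_factor = np.array([[0.7, 0.0], [0.3, 0.0], [0.0, 1.0]])
med_factor = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 1.0]])

# A new patient whose diagnosis x medication counts follow only the first candidate.
new_patient = 2.0 * np.outer(diag_factor[:, 0], med_factor[:, 0])

membership = project_patient(new_patient, diag_factor, med_factor)
print(membership)
```

The returned vector has one entry per candidate phenotype and can serve as a compact feature vector for a downstream classifier.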

**Fig. 7**
Objective function and similarity scores as a function of the number of total iterations for the case patients tensor. The error bars indicate the 95% confidence interval.

**Fig. 8**
Similarity scores to the original tensor factorization results for perturbed versions of the case patient tensor.

**Fig. 9**
The distribution of non-zero element values for 50 Limestone-derived phenotypes.

**Fig. 10**
The number of non-zero entries per factor using a threshold of 0.05.

**Fig. 11**
The most significant Limestone-derived phenotype and a “similar” NMF-derived phenotype with several matching diagnoses and medications. The Limestone features are listed in descending order of the probabilistic values. The similar NMF features are listed first, before listing the features in descending order based on element value. The NMF threshold was adjusted to 0.001 to maintain similarities with the Limestone-derived phenotype.

**Fig. 12**
Area under the receiver operating characteristic curve for the four feature sets while varying the number of phenotypes. The error bars denote the 95% confidence interval and the dashed lines illustrate the confidence interval using the baseline feature set.

**Fig. 13**
The top five Limestone-derived phenotypes using the control patients’ tensor.

**Fig. 14**
Limestone-derived phenotypes from the control patients’ tensor relating to hypertension.

Grants and funding

R01 HL116832/HL/NHLBI NIH HHS/United States
