Clustering datasets with demographics and diagnosis codes

Haodi Zhong¹, Grigorios Loukides², Robert Gwadera³

Affiliations

¹ Department of Informatics, King's College London, London, UK. Electronic address: haodi.zhong@kcl.ac.uk.
² Department of Informatics, King's College London, London, UK. Electronic address: grigorios.loukides@kcl.ac.uk.
³ School of Computer Science, Cardiff University, Cardiff, UK. Electronic address: GwaderaR@cs.cardiff.ac.uk.

PMID: 31904428
DOI: 10.1016/j.jbi.2019.103360

Free article

Haodi Zhong et al. J Biomed Inform. 2020 Feb.

J Biomed Inform. 2020 Feb:102:103360.

doi: 10.1016/j.jbi.2019.103360. Epub 2020 Jan 3.

Abstract

Clustering data derived from Electronic Health Record (EHR) systems is important both for discovering relationships between the clinical profiles of patients and as a preprocessing step for analysis tasks such as classification. However, the heterogeneity of these data makes existing clustering methods difficult to apply and calls for new clustering approaches. In this paper, we propose the first approach for clustering a dataset in which each record contains a patient's values in demographic attributes and their set of diagnosis codes. Our approach represents the dataset in a binary form in which the features are selected demographic values, as well as combinations (patterns) of frequent and correlated diagnosis codes. This representation enables measuring similarity between records using cosine similarity, an effective measure for binary-represented data, and finding compact, well-separated clusters through hierarchical clustering. Our experiments using two publicly available EHR datasets, comprising over 26,000 and 52,000 records, respectively, demonstrate that our approach is able to construct clusters with correlated demographics and diagnosis codes, and that it is efficient and scalable.

Keywords: Clustering; Demographics; Diagnosis codes; Pattern mining.
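
The pipeline outlined in the abstract (binary features from demographic values and diagnosis-code patterns, cosine similarity, hierarchical clustering) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the toy records, the function names, the `min_support` threshold, and the choice of average linkage are all assumptions made for illustration. The sketch also selects patterns by frequency only and uses every demographic value, whereas the paper selects demographic values and requires patterns to be correlated as well as frequent.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy records: (demographic values, set of diagnosis codes).
records = [
    ({"sex=F", "age=60-69"}, {"250.00", "401.9", "272.4"}),
    ({"sex=F", "age=60-69"}, {"250.00", "401.9"}),
    ({"sex=M", "age=30-39"}, {"493.90", "477.9"}),
    ({"sex=M", "age=30-39"}, {"493.90", "477.9", "401.9"}),
]

def frequent_patterns(records, min_support=2, max_size=2):
    """Code combinations contained in at least min_support records.
    Assumption: frequency only; no correlation measure is applied."""
    counts = {}
    for _, codes in records:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(codes), size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    kept = [p for p, c in counts.items() if c >= min_support]
    return sorted(kept, key=lambda p: (len(p), sorted(p)))

def binarize(records, patterns):
    """One 0/1 row per record: demographic values first, then patterns."""
    demo_values = sorted(set().union(*(d for d, _ in records)))
    rows = []
    for demo, codes in records:
        row = [1 if v in demo else 0 for v in demo_values]
        row += [1 if p <= codes else 0 for p in patterns]
        rows.append(row)
    return rows

def cosine(u, v):
    """Cosine similarity between two binary vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def agglomerate(rows, k):
    """Naive agglomerative clustering down to k clusters.
    Assumption: average linkage on cosine similarity."""
    clusters = [[i] for i in range(len(rows))]

    def avg_sim(c1, c2):
        total = sum(cosine(rows[i], rows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: avg_sim(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

patterns = frequent_patterns(records)
rows = binarize(records, patterns)
clusters = agglomerate(rows, k=2)
print(clusters)
```

On the toy data, the two records sharing the same demographic values and code patterns end up in the same cluster. The pairwise loop is quadratic or worse in the number of records, so a sketch like this says nothing about the efficiency and scalability reported in the abstract.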

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
