Pattern discovery and disentanglement on relational datasets

Andrew K C Wong¹, Pei-Yuan Zhou², Zahid A Butt³

Affiliations

¹ Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada.
² Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada. p44zhou@uwaterloo.ca.
³ School of Public Health and Health Systems, University of Waterloo, Waterloo, ON, Canada.

PMID: 33707478
PMCID: PMC7952710
DOI: 10.1038/s41598-021-84869-4

Andrew K C Wong et al. Sci Rep. 2021 Mar 11;11(1):5688. doi: 10.1038/s41598-021-84869-4.

Abstract

Machine learning has made impressive advances in many applications akin to human cognition for discernment. However, success has been limited on relational datasets, particularly those with low volume, imbalanced groups, and mislabeled cases, and the outputs typically lack transparency and interpretability. The difficulties arise from the subtle overlapping and entanglement of functional and statistical relations at the source level. Hence, we have developed the Pattern Discovery and Disentanglement (PDD) system, which can discover explicit patterns from data of various sizes and with imbalanced groups, and can screen out anomalies. We present four case studies on biomedical datasets to substantiate the efficacy of PDD. It improves prediction accuracy and facilitates transparent interpretation of the discovered knowledge in an explicit representation framework, the PDD Knowledge Base, which links the sources, the patterns, and individual patients. Hence, PDD promises broad and ground-breaking applications in genomic and biomedical machine learning.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Figure 1**
Overview of PDD. The figure describes the key ideas of the new paradigm and the algorithmic steps of PDD.

**Figure 2**
Pattern discovery and disentanglement experiment on an imbalanced APC dataset. (a) AVs and patterns discovered from different classes by a traditional pattern-mining algorithm (Apriori) are entangled, as shown in shaded grey. (b) The summarized and comprehensive patterns discovered by PDD reside in distinct DSs associated with distinct taxonomic groups or source environments. The small “Insect” group with pattern [S72 = F, S96 = N] is found in DS3. (c) Running PDD on the same data without class labels produces almost identical results, indicating that PDD does not need prior knowledge to differentiate taxonomic classes in this case (see Supplement 2).

**Figure 3**
Result of pattern clustering and entity clustering on an APC representing a functional region of a protein family. (a) An APC obtained from a protein family. (b) Pattern clusters from an APC of Class A Scavenger Receptors. Patterns shown in different color shades are associated with 5 distinct classes. While K-Means could not separate *Marco* from *Scara5* and *Sra* in the collagenous domain, PDD separated *Marco* from *Scara5* in DS3 (DSU[5 1 1] and DSU[5 2 1] respectively). (c) Clustering scores of PDD and K-Means. PDD results are far superior to those of K-Means.

**Figure 4**
PDD knowledge base (PDDKB) for the Wisconsin breast cancer dataset. (a) The inserted patterns for two groups of rare cases. Data quantization put each AV with small variation into the same interval. (b) Summary PDDKB. In the DSs, each DS Unit (DSU) (such as DSU[1 1 2] on the second row) represents SubPG2 of PG1 in DS1. The summary patterns summarize all the AV clusters/patterns listed in the DSU in the Comprehensive PDDKB. For instance, the AVs in DSU[1 1 2] represent the union of all AV clusters (or patterns) found in that unit in the Comprehensive PDDKB. (c) Comprehensive PDDKB. Each pattern in a DSU links to a list of individual entities (denoted by ‘1’) in the column representing an entity with EID and class label (if given). In the Summary KB, the numeral in each column (such as the 8 associated with E37) denotes the number of patterns/pattern clusters discovered from DSU[1 1 2] for that entity. In the Comprehensive KB, in the same column, a “1” is displayed on each row containing a specific AV cluster (or pattern) that the entity possesses.
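
The relation between the two views in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the entity IDs, attribute values, and the `summarize` helper are all hypothetical, and the table holds invented toy data for a single DSU. It only shows how a summary row (union of AVs, per-entity pattern counts) could be derived from comprehensive rows of 0/1 flags.

```python
# Toy comprehensive table for one DSU: one row per pattern, one column per
# entity; 1 means the entity possesses that pattern. All values are invented.
entities = ["E1", "E2", "E3"]
comprehensive = {
    ("A1=low", "A2=high"): [1, 0, 1],
    ("A2=high", "A3=mid"): [1, 0, 0],
    ("A1=low", "A3=mid"): [1, 1, 0],
}


def summarize(comprehensive, entities):
    """Collapse pattern rows into the union of AVs and per-entity pattern counts."""
    counts = {entity: 0 for entity in entities}
    union_avs = set()
    for pattern, row in comprehensive.items():
        union_avs.update(pattern)          # summary AVs: union over all patterns
        for entity, flag in zip(entities, row):
            counts[entity] += flag         # summary numeral: patterns per entity
    return sorted(union_avs), counts


summary_avs, summary_counts = summarize(comprehensive, entities)
print(summary_avs)
print(summary_counts)
```

In this toy table, E1 possesses all three patterns, so its summary numeral would be 3, in the same way the caption reports 8 for E37.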

**Figure 5**
Supervised classification results of PDD, SVM and ANN on the heart disease dataset. (a) Summary PDDKB and Comprehensive PDDKB were obtained. The blue blocks partition each into Disentangled, Pattern and Entity Spaces. The mislabeled entities E122 and E131 were discovered and displayed in the Entity Space since they were labeled as “Absence” but possessed patterns pertaining to “Presence”. (b) Attribute descriptions of the Heart Disease Dataset. (c) Comparative classification rates of PDD and two other existing ML models (80% of the data for each class was selected randomly as training data and the remaining 20% as testing data, by tenfold validation with variance). After anomaly removal, the classification results of all three models improved by approximately 10%. Such improvement cannot be realized without the PDD anomaly removal and ground truth rectification process. (d) Entity clustering result showing mislabeled entities. In this case all anomalies were found in the “Absence” group and none in the “Presence” group.
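
The per-class 80/20 split described in panel (c) can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: the caption's "tenfold validation" is read here as ten repeated random splits, the `stratified_split` helper is hypothetical, and the label counts are invented.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Randomly split indices so each class contributes train_fraction to training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = int(round(train_fraction * len(indices)))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Invented, imbalanced toy labels; the real dataset sizes are not assumed here.
labels = ["Absence"] * 10 + ["Presence"] * 5

# Assumption: ten repetitions with different seeds stand in for "tenfold".
splits = [stratified_split(labels, seed=s) for s in range(10)]
train, test = splits[0]
print(len(train), len(test))
```

Splitting within each class keeps the minority group represented in both the training and testing portions, which matters for the imbalanced data the abstract targets.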
