Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Mar 11;11(1):5688.
doi: 10.1038/s41598-021-84869-4.

Pattern discovery and disentanglement on relational datasets

Affiliations

Pattern discovery and disentanglement on relational datasets

Andrew K C Wong et al. Sci Rep. .

Abstract

Machine Learning has made impressive advances in many applications akin to human cognition for discernment. However, success has been limited in the areas of relational datasets, particularly for data with low volume, imbalanced groups, and mislabeled cases, with outputs that typically lack transparency and interpretability. The difficulties arise from the subtle overlapping and entanglement of functional and statistical relations at the source level. Hence, we have developed Pattern Discovery and Disentanglement System (PDD), which is able to discover explicit patterns from the data with various sizes, imbalanced groups, and screen out anomalies. We present herein four case studies on biomedical datasets to substantiate the efficacy of PDD. It improves prediction accuracy and facilitates transparent interpretation of discovered knowledge in an explicit representation framework PDD Knowledge Base that links the sources, the patterns, and individual patients. Hence, PDD promises broad and ground-breaking applications in genomic and biomedical machine learning.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Figure 1
Overview of PDD. The figure describes the key ideas of the new paradigm and the algorithmic steps of PDD.
Figure 2
Figure 2
Pattern discovery and disentanglement experiment on an imbalanced APC dataset. (a) AVs and patterns discovered by traditional Pattern-Mining Algorithm (Apriori) from different classes are entangled as shown in shaded grey. (b) The summarized and comprehensive patterns discovered by PDD reside in distinct DSs associated with distinct taxonomic groups or source environments. The small “Insect” group with pattern [S72 = F, S96 = N] is found in DS3. (c) The results of PDD on the same set of data without class labels given produces almost identical results, indicating that PDD does not need prior knowledge to differentiate taxonomic classes in this case (see Supplement 2).
Figure 3
Figure 3
Result of pattern clustering and entity clustering on an APC representing a functional region of a protein family. (a) An APC obtained from a protein family. (b) Pattern clusters from an APC of Class A Scavenger Receptors. Patterns shown in different color shades are associated with 5 distinct classes. While K-Means could not separate Marco from Scara5 and Sra in the collagenous domain, PDD separated Marco from Scara5 in DS3 (DSU[5 1 1] and DSU[5 2 1] respectively). (c) Clustering scores of PDD and K-Means. PDD results are far superior to those of K-Means.
Figure 4
Figure 4
PDD knowledge base (PDDKB) for Wisconsin breast cancer dataset. (a) The inserted patterns for two groups of rare cases. Data quantization put each AV with small variation into the same interval. (b) Summary PDDKB. In the DSs, each DS Unit (DSU) (such as DSU[1 1 2] on the second row) represents SubPG2 of PG1 in DS1. The summary patterns summarize all the AV-Clusters/Patterns listed in the DSU in the Comprehensive PDDKB. For instance, the AVs in DSU[1 1 2] represent the union of all AV clusters (or patterns) found in that unit in the Comprehensive PDDKB. (c) Comprehensive PDDKB. Each pattern in a DSU links to a list of individual entities (denoted by ‘1’) in the column representing an entity with EID and class label (if given). In the Summary KB, the numeral on each column (like 8 associating with E37) denotes the number of patterns/pattern-clusters discovered from the DSU[1 1 2] for that entity. In the Comprehensive KB, on the same column, a numeral of “1” is displayed on the row containing a special AV cluster (or pattern) that the entity possesses.
Figure 5
Figure 5
Supervised classification results of PDD, SVM and ANN on heart disease dataset. (a) Summary PDDKB and Comprehensive PDDKB were obtained. The blue blocks partition each into Disentangled, Pattern and Entity Spaces. The mislabeled entities E122 and E131 were discovered and displayed in the Entity Space since they were labeled as “Absence” but possessed patterns pertaining to “Presence”. (b) Attributes description of the Heart Disease Dataset. (c) Comparative rate of classification (with 80% of data for each class was selected randomly as training data and the rest (20%) as testing data by tenfold validation with variance) of PDD and other two existing ML models. After anomaly removal, the classification results of all the three models were improved approximately 10%. Such improvement cannot be realized without PDD anomaly removal and ground truth rectification process. (d) Entity Clustering Result showing mislabeled entities. In this case all anomalies were found among the “Absence” group but none in the “Presence” group.

Similar articles

Cited by

References

    1. Voosen, P. How AI detectives are cracking open the black box of deep learning. Science. https://www.sciencemag.org/news/2017/07/how-ai-detectives-are-cracking-o... (2017).
    1. Topol EJ. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019;25(1):44–56. doi: 10.1038/s41591-018-0300-7. - DOI - PubMed
    1. Samek, W., Wiegand, T. & Müller, K. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint, arXiv:1708.08296 (2017).
    1. Aggarwal, C. & Sathe, S. Bias reduction in outlier ensembles: the guessing game. In Outlier Ensembles (Springer, 2017). 10.1007/978-3-319-54765-7_4
    1. Napierala K, Stefanowski J. Types of minority class examples and their influence on learning classifiers from imbalanced data. J. Intell. Inf. Syst. 2016;46(3):563–597. doi: 10.1007/s10844-015-0368-1. - DOI

Publication types

LinkOut - more resources