Anomaly detection in mixed high-dimensional molecular data

Affiliations

¹ Department of Statistical Bioinformatics, University of Regensburg, 93040 Regensburg, Germany.
² Department of Hematology and Medical Oncology, University Medicine Göttingen, 37075 Göttingen, Germany.
³ Institute of Functional Genomics, University of Regensburg, 93040 Regensburg, Germany.
⁴ Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover Medical School, 30625 Hannover, Germany.
⁵ Department of Medical Bioinformatics, University Medical Center Göttingen, 37075 Göttingen, Germany.

PMID: 37584673
PMCID: PMC10457663
DOI: 10.1093/bioinformatics/btad501

Lena Buck et al. Bioinformatics. 2023 Aug 1;39(8):btad501. doi: 10.1093/bioinformatics/btad501.

Abstract

Motivation: Mixed molecular data combines continuous and categorical features of the same samples, such as OMICS profiles with genotypes, diagnoses, or patient sex. Like all high-dimensional molecular data, it is prone to incorrect values that can stem from various sources, for example technical limitations of the measurement devices, errors in sample preparation, or contamination. Most anomaly detection algorithms identify complete samples as outliers or anomalies. In most cases, however, not all measurements of those samples are erroneous; only a few one-dimensional features within the samples are incorrect. These one-dimensional data errors are continuous measurements that lie either outside or inside the normal range of their feature but, in both cases, show atypical values given all other continuous and categorical features of the sample. Categorical anomalies can also occur, for example when a genotype or diagnosis was submitted incorrectly.

Results: We introduce ADMIRE (Anomaly Detection using MIxed gRaphical modEls), a novel approach for the detection and correction of anomalies in mixed high-dimensional data. We focus on the detection of single (one-dimensional) data errors in the categorical and continuous features of a sample. To this end, the joint distribution of continuous and categorical features is learned by mixed graphical models; anomalies are detected from the difference between measured values and model-based estimates and are corrected using imputation. We evaluated ADMIRE in simulation and by screening for anomalies in one of our own metabolic datasets. In simulation experiments, ADMIRE outperformed the state-of-the-art methods Local Outlier Factor, stray, and Isolation Forest.
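
The idea of detecting a single atypical value from the gap between a measurement and a model-based estimate, and then correcting it by imputation, can be illustrated with a deliberately simplified sketch. This is not the ADMIRE algorithm and does not use the adadmire API: the mixed graphical model is replaced here by a toy stand-in (the leave-one-out mean of one continuous feature within each category), and the function name `detect_and_correct`, the threshold of 3, and the data are all hypothetical.

```python
from statistics import mean, stdev


def detect_and_correct(values, groups, threshold=3.0):
    """Flag values that are atypical for their category and impute them.

    Toy stand-in for a model-based estimate: the mean of all *other*
    samples that share the same categorical label (leave-one-out).
    Returns (flagged_indices, corrected_values).
    """
    flagged = []
    corrected = list(values)
    for i, (x, g) in enumerate(zip(values, groups)):
        # Peers: every other sample with the same categorical label.
        peers = [v for j, (v, h) in enumerate(zip(values, groups))
                 if h == g and j != i]
        if len(peers) < 2:
            continue  # too few peers to estimate a spread
        estimate, spread = mean(peers), stdev(peers)
        # Score: deviation of the measurement from the estimate,
        # expressed in units of the peers' standard deviation.
        score = abs(x - estimate) / spread if spread > 0 else 0.0
        if score > threshold:
            flagged.append(i)
            corrected[i] = estimate  # correct by imputing the estimate
    return flagged, corrected


# Hypothetical feature that is low in category "A" and high in "B".
# Sample 4 carries the value 5.0 with label "A": it lies inside the
# overall range of the feature but is atypical given its category.
values = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2, 5.0, 0.8]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "A"]
flagged, corrected = detect_and_correct(values, groups)
print(flagged, round(corrected[4], 3))
```

In this toy data a detector that looks only at the marginal range of the feature would miss sample 4, whereas conditioning on the categorical label exposes it. A mixed graphical model generalizes this by conditioning each feature on all other continuous and categorical features at once.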

Availability and implementation: All data and code are available at https://github.com/spang-lab/adadmire. ADMIRE is implemented in a Python package called adadmire, which can be found at https://pypi.org/project/adadmire.

Conflict of interest statement

None declared.

Figures

**Figure 1.**
A mixed graphical model. The nodes include both continuous features ($X_{1}, \ldots, X_{5}$) and discrete features ($Y_{1}$ and $Y_{2}$). A missing edge between two nodes denotes their conditional independence given all other variables. The node and edge weights correspond to the couplings and potentials in equation (1).

**Figure 2.**
Observed and random scores for the dataset containing artificial discrete anomalies, and estimated probabilities for the categorical variables, split into their respective binary states across the corresponding samples. (A) Highest-ranking observed and random scores; artificial anomalies are marked in red and the threshold in green. (B) Estimated probabilities for behavior (C/S or S/C). (C) Estimated probabilities for genotype (control/trisomic). (D) Estimated probabilities for treatment (Memantine or Saline).

**Figure 3.**
Influence of the parameter ϵ on the strength of the anomalies in protein pNR2A_N. Black dots indicate introduced anomalies.

**Figure 4.**
Precision–Recall curves for the simulations with 2.5% and 5% contamination. (A) PR curves of ADMIRE on log-transformed data without correcting for intrinsic outliers. (B) PR curves of ADMIRE on log-transformed simulations corrected for intrinsic outliers.

**Figure 5.**
(A) Scaled, originally measured concentrations of sample 7 (red) with all other samples in the same MYC group (green); detected anomalies are marked as black diamonds. The features (metabolites) on the x-axis are ordered according to the different quantification methods. (B) Scaled, originally measured concentrations of sample 92 (red) with all other samples in the same MYC group (green); detected anomalies are marked as black diamonds. The features (metabolites) on the x-axis are ordered according to the different quantification methods.
