Bioinformatics. 2023 Aug 1;39(8):btad501.
doi: 10.1093/bioinformatics/btad501.

Anomaly detection in mixed high-dimensional molecular data

Lena Buck et al. Bioinformatics.

Abstract

Motivation: Mixed molecular data combine continuous and categorical features of the same samples, such as OMICS profiles with genotypes, diagnoses, or patient sex. Like all high-dimensional molecular data, they are prone to incorrect values that can stem from various sources, for example, technical limitations of the measurement devices, errors in sample preparation, or contamination. Most anomaly detection algorithms identify complete samples as outliers or anomalies. In most cases, however, not all measurements of those samples are erroneous; only a few one-dimensional features within the samples are incorrect. These one-dimensional data errors are continuous measurements that lie either outside or inside the normal ranges of their features but, in both cases, show atypical values given all other continuous and categorical features in the sample. Additionally, categorical anomalies can occur, for example, when a genotype or diagnosis was submitted incorrectly.

Results: We introduce ADMIRE (Anomaly Detection using MIxed gRaphical modEls), a novel approach for the detection and correction of anomalies in mixed high-dimensional data. We focus on the detection of single (one-dimensional) data errors in the categorical and continuous features of a sample. To this end, the joint distribution of continuous and categorical features is learned by mixed graphical models; anomalies are detected from the difference between measured values and model-based estimates and are corrected using imputation. We evaluated ADMIRE in simulations and by screening for anomalies in one of our own metabolic datasets. In simulation experiments, ADMIRE outperformed the state-of-the-art methods Local Outlier Factor, stray, and Isolation Forest.
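
The detect-then-impute idea described above can be illustrated with a deliberately simplified sketch. It is not the ADMIRE implementation and does not use the adadmire package: it assumes continuous features only, replaces the mixed graphical model with an unregularized multivariate Gaussian, and omits any cross-validation or threshold selection. All function names (`invert`, `fit_gaussian`, `conditional_estimate`, `rank_cells`) and the toy data are hypothetical. Each cell is scored by the gap between its measured value and its conditional expectation given the other features of the same sample, and the top-ranked cell is replaced by that expectation.

```python
import random

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_gaussian(data):
    """Return the feature means and the precision (inverse covariance) matrix."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    cov = [[sum((row[j] - mean[j]) * (row[k] - mean[k]) for row in data) / (n - 1)
            for k in range(d)] for j in range(d)]
    return mean, invert(cov)

def conditional_estimate(x, j, mean, prec):
    """Gaussian conditional mean of feature j given all other features of sample x."""
    s = sum(prec[j][k] * (x[k] - mean[k]) for k in range(len(x)) if k != j)
    return mean[j] - s / prec[j][j]

def rank_cells(data, mean, prec):
    """Score every cell by |measured - estimated| in units of the conditional sd."""
    scored = []
    for i, x in enumerate(data):
        for j in range(len(x)):
            est = conditional_estimate(x, j, mean, prec)
            sd = (1.0 / prec[j][j]) ** 0.5
            scored.append((abs(x[j] - est) / sd, i, j, est))
    return sorted(scored, reverse=True)

# Toy data (assumption): three features driven by one latent factor z.
random.seed(0)
data = []
for _ in range(200):
    z = random.gauss(0.0, 1.0)
    data.append([z + 0.1 * random.gauss(0.0, 1.0),
                 z + 0.1 * random.gauss(0.0, 1.0),
                 -z + 0.1 * random.gauss(0.0, 1.0)])

# Inject one anomaly: feature 1 of sample 17 stays inside its marginal range
# but contradicts features 0 and 2 of the same sample.
data[17] = [1.0, -1.0, -1.0]

mean, prec = fit_gaussian(data)
score, sample, feature, estimate = rank_cells(data, mean, prec)[0]

# Correct the flagged cell by imputing the model-based estimate.
corrected = [row[:] for row in data]
corrected[sample][feature] = estimate
print(f"flagged sample {sample}, feature {feature}; imputed value {estimate:.2f}")
```

The injected value of -1.0 would not stand out in a per-feature range check, since it lies well inside the spread of feature 1; it is flagged only because it disagrees with what the other two features of the same sample predict. A sample-level outlier detector would report sample 17 as a whole, whereas this cell-wise scoring points to the single feature at fault.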

Availability and implementation: All data and code are available at https://github.com/spang-lab/adadmire. ADMIRE is implemented in a Python package called adadmire, which can be found at https://pypi.org/project/adadmire.

Conflict of interest statement

None declared.

Figures

Figure 1.
A mixed graphical model. The nodes include both continuous features (X1, ..., X5) and discrete features (Y1 and Y2). A missing edge between two nodes denotes their conditional independence given all other variables. The node and edge weights correspond to the couplings and potentials in equation (1).
Figure 2.
Observed and random scores for the dataset containing artificial discrete anomalies, and estimated probabilities for the categorical variables, split into their respective binary states across the corresponding samples. (A) Highest-ranking observed and random scores; artificial anomalies are marked in red and the threshold in green. (B) Estimated probabilities for behavior (C/S or S/C). (C) Estimated probabilities for genotype (control/trisomic). (D) Estimated probabilities for treatment (Memantine or Saline).
Figure 3.
Influence of the parameter ϵ on the strength of the anomalies in protein pNR2A_N. Black dots indicate introduced anomalies.
Figure 4.
Precision–Recall curves for the simulations with 2.5% and 5% contamination. (A) PR curves of ADMIRE on log-transformed data without correcting for intrinsic outliers. (B) PR curves of ADMIRE on log-transformed simulations corrected for intrinsic outliers.
Figure 5.
(A) Scaled, originally measured concentrations of sample 7 (red) together with all other samples in the same MYC group (green); detected anomalies are marked as black diamonds. The features (metabolites) on the x-axis are ordered according to the different quantification methods. (B) Scaled, originally measured concentrations of sample 92 (red) together with all other samples in the same MYC group (green); detected anomalies are marked as black diamonds. The features (metabolites) on the x-axis are ordered according to the different quantification methods.
