EnsMOD: A Software Program for Omics Sample Outlier Detection

Nathan P Manes¹, Jian Song¹, Aleksandra Nita-Lazar¹

PMID: 37042708
PMCID: PMC10282819
DOI: 10.1089/cmb.2022.0243

Nathan P Manes et al. J Comput Biol. 2023 Jun;30(6):726-735. doi: 10.1089/cmb.2022.0243. Epub 2023 Apr 12.

Affiliation

¹ Laboratory of Immune System Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA.

Abstract

Detection of omics sample outliers is important for preventing erroneous biological conclusions, developing robust experimental protocols, and discovering rare biological states. Two recent publications describe robust algorithms for detecting transcriptomic sample outliers, but neither algorithm had been incorporated into a software tool for scientists. Here we describe Ensemble Methods for Outlier Detection (EnsMOD) which incorporates both algorithms. EnsMOD calculates how closely the quantitation variation follows a normal distribution, plots the density curves of each sample to visualize anomalies, performs hierarchical cluster analyses to calculate how closely the samples cluster with each other, and performs robust principal component analyses to statistically test if any sample is an outlier. The probabilistic threshold parameters can be easily adjusted to tighten or loosen the outlier detection stringency. EnsMOD can be used to analyze any omics dataset with normally distributed variance. Here it was used to analyze a simulated proteomics dataset, a multiomic (proteome and transcriptome) dataset, a single-cell proteomics dataset, and a phosphoproteomics dataset. EnsMOD successfully identified all of the simulated outliers, and subsequent removal of a detected outlier improved data quality for downstream statistical analyses.

Keywords: hierarchical cluster analysis; multivariate; omics; outlier detection; proteomics; robust principal component analysis.
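The first check described in the abstract, how closely the quantitation values follow a normal distribution, can be illustrated with a small sketch. This is not EnsMOD's implementation: the function name `normal_fit_r2`, the use of a histogram in place of a kernel density curve, the bin settings, and the R² definition (coefficient of determination, with the standard normal density as the model) are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def normal_fit_r2(values, bins=30, lo=-4.0, hi=4.0):
    """R^2 between a histogram density of z-scored values and the
    standard normal density (hypothetical helper, not EnsMOD code)."""
    mu, sd = mean(values), stdev(values)
    z = [(v - mu) / sd for v in values]  # standardize to mean 0, sd 1

    # Histogram density on [lo, hi); values outside the range are ignored.
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in z:
        if lo <= x < hi:
            counts[min(int((x - lo) / width), bins - 1)] += 1
    empirical = [c / (len(z) * width) for c in counts]

    # Standard normal density evaluated at the bin centers.
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    expected = [NormalDist().pdf(c) for c in centers]

    # Coefficient of determination of the normal model for the empirical curve.
    emp_mean = mean(empirical)
    ss_res = sum((e - t) ** 2 for e, t in zip(empirical, expected))
    ss_tot = sum((e - emp_mean) ** 2 for e in empirical)
    return 1.0 - ss_res / ss_tot


# Simulated abundance values (made-up numbers, for illustration only).
rng = random.Random(0)
normal_like = [rng.gauss(20.0, 2.0) for _ in range(5000)]
bimodal = ([rng.gauss(17.0, 0.5) for _ in range(2500)]
           + [rng.gauss(23.0, 0.5) for _ in range(2500)])

r2_normal = normal_fit_r2(normal_like)
r2_bimodal = normal_fit_r2(bimodal)
print(f"normal-like R^2: {r2_normal:.3f}")
print(f"bimodal R^2:     {r2_bimodal:.3f}")
```

A value near 1 suggests the data meet the normality assumption stated in the abstract, while a clearly lower value (as for the bimodal example) suggests the dataset may need transformation before outlier detection is meaningful.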

Conflict of interest statement

The authors declare they have no conflicting financial interests.

Figures

**FIG. 1.**
Outlier detection from analyzing the anthrax phosphoproteomics dataset. Mice were not injected at all, or were injected with vehicle alone, or with either the Sterne or the ΔSterne strain of *Bacillus anthracis* **(A)**. Fifteen mice were prepared for the Sterne 72 hours experimental condition, but only one reached 72 hours. Spleens were analyzed using LC-MS phosphoproteomics (Manes et al., 2011). Empirical histogram and density curve (black) plotted against the standard normal distribution (red) **(B)**. The R² value was calculated between the empirical density curve and the standard normal distribution. Density curve of the phosphopeptide abundance values for each sample **(C)**. Dendrogram of the samples from the HCA with the largest CCC **(D)**. Silhouette coefficients for each sample (black line is the cutoff, red dashed line is the mean calculated across all of the samples) **(E)**. Distance–Distance plots from the *robpca* and *PcaGrid* analyses **(F, G)**. Venn diagram of the differentially abundant phosphopeptides discovered using one-way ANOVAs (q-value ≤0.05), with (left circle) and without (right circle) including the genuine sample outlier (SplnC0_0003) **(H)**. CCC, cophenetic correlation coefficient; HCA, hierarchical cluster analysis; LC-MS, liquid chromatography–mass spectrometry.
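Panel E of the figure reports a silhouette coefficient for each sample together with a cutoff. As a minimal sketch of that idea, the code below computes per-sample silhouette coefficients from Euclidean distances and flags samples that fall below a cutoff. The function name `silhouette_per_sample`, the toy coordinates, the cluster labels, and the cutoff value of 0.25 are assumptions for illustration; the source does not state the cutoff or distance metric EnsMOD uses.

```python
import math


def silhouette_per_sample(points, labels):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for each sample,
    where a is the mean distance to the sample's own cluster and b is the
    smallest mean distance to any other cluster. Requires at least two
    clusters. (Hypothetical helper, not EnsMOD code.)"""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


# Toy data: two tight groups of four samples plus one stray sample (index 8)
# that was assigned to the first group.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (6, 6)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0]

cutoff = 0.25  # assumed threshold, chosen only for this example
scores = silhouette_per_sample(points, labels)
flagged = [i for i, s in enumerate(scores) if s < cutoff]
print("silhouette coefficients:", [round(s, 2) for s in scores])
print("samples below cutoff:", flagged)
```

In this toy example only the stray sample falls below the cutoff, because it sits closer on average to the other cluster than to its own. Raising or lowering `cutoff` mirrors the adjustable stringency described in the abstract.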

