J Comput Biol. 2023 Jun;30(6):726-735.
doi: 10.1089/cmb.2022.0243. Epub 2023 Apr 12.

EnsMOD: A Software Program for Omics Sample Outlier Detection

Nathan P Manes et al. J Comput Biol. 2023 Jun.

Abstract

Detection of omics sample outliers is important for preventing erroneous biological conclusions, developing robust experimental protocols, and discovering rare biological states. Two recent publications describe robust algorithms for detecting transcriptomic sample outliers, but neither algorithm had been incorporated into a software tool for scientists. Here we describe Ensemble Methods for Outlier Detection (EnsMOD) which incorporates both algorithms. EnsMOD calculates how closely the quantitation variation follows a normal distribution, plots the density curves of each sample to visualize anomalies, performs hierarchical cluster analyses to calculate how closely the samples cluster with each other, and performs robust principal component analyses to statistically test if any sample is an outlier. The probabilistic threshold parameters can be easily adjusted to tighten or loosen the outlier detection stringency. EnsMOD can be used to analyze any omics dataset with normally distributed variance. Here it was used to analyze a simulated proteomics dataset, a multiomic (proteome and transcriptome) dataset, a single-cell proteomics dataset, and a phosphoproteomics dataset. EnsMOD successfully identified all of the simulated outliers, and subsequent removal of a detected outlier improved data quality for downstream statistical analyses.
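The normality check described above can be illustrated with a minimal sketch. It is not EnsMOD's implementation: it assumes the fit is scored as the squared Pearson correlation between a Gaussian kernel density estimate of the standardized values and the standard normal density, and the bandwidth rule, the grid, and all function names are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    kernel = NormalDist(0.0, bandwidth)
    n = len(values)
    return [sum(kernel.pdf(x - v) for v in values) / n for x in grid]


def r_squared(observed, expected):
    """Squared Pearson correlation between two equally long curves."""
    mo, me = mean(observed), mean(expected)
    cov = sum((o - mo) * (e - me) for o, e in zip(observed, expected))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_e = sum((e - me) ** 2 for e in expected)
    return cov * cov / (var_o * var_e)


def normality_r2(values, grid_points=101):
    """Score how closely the standardized values follow N(0, 1)."""
    z = standardize(values)
    # Normal-reference rule of thumb (assumption); z has unit std. dev.
    bandwidth = 1.06 * len(z) ** (-1 / 5)
    # Evaluate both curves on an evenly spaced grid from -4 to 4.
    grid = [-4 + 8 * i / (grid_points - 1) for i in range(grid_points)]
    empirical = gaussian_kde(z, grid, bandwidth)
    reference = [NormalDist().pdf(x) for x in grid]
    return r_squared(empirical, reference)


# Synthetic data only: one roughly normal set, one strongly skewed set.
rng = random.Random(0)
normal_like = [rng.gauss(0, 1) for _ in range(500)]
skewed = [rng.expovariate(1.0) for _ in range(500)]

r2_normal = normality_r2(normal_like)
r2_skewed = normality_r2(skewed)
print(f"R^2 normal-like: {r2_normal:.3f}, skewed: {r2_skewed:.3f}")
```

A score near 1 suggests the variation is close to normal. A clearly lower score suggests the dataset may not meet the normality assumption stated in the abstract.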

Keywords: hierarchical cluster analysis; multivariate; omics; outlier detection; proteomics; robust principal component analysis.

Conflict of interest statement

The authors declare they have no conflicting financial interests.

Figures

FIG. 1.
Outlier detection from analyzing the anthrax phosphoproteomics dataset. Mice were not injected at all, or were injected with vehicle alone, or with either the Sterne or the ΔSterne strain of Bacillus anthracis (A). Fifteen mice were prepared for the Sterne 72 hours experimental condition, but only one reached 72 hours. Spleens were analyzed using LC-MS phosphoproteomics (Manes et al., 2011). Empirical histogram and density curve (black) plotted against the standard normal distribution (red) (B). The R² value was calculated between the empirical density curve and the standard normal distribution. Density curve of the phosphopeptide abundance values for each sample (C). Dendrogram of the samples from the HCA with the largest CCC (D). Silhouette coefficients for each sample (black line is the cutoff, red dashed line is the mean calculated across all of the samples) (E). Distance–Distance plots from the robpca and PcaGrid analyses (F, G). Venn diagram of the differentially abundant phosphopeptides discovered using one-way ANOVAs (q-value ≤0.05), with (left circle) and without (right circle) including the genuine sample outlier (SplnC0_0003) (H). CCC, cophenetic correlation coefficient; HCA, hierarchical cluster analysis; LC-MS, liquid chromatography–mass spectrometry.
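The per-sample silhouette coefficients in panel E can be sketched as follows. This is an illustration under stated assumptions, not EnsMOD's code: it uses Euclidean distances, takes the cluster labels as given rather than deriving them from a dendrogram, and the sample names, coordinates, and cutoff of 0.25 are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_coefficients(points, labels):
    """Return one silhouette coefficient per sample.

    For sample i: a = mean distance to the other members of its own
    cluster, b = smallest mean distance to any other cluster, and
    s = (b - a) / max(a, b). Requires at least two clusters.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(euclidean(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != lab_i
        )
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical toy data: two tight groups plus one sample (S8) that is
# labeled as group "A" but sits far from the other "A" samples.
names = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1),  # group A
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1),              # group B
    (3.0, 3.0),                                      # S8, labeled A
]
labels = ["A", "A", "A", "A", "B", "B", "B", "A"]

CUTOFF = 0.25  # hypothetical threshold, not EnsMOD's default
scores = silhouette_coefficients(points, labels)
flagged = [n for n, s in zip(names, scores) if s < CUTOFF]
mean_score = sum(scores) / len(scores)
print("flagged:", flagged, "mean silhouette:", round(mean_score, 3))
```

Samples that fall below the cutoff cluster poorly with their assigned group, which makes them candidate outliers for closer inspection.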
