BMC Bioinformatics. 2020 Aug 14;21(1):357.

doi: 10.1186/s12859-020-03653-9.

Comparison of methods for the detection of outliers and associated biomarkers in mislabeled omics data

Hongwei Sun¹ ², Yuehua Cui³, Hui Wang¹, Haixia Liu², Tong Wang⁴

Affiliations

¹ Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, 030001, Shanxi, China.
² Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, Yantai City, 264003, Shandong, China.
³ Department of Statistics and Probability, Michigan State University, East Lansing, MI, 48824, USA.
⁴ Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, 030001, Shanxi, China. tongwang@sxmu.edu.cn.

PMID: 32795265
PMCID: PMC7646480
DOI: 10.1186/s12859-020-03653-9

Hongwei Sun et al. BMC Bioinformatics. 2020.


Abstract

Background: Previous studies have reported that labeling errors are not uncommon in omics data. Potential outliers may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Three methods have been proposed to address the problem: sparse label-noise-robust logistic regression (Rlogreg), robust elastic net based on the least trimmed square (enetLTS), and Ensemble. Ensemble is an ensemble classification method that combines distinct feature selection and modeling strategies. The accuracy of biomarker selection and outlier detection of these methods needs to be evaluated and compared so that the appropriate method can be chosen.

Results: The accuracy of variable selection, outlier identification, and prediction of three methods (Ensemble, enetLTS, Rlogreg) was compared on simulated datasets and an RNA-seq dataset. On simulated datasets, Ensemble had the highest variable selection accuracy, as measured by a comprehensive index, and the lowest false discovery rate among the three methods. When the sample size was large and the proportion of outliers was ≤5%, the positive selection rate of Ensemble was similar to that of enetLTS. However, when the proportion of outliers was 10% or 15%, Ensemble missed some variables that affected the response variables. Overall, enetLTS had the best outlier detection accuracy, with false positive rates <0.05 and high sensitivity, and it still performed well when the proportion of outliers was relatively large. With 1% or 2% outliers, Ensemble showed high outlier detection accuracy, but with higher proportions of outliers Ensemble missed many mislabeled samples. Rlogreg and Ensemble were less accurate in identifying outliers than enetLTS. The prediction accuracy of enetLTS was better than that of Rlogreg. Running Ensemble on a subset of data after removing the outliers identified by enetLTS improved the variable selection accuracy of Ensemble.
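The abstract does not define the evaluation measures it reports, so the sketch below assumes the usual definitions: positive selection rate (PSR) as the share of truly relevant variables that were selected, false discovery rate (FDR) as the share of selected variables that are not relevant, sensitivity (Sn) as the share of mislabeled samples that were flagged, and false positive rate (FPR) as the share of correctly labeled samples that were flagged. The function names and the toy indices are illustrative only and are not taken from the paper.

```python
def selection_metrics(selected, truly_relevant):
    """Return (PSR, FDR) for a collection of selected variable indices.

    Assumed definitions (not spelled out in the abstract):
      PSR = |selected AND relevant| / |relevant|
      FDR = |selected AND NOT relevant| / |selected|
    """
    selected, truly_relevant = set(selected), set(truly_relevant)
    true_pos = len(selected & truly_relevant)
    psr = true_pos / len(truly_relevant) if truly_relevant else 0.0
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return psr, fdr


def outlier_metrics(flagged, truly_mislabeled, n_samples):
    """Return (Sn, FPR) for a collection of sample indices flagged as outliers.

    Assumed definitions (not spelled out in the abstract):
      Sn  = |flagged AND mislabeled| / |mislabeled|
      FPR = |flagged AND NOT mislabeled| / |correctly labeled samples|
    """
    flagged, truly_mislabeled = set(flagged), set(truly_mislabeled)
    true_pos = len(flagged & truly_mislabeled)
    n_clean = n_samples - len(truly_mislabeled)
    sn = true_pos / len(truly_mislabeled) if truly_mislabeled else 0.0
    fpr = (len(flagged) - true_pos) / n_clean if n_clean else 0.0
    return sn, fpr


# Toy example with made-up indices: 4 relevant variables, 4 mislabeled samples.
psr, fdr = selection_metrics(selected=[0, 1, 2, 7], truly_relevant=[0, 1, 2, 3])
sn, fpr = outlier_metrics(flagged=[5, 9, 40], truly_mislabeled=[5, 9, 12, 30],
                          n_samples=100)
print(f"PSR={psr:.2f} FDR={fdr:.2f} Sn={sn:.2f} FPR={fpr:.4f}")
```

In a simulation study such as the one summarized above, the sets of relevant variables and mislabeled samples are known by construction, which is what makes these measures computable.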

Conclusions: When the proportion of outliers is ≤5%, Ensemble can be used for variable selection. When the proportion of outliers is >5%, Ensemble can be used for variable selection on a subset after removing outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers can be estimated according to the inaccuracy of the diagnostic methods used.
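The recommendation in the conclusions can be read as a simple decision rule. The sketch below encodes that reading. The function name, the returned dictionary, and the treatment of the 5% cut-off as a parameter are assumptions made for illustration. It only names the recommended steps and does not call Ensemble or enetLTS.

```python
def recommend_workflow(estimated_outlier_proportion, threshold=0.05):
    """Map an estimated proportion of mislabeled samples to the recommended steps.

    `threshold` defaults to the 5% cut-off stated in the conclusions.
    The return structure is hypothetical and only labels the steps.
    """
    if not 0.0 <= estimated_outlier_proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")

    if estimated_outlier_proportion <= threshold:
        variable_selection = "Ensemble on the full dataset"
    else:
        variable_selection = (
            "Ensemble on the subset left after removing enetLTS-flagged outliers"
        )

    return {
        "outlier_identification": "enetLTS",
        "variable_selection": variable_selection,
    }


# Example: a diagnostic method believed to mislabel about 10% of samples.
plan = recommend_workflow(0.10)
print(plan["variable_selection"])
```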

Keywords: Ensemble; Feature selection; Mislabeled; Rlogreg; Robust; enetLTS.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Variable selection accuracy of Rlogreg, enetLTS, and Ensemble when n = 100. **Abbreviations**: *PSR*, Positive Selection Rate. *FDR*, False Discovery Rate.

**Fig. 2**
Variable selection accuracy of Rlogreg, enetLTS, and Ensemble when n = 500. **Abbreviations**: *PSR*, Positive Selection Rate. *FDR*, False Discovery Rate.

**Fig. 3**
Outlier detection accuracy of Rlogreg, enetLTS, and Ensemble. **Abbreviations**: *Sn*, Sensitivity. *FPR*, False Positive Rate.

**Fig. 4**
Prediction accuracy of Rlogreg and enetLTS. **Abbreviations**: *MR*, Misclassification Rate.

**Fig. 5**
Outlier detection accuracy for the simulated datasets based on the TNBC dataset. **Abbreviations**: *Sn*, Sensitivity. *FPR*, False Positive Rate.

**Fig. 6**
The intersection of genes selected by Ensemble’s three models on the original TNBC dataset.

**Fig. 7**
The intersection of genes selected by Ensemble’s three models on the subset with outliers removed.

Cited by

An Efficient Algorithm for the Detection of Outliers in Mislabeled Omics Data.
Sun H, Wang J, Zhang Z, Hu N, Wang T. Comput Math Methods Med. 2021 Dec 22;2021:9436582. doi: 10.1155/2021/9436582. eCollection 2021. PMID: 34976114. Free PMC article.
Biases in machine-learning models of human single-cell data.
Willem T, Shitov VA, Luecken MD, Kilbertus N, Bauer S, Piraud M, Buyx A, Theis FJ. Nat Cell Biol. 2025 Mar;27(3):384-392. doi: 10.1038/s41556-025-01619-8. Epub 2025 Feb 19. PMID: 39972066. Review.
TidyMass an object-oriented reproducible analysis framework for LC-MS data.
Shen X, Yan H, Wang C, Gao P, Johnson CH, Snyder MP. Nat Commun. 2022 Jul 28;13(1):4365. doi: 10.1038/s41467-022-32155-w. PMID: 35902589. Free PMC article.
Glucose Sensing in Human Whole Blood Based on Near-Infrared Phosphors and Outlier Treatment with the Programming Language "R".
Lee HA, Lin PY, Solomatina AI, Koshevoy IO, Tunik SP, Lin HW, Pan SW, Ho ML. ACS Omega. 2021 Dec 20;7(1):198-206. doi: 10.1021/acsomega.1c04344. eCollection 2022 Jan 11. PMID: 35036691. Free PMC article.
EnsMOD: A Software Program for Omics Sample Outlier Detection.
Manes NP, Song J, Nita-Lazar A. J Comput Biol. 2023 Jun;30(6):726-735. doi: 10.1089/cmb.2022.0243. Epub 2023 Apr 12. PMID: 37042708. Free PMC article.
