Comput Math Methods Med. 2021 Dec 22;2021:9436582. doi: 10.1155/2021/9436582. eCollection 2021.

An Efficient Algorithm for the Detection of Outliers in Mislabeled Omics Data

Hongwei Sun¹,², Jiu Wang¹, Zhongwen Zhang¹, Naibao Hu¹, Tong Wang²

Affiliations

¹ Department of Health Statistics, School of Public Health and Management, Binzhou Medical University, Yantai City, Shandong 264003, China.
² Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan City, Shanxi 030001, China.

PMID: 34976114
PMCID: PMC8716222
DOI: 10.1155/2021/9436582

Hongwei Sun et al. Comput Math Methods Med. 2021.

Abstract

High dimensionality and noise make it difficult to detect relevant biomarkers in omics data. Previous studies have shown that penalized maximum trimmed likelihood estimation is effective in identifying mislabeled samples in high-dimensional data with labeling errors. However, the algorithm commonly used in these studies is the concentration step (C-step), and when the C-step is applied to robust penalized regression it does not guarantee that the criterion function improves at each iteration, because the regularization parameters change during the iteration. This makes the C-step algorithm run very slowly, especially on high-dimensional omics data. To address this, the AR-Cstep algorithm (C-step combined with an acceptance-rejection scheme) is proposed. In simulation experiments, the AR-Cstep algorithm converged faster (its average computation time was only 2% of that of the C-step algorithm) and was more accurate than the C-step algorithm in terms of variable selection and outlier identification. The two algorithms were further compared on triple-negative breast cancer (TNBC) RNA-seq data. AR-Cstep solves the problem of the C-step not converging and ensures that each iteration moves in a direction that improves the criterion function. As an improvement of the C-step algorithm, the AR-Cstep algorithm can be extended to other robust models with regularization parameters.
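The idea of guarding a C-step with an accept/reject check can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the details of AR-Cstep, so the code below makes several assumptions. A ridge-penalized logistic regression stands in for the elastic net, the penalty `lam` is held fixed rather than re-tuned, and the loop simply stops at the first rejected candidate. All function names (`fit_ridge_logistic`, `sample_loss`, `criterion`, `ar_cstep`) are hypothetical.

```python
import numpy as np

def fit_ridge_logistic(X, y, lam, n_iter=25):
    """Ridge-penalized logistic fit by damped Newton steps (assumed stand-in for elastic net)."""
    n, p = X.shape
    beta = np.zeros(p)

    def objective(b):
        z = X @ b
        return np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * b @ b

    for _ in range(n_iter):
        prob = 0.5 * (1.0 + np.tanh(0.5 * (X @ beta)))  # numerically stable sigmoid
        grad = X.T @ (prob - y) / n + lam * beta
        hess = (X.T * (prob * (1.0 - prob))) @ X / n + lam * np.eye(p)
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # Halve the step until the objective does not increase.
        while objective(beta - t * step) > objective(beta) and t > 1e-8:
            t *= 0.5
        beta = beta - t * step
    return beta

def sample_loss(X, y, beta):
    """Per-sample negative log-likelihood under the logistic model."""
    z = X @ beta
    return np.logaddexp(0.0, z) - y * z

def criterion(X, y, beta, subset, lam):
    """Trimmed penalized criterion: mean loss over the subset plus the ridge penalty."""
    return sample_loss(X[subset], y[subset], beta).mean() + 0.5 * lam * beta @ beta

def ar_cstep(X, y, h, lam, subset, max_iter=50):
    """C-step iterations in which a candidate subset is kept only if the criterion decreases."""
    beta = fit_ridge_logistic(X[subset], y[subset], lam)
    history = [criterion(X, y, beta, subset, lam)]
    for _ in range(max_iter):
        # C-step: keep the h samples that the current model fits best.
        candidate = np.sort(np.argsort(sample_loss(X, y, beta))[:h])
        if np.array_equal(candidate, subset):
            break
        cand_beta = fit_ridge_logistic(X[candidate], y[candidate], lam)
        cand_value = criterion(X, y, cand_beta, candidate, lam)
        if cand_value < history[-1]:   # accept: the criterion improved
            subset, beta = candidate, cand_beta
            history.append(cand_value)
        else:                          # reject: keep the previous subset and stop
            break
    return subset, beta, history

# Synthetic example (assumed data): 200 samples, 5 features, 20 flipped labels.
rng = np.random.default_rng(0)
n, p, h = 200, 5, 160
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -2.0, 0.0, 0.0, 0.0])
y = (rng.random(n) < 0.5 * (1.0 + np.tanh(0.5 * (X @ true_beta)))).astype(float)
flipped = rng.choice(n, size=20, replace=False)
y[flipped] = 1.0 - y[flipped]
start = np.sort(rng.choice(n, size=h, replace=False))
subset, beta, history = ar_cstep(X, y, h, lam=0.1, subset=start)
suspected_outliers = np.setdiff1d(np.arange(n), subset)
```

With a fixed `lam` and an accurate fit, the reject branch should rarely trigger. The guard matters in the setting the abstract describes, where the regularization parameters are re-selected between iterations and a plain C-step can move to a subset with a worse criterion value. Samples left out of the final `subset` are the ones this sketch would flag as possibly mislabeled.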

Conflict of interest statement

The authors declare that they have no conflicts of interest.

Figures

**Figure 1**
Results of MTL-EN, enetLTS, and Ensemble when n = 500 and p = 1000. Sn: sensitivity; FPR: False Positive Rate; PSR: Positive Selection Rate; FDR: False Discovery Rate.

**Algorithm 1**
Description of C-step algorithm.

**Algorithm 2**
Description of AR-Cstep algorithm.
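
The four measures named in the Figure 1 caption can be computed from a pair of boolean vectors. The sketch below assumes the standard definitions, since this excerpt does not state the paper's exact formulas: Sn and FPR are taken over samples (true outliers versus flagged outliers), while PSR and FDR are taken over variables (truly relevant versus selected). The function name `rates` is hypothetical.

```python
import numpy as np

def rates(truth, flagged):
    """Return sensitivity-type, false-positive and false-discovery rates for two boolean vectors."""
    truth = np.asarray(truth, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)
    tp = int(np.sum(truth & flagged))
    fp = int(np.sum(~truth & flagged))
    fn = int(np.sum(truth & ~flagged))
    tn = int(np.sum(~truth & ~flagged))
    return {
        # Share of true positives that were found (Sn for samples, PSR for variables).
        "hit_rate": tp / (tp + fn) if tp + fn else float("nan"),
        # Share of true negatives that were wrongly flagged (FPR).
        "fpr": fp / (fp + tn) if fp + tn else float("nan"),
        # Share of flagged items that are false (FDR).
        "fdr": fp / (tp + fp) if tp + fp else float("nan"),
    }

# Toy example: 2 true outliers among 5 samples, 2 samples flagged.
example = rates([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```

Applied to samples, `hit_rate` and `fpr` correspond to Sn and FPR. Applied to variables, `hit_rate` and `fdr` correspond to PSR and FDR under the assumed definitions.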
