BMC Bioinformatics. 2020 Aug 14;21(1):357.
doi: 10.1186/s12859-020-03653-9.

Comparison of methods for the detection of outliers and associated biomarkers in mislabeled omics data

Hongwei Sun et al. BMC Bioinformatics.

Abstract

Background: Previous studies have reported that labeling errors are not uncommon in omics data. Mislabeled samples act as potential outliers and may severely undermine the correct classification of patients and the identification of reliable biomarkers for a particular disease. Three methods have been proposed to address the problem: sparse label-noise-robust logistic regression (Rlogreg), robust elastic net based on least trimmed squares (enetLTS), and Ensemble, an ensemble classification approach based on distinct feature selection and modeling strategies. The accuracy of biomarker selection and outlier detection of these methods needs to be evaluated and compared so that the appropriate method can be chosen.
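To make the notion of a mislabeled sample concrete, the following minimal sketch flips a chosen proportion of binary class labels. It is only an illustration under stated assumptions: the function name `flip_labels`, the 0/1 label coding, and the uniform random choice of samples to flip are hypothetical and are not taken from the paper's simulation design.

```python
import random


def flip_labels(labels, proportion, seed=0):
    """Return a copy of 0/1 labels with a given proportion flipped.

    Hypothetical helper: samples to flip are drawn uniformly at random,
    which is an assumption and not the paper's simulation scheme.
    """
    rng = random.Random(seed)
    n_flip = round(proportion * len(labels))
    # Indices of the samples whose labels become wrong (the "outliers").
    flipped = sorted(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in flipped:
        noisy[i] = 1 - noisy[i]
    return noisy, flipped


# 100 samples, 5% of which end up mislabeled.
labels = [0] * 50 + [1] * 50
noisy, flipped = flip_labels(labels, proportion=0.05)
```

The returned index list plays the role of the ground truth against which an outlier detection method could be scored.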

Results: The accuracy of variable selection, outlier identification, and prediction of the three methods (Ensemble, enetLTS, Rlogreg) was compared on simulated datasets and an RNA-seq dataset. On simulated datasets, Ensemble had the highest variable selection accuracy, as measured by a comprehensive index, and the lowest false discovery rate among the three methods. When the sample size was large and the proportion of outliers was ≤5%, the positive selection rate of Ensemble was similar to that of enetLTS. However, when the proportion of outliers was 10% or 15%, Ensemble missed some variables that affected the response variable. Overall, enetLTS had the best outlier detection accuracy, with false positive rates <0.05 and high sensitivity, and it still performed well when the proportion of outliers was relatively large. With 1% or 2% outliers, Ensemble showed high outlier detection accuracy, but with higher proportions of outliers it missed many mislabeled samples. Rlogreg and Ensemble were less accurate in identifying outliers than enetLTS. The prediction accuracy of enetLTS was better than that of Rlogreg. Running Ensemble on a subset of the data after removing the outliers identified by enetLTS improved the variable selection accuracy of Ensemble.
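The evaluation metrics named in the results and figure captions (positive selection rate, false discovery rate, sensitivity, false positive rate, misclassification rate) can be sketched as follows. The abstract does not give formulas, so this assumes the usual definitions: PSR is the share of truly relevant variables that were selected, and FDR is the share of selected variables that are not truly relevant. The function names and the toy inputs are hypothetical.

```python
def selection_metrics(selected, truly_relevant):
    """Positive selection rate (PSR) and false discovery rate (FDR)."""
    selected, truly_relevant = set(selected), set(truly_relevant)
    true_pos = len(selected & truly_relevant)
    psr = true_pos / len(truly_relevant) if truly_relevant else 0.0
    fdr = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return psr, fdr


def outlier_metrics(flagged, truly_mislabeled, n_samples):
    """Sensitivity (Sn) and false positive rate (FPR) of outlier detection."""
    flagged, truly_mislabeled = set(flagged), set(truly_mislabeled)
    true_pos = len(flagged & truly_mislabeled)
    false_pos = len(flagged - truly_mislabeled)
    n_clean = n_samples - len(truly_mislabeled)
    sn = true_pos / len(truly_mislabeled) if truly_mislabeled else 0.0
    fpr = false_pos / n_clean if n_clean else 0.0
    return sn, fpr


def misclassification_rate(predicted, actual):
    """Misclassification rate (MR): fraction of predictions that are wrong."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)


# Toy example: variables 1-5 are truly relevant, a method selects 1, 2, 3, 7.
psr, fdr = selection_metrics({1, 2, 3, 7}, {1, 2, 3, 4, 5})
# Toy example: samples 0, 4, 5, 6 are mislabeled, a method flags 0, 4, 9.
sn, fpr = outlier_metrics({0, 4, 9}, {0, 4, 5, 6}, n_samples=100)
mr = misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0])
```

In the toy example the method recovers three of five relevant variables (PSR 0.6) with one false selection out of four (FDR 0.25), and flags two of four mislabeled samples (Sn 0.5) plus one of the 96 correctly labeled ones.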

Conclusions: When the proportion of outliers is ≤5%, Ensemble can be used for variable selection. When the proportion of outliers is >5%, Ensemble can be used for variable selection on a subset after removing outliers identified by enetLTS. For outlier identification, enetLTS is the recommended method. In practice, the proportion of outliers can be estimated according to the inaccuracy of the diagnostic methods used.
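The recommendations above amount to a simple decision rule keyed on the estimated proportion of outliers. A minimal sketch, assuming the proportion is supplied as a fraction between 0 and 1; the function name `recommend_workflow` and the returned dictionary layout are hypothetical and only encode the thresholds stated in the conclusions:

```python
def recommend_workflow(outlier_proportion):
    """Map an estimated outlier proportion (0 to 1) to the suggested workflow."""
    if not 0.0 <= outlier_proportion <= 1.0:
        raise ValueError("outlier_proportion must be between 0 and 1")
    if outlier_proportion <= 0.05:
        selection = "Ensemble on the full dataset"
    else:
        selection = "Ensemble on the subset left after removing enetLTS-flagged outliers"
    # enetLTS is recommended for outlier identification in both cases.
    return {
        "variable_selection": selection,
        "outlier_identification": "enetLTS",
    }


low_noise = recommend_workflow(0.02)
high_noise = recommend_workflow(0.10)
```

The input would come from an external estimate, for example the known error rate of the diagnostic method used to assign labels.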

Keywords: Ensemble; Feature selection; Mislabeled; Rlogreg; Robust; enetLTS.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1. Variable selection accuracy of Rlogreg, enetLTS, and Ensemble when n = 100. Abbreviations: PSR, positive selection rate; FDR, false discovery rate.
Fig. 2. Variable selection accuracy of Rlogreg, enetLTS, and Ensemble when n = 500. Abbreviations: PSR, positive selection rate; FDR, false discovery rate.
Fig. 3. Outlier detection accuracy of Rlogreg, enetLTS, and Ensemble. Abbreviations: Sn, sensitivity; FPR, false positive rate.
Fig. 4. Prediction accuracy of Rlogreg and enetLTS. Abbreviations: MR, misclassification rate.
Fig. 5. Outlier detection accuracy for the simulated datasets based on the TNBC dataset. Abbreviations: Sn, sensitivity; FPR, false positive rate.
Fig. 6. The intersection of genes selected by Ensemble’s three models on the original TNBC dataset.
Fig. 7. The intersection of genes selected by Ensemble’s three methods on the subset with outliers removed.
