Stat Methods Med Res. 2022 May;31(5):947-958.
doi: 10.1177/09622802211072456. Epub 2022 Jan 24.

ROSIE: RObust Sparse ensemble for outlIEr detection and gene selection in cancer omics data

Antje Jensch et al. Stat Methods Med Res. 2022 May.

Abstract

The extraction of novel information from omics data is a challenging task, in particular because the number of features (e.g. genes) often far exceeds the number of samples. In such a setting, conventional parameter estimation leads to ill-posed optimization problems, and regularization may be required. In addition, outliers can strongly affect classification accuracy.

Here we introduce ROSIE, an ensemble classification approach that combines three sparse and robust classification methods for outlier detection and feature selection and further performs a bootstrap-based validity check. Outliers of ROSIE are determined by the rank product test using the outlier rankings of all three methods, and important features are those selected by all three methods.

We apply ROSIE to RNA-Seq data from The Cancer Genome Atlas (TCGA) to classify observations into Triple-Negative Breast Cancer (TNBC) and non-TNBC tissue samples. The pre-processed dataset consists of 16,600 genes and more than 1,000 samples. We demonstrate that ROSIE selects important features and outliers in a robust way. Identified outliers are concordant with the distribution of the genes commonly selected by the three methods, and results are in line with other independent studies. Furthermore, we discuss the association of some of the selected genes with the TNBC subtype in other investigations. In summary, ROSIE constitutes a robust and sparse procedure to identify outliers and important genes through binary classification. Our approach is readily applicable to other datasets, fulfilling the overall goal of simultaneously identifying outliers and candidate disease biomarkers to be targeted in therapy research and personalized medicine frameworks.

Keywords: Ensemble; biomarker; classification; feature selection; outlier; robust; sparse; Triple-Negative Breast Cancer.
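
The ensemble step described in the abstract (combining three outlier rankings by a rank product test and keeping the features selected by all methods) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy rankings, the placeholder gene labels, and the use of a pooled permutation null to obtain p-values are all assumptions made here for illustration. Rank 1 is assumed to mean "most outlying".

```python
import math
import random
from bisect import bisect_right


def rank_products(rankings):
    """Multiply each sample's ranks across methods (rank 1 = most outlying)."""
    return [math.prod(ranks) for ranks in zip(*rankings)]


def rank_product_pvalues(rankings, n_perm=500, seed=0):
    """Permutation p-values (an assumed null model, not taken from the paper).

    Each method's ranking is shuffled independently; the p-value of a sample
    is the share of null rank products that are at most its observed product.
    """
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = [rng.sample(r, len(r)) for r in rankings]
        null.extend(rank_products(shuffled))
    null.sort()
    observed = rank_products(rankings)
    return [(1 + bisect_right(null, rp)) / (1 + len(null)) for rp in observed]


def common_features(selected):
    """Features selected by every method (set intersection)."""
    return set.intersection(*(set(s) for s in selected))


# Toy outlier rankings of six samples from three hypothetical methods.
rankings = [
    [1, 2, 3, 4, 5, 6],
    [1, 3, 2, 5, 4, 6],
    [2, 1, 3, 4, 6, 5],
]
# Placeholder gene labels, not genes reported in the study.
selected = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_B", "GENE_C", "GENE_D"],
    ["GENE_C", "GENE_B", "GENE_E"],
]

products = rank_products(rankings)       # [2, 6, 18, 80, 120, 180]
pvals = rank_product_pvalues(rankings)   # small p-value = consistently top-ranked
shared = common_features(selected)       # {"GENE_B", "GENE_C"}
print(products, pvals, sorted(shared))
```

Working with the integer product rather than the geometric mean keeps the comparison against the null exact; the two orderings are identical because the geometric mean is a monotone function of the product.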

Conflict of interest statement

Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Figures

Figure 1.
ROSIE workflow and robust and sparse classification methods. A) Three robust and sparse methods perform classification on the dataset. Each method provides an outlier ranking and selected features. Rankings are combined to acquire an outlier list. Important features are taken as the intersection of all three selected feature sets. Validity of the method is assessed by repeatedly classifying bootstrap sampled datasets and comparing the results with the main part. B) Simplified representation of the underlying classification methods, i.e., sparse robust discriminant analysis with sparse partial robust M regression (SPRM), robust and sparse K-means clustering (RSK-means) and robust and sparse logistic regression with elastic net penalty (enetLTS) for exemplary data comprising two classes and two features (p1, p2).
Figure 2.
ROC curves for simulation study results. Results comparing ROSIE with single methods for three outlier settings. Average AUC values: ROSIE (0.81), ENET (0.79), SPRM (0.76), RSKC (0.65).
Figure 3.
Correlation analysis of selected features. Heatmap of correlation values of the 54 commonly selected features.
Figure 4.
Relation between influential samples and commonly selected genes. Estimated densities of gene expression of selected features grouped by TNBC (green dashed line) and non-TNBC (red line). Vertical lines represent respective group medians. Blue markers depict influential samples.
Figure 5.
Venn diagrams comparing different classification approaches. Comparison of identified outliers (left) and selected genes (right) from ROSIE, the sparse ensemble approach by Lopes et al. and the robust approach enetLTS by Segaert et al.
