BMC Bioinformatics. 2018 May 4;19(1):168.
doi: 10.1186/s12859-018-2149-7.

Ensemble outlier detection and gene selection in triple-negative breast cancer data

Marta B Lopes et al. BMC Bioinformatics. 2018.

Abstract

Background: Learning accurate models from 'omics data poses many challenges owing to the data's inherent high dimensionality (e.g. the number of gene expression variables) and comparatively small sample sizes, which lead to ill-posed inverse problems. Furthermore, the presence of outliers, whether experimental errors or interesting abnormal clinical cases, may severely hamper the correct classification of patients and the identification of reliable biomarkers for a particular disease. We propose to address this problem through an ensemble classification setting based on distinct feature selection and modeling strategies, including logistic regression with elastic net regularization, Sparse Partial Least Squares - Discriminant Analysis (SPLS-DA) and Sparse Generalized PLS (SGPLS), coupled with an evaluation of each individual's outlierness based on Cook's distance. The consensus is achieved with Rank Product statistics corrected for multiple testing, which yield a final list of observations sorted by their level of outlierness.
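
The consensus step described above can be illustrated with a minimal sketch. It assumes that each model has already produced one influence score per individual (for example a Cook's distance), ranks the individuals within each model, and combines the ranks with a rank product. The function names, the permutation-based p-values, and the Benjamini-Hochberg correction are assumptions made for illustration; the abstract does not state how the p-values or the multiple-testing correction were computed.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed correction method)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(n)
    adj[order] = np.clip(adj_sorted, 0.0, 1.0)
    return adj


def consensus_outliers(influence, n_perm=2000, seed=0):
    """Rank-product consensus over per-model influence scores.

    influence: array of shape (n_obs, n_models); larger = more influential.
    Returns (order, pvals, qvals), where order lists observation indices
    from most to least outlying.
    """
    influence = np.asarray(influence, dtype=float)
    n_obs, n_models = influence.shape
    # Rank 1 = most influential observation within each model.
    ranks = np.argsort(np.argsort(-influence, axis=0), axis=0) + 1
    log_rp = np.log(ranks).sum(axis=1)  # log of the rank product

    # Null distribution (assumption): ranks are independent random
    # permutations in every model.
    rng = np.random.default_rng(seed)
    log_ranks = np.log(np.arange(1, n_obs + 1))
    hits = np.zeros(n_obs)
    for _ in range(n_perm):
        null = np.sort(sum(rng.permutation(log_ranks) for _ in range(n_models)))
        # Count null rank products at least as small as the observed one.
        hits += np.searchsorted(null, log_rp + 1e-9, side="right")
    pvals = (hits + 1.0) / (n_perm * n_obs + 1.0)
    qvals = benjamini_hochberg(pvals)
    return np.argsort(log_rp), pvals, qvals


if __name__ == "__main__":
    # Synthetic scores for 50 individuals and 3 models (not real data).
    demo_rng = np.random.default_rng(1)
    scores = demo_rng.random((50, 3))
    scores[0] += 10.0  # individual 0 is influential in every model
    order, pvals, qvals = consensus_outliers(scores)
    print("most outlying individuals:", order[:5])
    print("adjusted p-values:", np.round(qvals[order[:5]], 4))
```

Working on the log scale avoids overflow when many models are combined, and an individual receives a small adjusted p-value only when it ranks near the top in most models, which is the behavior a consensus list is meant to capture.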

Results: We applied this strategy to the classification of Triple-Negative Breast Cancer (TNBC) RNA-Seq and clinical data from The Cancer Genome Atlas (TCGA). The 24 detected outliers were identified as putative mislabeled samples: individuals with discrepant clinical labels for the HER2 receptor, as well as individuals whose expression values of ER, PR and HER2 are abnormal and inconsistent with the corresponding clinical labels, which may invalidate the initial TNBC label. Moreover, the model consensus approach leads to the selection of a set of genes that may be linked to the disease. These results are robust to resampling, whether of a subset of patients or a subset of genes, with a significant overlap among the outlier patients identified.

Conclusions: The proposed ensemble outlier detection approach constitutes a robust procedure to identify abnormal cases and consensus covariates, which may improve biomarker selection for precision medicine applications. The method can also be easily extended to other regression models and datasets.

Keywords: Ensemble modeling; High-dimensionality; Outlier detection; Rank Product test; Triple-negative breast cancer.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Individuals’ distributions in the space spanned by the first two SPLS-DA latent vectors. Circles, non-TNBC individuals; triangles, TNBC individuals; blue data points are influential observations; red data points are influential observations which are suspect regarding their HER2 label
Fig. 2
Individuals’ distributions in the space spanned by the first two Principal Components. (a) Symbols correspond to actual labels: circles, non-TNBC individuals; triangles, TNBC individuals; blue data points are influential observations; red data points are influential observations which are suspect regarding their HER2 label. (b) Symbols correspond to labels predicted by the EM algorithm: circles, non-TNBC individuals; triangles, TNBC individuals; red data points are actual non-TNBC observations for which at least one of the 3 TNBC-associated genes has an arguably high expression value
