BMC Bioinformatics. 2018 Apr 11;19(Suppl 5):118. doi: 10.1186/s12859-018-2095-4.

BCDForest: a boosting cascade deep forest model towards the classification of cancer subtypes based on gene expression data


Yang Guo et al. BMC Bioinformatics. 2018.

Abstract

Background: The classification of cancer subtypes is of great importance to cancer diagnosis and therapy. Many supervised learning approaches have been applied to cancer subtype classification in the past few years, especially deep learning-based approaches. Recently, the deep forest model has been proposed as an alternative to deep neural networks, learning hyper-representations through cascades of ensembles of decision trees. It has been shown that the deep forest model can, to some extent, match or even outperform deep neural networks. However, the standard deep forest model may face overfitting and ensemble-diversity challenges when dealing with small-sample, high-dimensional biological data.

Results: In this paper, we propose a deep learning model, called BCDForest, to address cancer subtype classification on small-scale biological datasets; it can be viewed as a modification of the standard deep forest model. BCDForest differs from the standard deep forest model in two main ways. First, a multi-class-grained scanning method is proposed that trains multiple binary classifiers to encourage ensemble diversity, while the fitting quality of each classifier is taken into account in representation learning. Second, we propose a boosting strategy that emphasizes more important features in the cascade forests, propagating the benefits of discriminative features across cascade layers to improve classification performance. Systematic comparison experiments on both microarray and RNA-Seq gene expression datasets demonstrate that our method consistently outperforms state-of-the-art methods in cancer subtype classification.

Conclusions: The multi-class-grained scanning and boosting strategies in our model provide an effective way to ease overfitting and improve the robustness of the deep forest model on small-scale data. Our model offers a useful approach to the classification of cancer subtypes by applying deep learning to high-dimensional, small-scale biological data.

Keywords: Cancer subtype; Cascade forest; Classification.


Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Illustration of the cascade forest structure. Each level of the cascade consists of two random forests (black) and two completely random forests (red). Suppose there are three classes to predict; each forest then outputs a three-dimensional class vector, which is concatenated with the original input as its re-representation [23]
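
As a rough illustration of one such cascade level, the sketch below uses scikit-learn estimators as stand-ins for the paper's forests (RandomForestClassifier for the black forests, ExtraTreesClassifier with max_features=1 as an approximation of the completely random forests). The fold count, forest sizes, and out-of-fold estimation of class vectors are assumptions for illustration, not the authors' exact configuration.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict

def cascade_level(X, y, random_state=0):
    """One cascade level: four forests produce class vectors that augment X."""
    forests = [
        RandomForestClassifier(n_estimators=100, random_state=random_state),
        RandomForestClassifier(n_estimators=100, random_state=random_state + 1),
        ExtraTreesClassifier(n_estimators=100, max_features=1, random_state=random_state + 2),
        ExtraTreesClassifier(n_estimators=100, max_features=1, random_state=random_state + 3),
    ]
    class_vectors = []
    for forest in forests:
        # Out-of-fold class-probability vectors (one column per class), so the
        # augmented representation is not predicted on the same data it was fitted on.
        class_vectors.append(cross_val_predict(forest, X, y, cv=3, method="predict_proba"))
        forest.fit(X, y)  # refit on all data; the fitted forests can be reused later
    # Concatenate the four class vectors with the original input for the next level.
    return np.hstack([X] + class_vectors), forests
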
Fig. 2
Illustration of the boosting cascade forest structure. Each level of the cascade consists of two random forests (black) and two completely random forests (red). The standard deviation of the top-k important features in each forest composes a new feature that is concatenated at the next cascade level to emphasize the discriminative features
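
The caption leaves the exact statistic somewhat open; the sketch below takes one plausible reading, in which each fitted forest's top-k features (ranked by impurity-based importance) contribute a single per-sample standard-deviation column for the next level. The value of k, the importance measure, and the per-sample reading are assumptions.

import numpy as np

def boosting_features(X, fitted_forests, k=10):
    """Append one std-based 'boosting' column per fitted forest (k is illustrative)."""
    extra = []
    for forest in fitted_forests:
        top_k = np.argsort(forest.feature_importances_)[-k:]   # indices of the k most important features
        extra.append(X[:, top_k].std(axis=1, keepdims=True))   # per-sample std over those k features
    return np.hstack([X] + extra)

On this reading, the augmented matrix and fitted forests produced by one level (as in the Fig. 1 sketch) would be passed through boosting_features before feeding the next cascade level.
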
Fig. 3
Illustration of multi-class-grained scanning. a Suppose there are four classes (A, B, C and D) in the training dataset. For each class, we produce positive and negative sub-datasets and use them to train a binary random forest classifier, so four different random forests are produced from the different (sliding-window-based) training datasets. The out-of-bag (OOB) score of each forest is used to calculate a normalized quantity weight for that forest. b Based on the fitted forests and their quantity weights, a 500-dim instance vector can be transformed into a concatenated 1604-dim representation
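
A minimal sketch of this scanning step is given below, assuming stride-1 sliding windows and that each class-specific binary forest contributes one OOB-weighted positive-class probability per window; under those assumptions a 100-dim window over a 500-dim instance yields (500 - 100 + 1) x 4 = 1604 concatenated values, matching the figure. The helper names and estimator settings are illustrative only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sliding_windows(X, window, stride=1):
    """Cut each row of X into overlapping windows: (n_samples, n_windows, window)."""
    n_windows = (X.shape[1] - window) // stride + 1
    return np.stack([X[:, i * stride:i * stride + window] for i in range(n_windows)], axis=1)

def multi_class_grained_scan(X, y, window=100, random_state=0):
    """One-vs-rest binary forests over sliding windows, weighted by normalized OOB scores."""
    wins = sliding_windows(X, window)                      # (n, w, window)
    n, w, _ = wins.shape
    flat = wins.reshape(n * w, window)                     # one row per window instance
    labels = np.repeat(y, w)                               # each sample's label repeated per window

    forests, oob_scores = [], []
    for cls in np.unique(y):
        # Positive/negative sub-dataset for this class -> one binary random forest.
        rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=random_state)
        rf.fit(flat, (labels == cls).astype(int))
        forests.append(rf)
        oob_scores.append(rf.oob_score_)

    weights = np.asarray(oob_scores) / np.sum(oob_scores)  # normalized quantity weights

    # Each forest contributes its weighted positive-class probability for every window.
    per_class = [wgt * rf.predict_proba(flat)[:, 1].reshape(n, w)
                 for rf, wgt in zip(forests, weights)]
    return np.hstack(per_class)                            # (n, w * n_classes), e.g. 401 * 4 = 1604
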
Fig. 4
Overall procedure of BCDForest. Suppose there are four classes, and the sliding windows are 100-dim and 200-dim. Two cascade layers are used to give the final prediction
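
For the window sizes in this figure, the window-count arithmetic works out as below, assuming stride-1 scanning and one value per window per class-specific forest, consistent with the 1604-dim example in Fig. 3.

def n_windows(input_dim, window, stride=1):
    """Number of stride-based sliding windows over an input_dim-length vector."""
    return (input_dim - window) // stride + 1

input_dim, n_classes = 500, 4                 # illustrative sizes from Figs. 3 and 4
for window in (100, 200):
    w = n_windows(input_dim, window)
    print(f"window={window}: {w} windows -> {w * n_classes}-dim scanned representation")
# window=100: 401 windows -> 1604-dim scanned representation
# window=200: 301 windows -> 1204-dim scanned representation
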
Fig. 5
Comparison of different methods on the large-scale pan-cancer dataset. Each dot represents the performance of the corresponding method on one cancer type; 11 cancer types were included in the pan-cancer dataset
Fig. 6
Comparison of BCDForest and gcForest on three cancer datasets (BRCA, GBM and LUNG). Each dot represents the performance of each method on one cancer subtype class
Fig. 7
Comparison of the overall performance of BCDForest and gcForest on the BRCA, GBM and LUNG datasets. The average precision, recall and F1 score over all subtype classes of each dataset were evaluated
Fig. 8
Comparison of the overall performance of BCDForest and gcForest on the COAD dataset. The average precision, recall and F1 score over all subtype classes were evaluated
Fig. 9
Comparison of the overall performance of BCDForest and gcForest on the LIHC dataset. The average precision, recall and F1 score over all subtype classes were evaluated


References

1. Stingl J, Caldas C. Molecular heterogeneity of breast carcinomas and the cancer stem cell hypothesis. Nat Rev Cancer. 2007;7(10):791-799. doi: 10.1038/nrc2212.
2. Bianchini G, Iwamoto T, Qi Y, Coutant C, Shiang CY, Wang B, Santarpia L, Valero V, Hortobagyi GN, Symmans WF, et al. Prognostic and therapeutic implications of distinct kinase expression patterns in different subtypes of breast cancer. Cancer Res. 2010;70(21):8852-8862. doi: 10.1158/0008-5472.CAN-10-1039.
3. Heiser LM, Sadanandam A, Kuo WL, Benz SC, Goldstein TC, Ng S, Gibb WJ, Wang NJ, Ziyad S, Tong F, et al. Subtype and pathway specific responses to anticancer compounds in breast cancer. Proc Natl Acad Sci U S A. 2012;109(8):2724-2729. doi: 10.1073/pnas.1018854108.
4. Prat A, Parker JS, Karginova O, Fan C, Livasy C, Herschkowitz JI, He X, Perou CM. Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer. Breast Cancer Res. 2010;12(5):R68. doi: 10.1186/bcr2635.
5. Jahid MJ, Huang TH, Ruan J. A personalized committee classification approach to improving prediction of breast cancer metastasis. Bioinformatics. 2014;30(13):1858-1866. doi: 10.1093/bioinformatics/btu128.
