Computer Aided Breast Cancer Detection Using Ensembling of Texture and Statistical Image Features

Soumya Deep Roy¹, Soham Das¹, Devroop Kar², Friedhelm Schwenker³, Ram Sarkar²

Affiliations

¹ Department of Metallurgical and Material Engineering, Jadavpur University, Kolkata 700032, India.
² Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India.
³ Institute of Neural Information Processing, Ulm University, 89081 Ulm, Germany.

PMID: 34071029
PMCID: PMC8197148
DOI: 10.3390/s21113628

Soumya Deep Roy et al. Sensors (Basel). 2021 May 23;21(11):3628. doi: 10.3390/s21113628.

Abstract

Breast cancer, like most forms of cancer, is a fatal disease that claims more than half a million lives every year. In 2020, breast cancer overtook lung cancer as the most commonly diagnosed form of cancer. Although the disease is extremely deadly, survival rate and longevity increase substantially with early detection and diagnosis. The treatment protocol also varies with the stage of the cancer. Diagnosis is typically performed on histopathological slides, from which it is possible to determine whether the tissue is in the Ductal Carcinoma In Situ (DCIS) stage, in which the cancerous cells have not spread into the surrounding breast tissue, or in the Invasive Ductal Carcinoma (IDC) stage, in which the cells have penetrated the neighboring tissue. IDC detection is extremely time-consuming and challenging for physicians. It can therefore be modeled as an image classification task in which pattern recognition and machine learning aid doctors and medical practitioners in making such crucial decisions. In the present paper, we use an IDC breast cancer dataset containing 277,524 images (78,786 IDC-positive and 198,738 IDC-negative) and classify the images as IDC(+) or IDC(-). To that end, we use feature extractors, including textural features such as SIFT, SURF and ORB, and statistical features such as Haralick texture features. The extracted features are combined to yield a dataset of 782 features. These features are ensembled by stacking using various machine learning classifiers (Random Forest, Extra Trees, XGBoost, AdaBoost, CatBoost and Multilayer Perceptron), followed by feature selection using the Pearson correlation coefficient, to yield a dataset with four features that are then used for classification. In our experiments, CatBoost yielded the highest accuracy (92.55%), which is on par with other state-of-the-art results, most of which employ deep learning architectures.
The source code is available in the GitHub repository.
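
The class counts quoted in the abstract imply an imbalanced dataset, which helps put the reported 92.55% accuracy in context. The short sketch below only redoes that arithmetic; the majority-class baseline it prints is derived from the stated counts and is not a figure reported by the authors, and the function name `class_fractions` is illustrative.

```python
# Image counts as stated in the abstract.
N_POSITIVE = 78_786    # IDC(+) images
N_NEGATIVE = 198_738   # IDC(-) images
N_TOTAL = 277_524


def class_fractions(n_pos, n_neg):
    """Return the share of positive and negative samples."""
    total = n_pos + n_neg
    return n_pos / total, n_neg / total


pos_frac, neg_frac = class_fractions(N_POSITIVE, N_NEGATIVE)

# A classifier that always predicts the majority class is correct
# exactly as often as that class occurs (derived value, not from the paper).
majority_baseline = max(pos_frac, neg_frac)

print(f"IDC(+) share:      {pos_frac:.2%}")
print(f"IDC(-) share:      {neg_frac:.2%}")
print(f"majority baseline: {majority_baseline:.2%}")
```

Roughly 28% of the images are IDC(+), so always predicting IDC(-) would already be right about 72% of the time. Accuracy on this dataset is best read against that floor.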

Keywords: IDC; breast cancer; ensemble learning; feature selection; machine learning.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Figure 1**
(a) First stage: each classifier is trained on the training set and then used to predict the outcome of the validation set as well as the test set. (b) Second stage: the validation predictions of the different classifiers are stacked to generate the new training features, and their test predictions are stacked to generate the new test features. The classifier is then evaluated using these new training and test features. (c) Our model, in which the second stage is modified so that feature selection is performed on the new training and test features before model training and testing.
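
The two stages in panels (a) and (b) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: `CentroidClassifier`, `make_data` and `stack` are hypothetical helpers, and the toy centroid classifier merely stands in for the real base classifiers named in the abstract.

```python
import numpy as np


class CentroidClassifier:
    """Toy stand-in for a real classifier (hypothetical, not from the paper).

    Scores a sample by how much closer it lies to the class-1 centroid
    than to the class-0 centroid, using only the columns in `cols`.
    """

    def __init__(self, cols):
        self.cols = list(cols)

    def fit(self, X, y):
        Xs = X[:, self.cols]
        self.c0 = Xs[y == 0].mean(axis=0)
        self.c1 = Xs[y == 1].mean(axis=0)
        return self

    def predict_proba(self, X):
        Xs = X[:, self.cols]
        d0 = np.linalg.norm(Xs - self.c0, axis=1)
        d1 = np.linalg.norm(Xs - self.c1, axis=1)
        return d0 / (d0 + d1 + 1e-12)  # near 1 when close to class 1

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)


def make_data(rng, n, n_features=6):
    """Synthetic two-class data whose class means differ by 1 per feature."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, n_features)) + y[:, None]
    return X, y


def stack(base_models, X_train, y_train, X_val, X_test):
    """Fit each base model on the training set, then stack its predictions."""
    new_train, new_test = [], []
    for model in base_models:
        model.fit(X_train, y_train)                    # stage 1: train
        new_train.append(model.predict_proba(X_val))   # validation predictions
        new_test.append(model.predict_proba(X_test))   # test predictions
    # Stage 2: one column per base model.
    return np.column_stack(new_train), np.column_stack(new_test)


rng = np.random.default_rng(0)
X_train, y_train = make_data(rng, 600)
X_val, y_val = make_data(rng, 300)
X_test, y_test = make_data(rng, 300)

base_models = [CentroidClassifier([0, 1]),
               CentroidClassifier([2, 3]),
               CentroidClassifier([4, 5])]
new_train, new_test = stack(base_models, X_train, y_train, X_val, X_test)

# The meta-classifier is trained on the stacked validation predictions
# (labelled with y_val) and evaluated on the stacked test predictions.
meta = CentroidClassifier(range(new_train.shape[1])).fit(new_train, y_val)
accuracy = float((meta.predict(new_test) == y_test).mean())
print(f"stacked test accuracy: {accuracy:.3f}")
```

Note that the meta-classifier never sees predictions made on the data the base models were trained on; using a separate validation set for the new training features is what keeps the second stage from simply memorising overfit first-stage outputs.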

**Figure 2**
Pipeline of the proposed model used for breast cancer detection from histology images. We start with the histopathological image, from which we extract 256 SIFT, 256 SURF, 256 ORB and 14 Haralick features. The 782 (= 256 + 256 + 256 + 14) features are then combined and ensembled by stacking. To weed out redundant features, we use Pearson's Correlation Coefficient. This is followed by model training and testing, which eventually classifies each image as IDC(-) or IDC(+).
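
The feature-combination step in this pipeline amounts to concatenating four per-image vectors into one row of 782 values. The sketch below shows only that bookkeeping, with random placeholder vectors in place of real extractor output; how each descriptor is reduced to a fixed length of 256 is not described in this caption, and `combine_features` is a hypothetical helper.

```python
import numpy as np

# Per-image feature lengths as given in the Figure 2 caption.
FEATURE_LENGTHS = {"sift": 256, "surf": 256, "orb": 256, "haralick": 14}


def combine_features(parts):
    """Concatenate per-extractor vectors into one feature row, in a fixed order."""
    chunks = []
    for name, length in FEATURE_LENGTHS.items():
        vec = np.asarray(parts[name], dtype=float).ravel()
        if vec.shape[0] != length:
            raise ValueError(
                f"{name}: expected {length} values, got {vec.shape[0]}")
        chunks.append(vec)
    return np.concatenate(chunks)


# Placeholder vectors standing in for real extractor output (assumption).
rng = np.random.default_rng(0)
parts = {name: rng.random(n) for name, n in FEATURE_LENGTHS.items()}

row = combine_features(parts)
print(row.shape)  # (782,)
```

Checking each length before concatenating makes a silently misaligned feature matrix less likely when one extractor returns fewer values than expected for an image.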

**Figure 3**
Sample images of the present dataset. (a,c,e) IDC(+) patches. (b,d,f) IDC(-) patches.

**Figure 4**
Pearson Correlation between features after stacking.
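
Stacked features are predictions of classifiers trained on the same data, so they tend to be strongly correlated, which is what a correlation map of this kind makes visible. One common way to use it for feature selection is sketched below: greedily drop any feature whose absolute Pearson correlation with an already kept feature exceeds a threshold. The greedy rule and the 0.95 threshold are assumptions for illustration; the exact selection criterion used in the paper is not stated here.

```python
import numpy as np


def select_uncorrelated(X, threshold=0.95):
    """Greedily keep columns not too correlated with those already kept.

    `threshold` is an illustrative value, not one taken from the paper.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature |r|
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)

# Columns 1 and 3 are exact linear functions of column 0; column 2 is independent.
X = np.column_stack([a, 2.0 * a + 1.0, b, -a])

kept = select_uncorrelated(X)
print(kept)  # [0, 2]
```

The selected column indices must be computed on the new training features only and then applied unchanged to the new test features, so that the test set does not influence which features are kept.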

**Figure 5**
The Accuracy–Training Set Size Curve for the CatBoost (CB) classifier. The curve plots the test accuracy for different training set sizes, with the test set kept unaltered.
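
A curve of this kind can be produced by retraining on growing subsets of the training data while scoring every model on the same fixed test set. The sketch below assumes synthetic data and uses a nearest-centroid classifier as a toy stand-in for the CB classifier; the helper names and the chosen sizes are illustrative.

```python
import numpy as np


def centroid_fit_predict(X_train, y_train, X_test):
    """Nearest-centroid classifier, a toy stand-in for the CB classifier."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)


def accuracy_vs_train_size(X_train, y_train, X_test, y_test, sizes):
    """Test accuracy for growing training subsets; the test set never changes."""
    accuracies = []
    for n in sizes:
        pred = centroid_fit_predict(X_train[:n], y_train[:n], X_test)
        accuracies.append(float((pred == y_test).mean()))
    return accuracies


def make_data(rng, n, n_features=4):
    """Synthetic two-class data whose class means differ by 1 per feature."""
    y = rng.integers(0, 2, size=n)
    return rng.normal(size=(n, n_features)) + y[:, None], y


rng = np.random.default_rng(0)
X_train, y_train = make_data(rng, 1000)
X_test, y_test = make_data(rng, 500)

sizes = [20, 50, 100, 500, 1000]
curve = accuracy_vs_train_size(X_train, y_train, X_test, y_test, sizes)
for n, acc in zip(sizes, curve):
    print(f"train size {n:5d}: test accuracy {acc:.3f}")
```

Keeping the test set fixed means that differences along the curve reflect the amount of training data rather than changes in what is being measured.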

**Figure 6**
Patches that are misclassified by the proposed method. The first column (a,c,e,g) shows false positive errors. The second column (b,d,f,h) shows false negative errors.
