Sensors (Basel). 2021 May 23;21(11):3628.
doi: 10.3390/s21113628.

Computer Aided Breast Cancer Detection Using Ensembling of Texture and Statistical Image Features

Soumya Deep Roy et al. Sensors (Basel). 2021.

Abstract

Breast cancer, like most forms of cancer, is a fatal disease that claims more than half a million lives every year. In 2020, breast cancer overtook lung cancer as the most commonly diagnosed form of cancer. Although the disease is extremely deadly, survival rate and longevity increase substantially with early detection and diagnosis. The treatment protocol also varies with the stage of the cancer. Diagnosis is typically done using histopathological slides, from which it is possible to determine whether the tissue is in the Ductal Carcinoma In Situ (DCIS) stage, in which the cancerous cells have not spread into the surrounding breast tissue, or in the Invasive Ductal Carcinoma (IDC) stage, in which the cells have penetrated the neighboring tissues. IDC detection is extremely time-consuming and challenging for physicians. It can therefore be modeled as an image classification task in which pattern recognition and machine learning aid doctors and medical practitioners in making such crucial decisions. In the present paper, we use an IDC Breast Cancer dataset that contains 277,524 images (78,786 IDC-positive and 198,738 IDC-negative) to classify the images into IDC(+) and IDC(-). To that end, we use feature extractors, including textural features, such as SIFT, SURF and ORB, and statistical features, such as Haralick texture features. These features are combined to yield a dataset of 782 features. They are then ensembled by stacking using various Machine Learning classifiers, such as Random Forest, Extra Trees, XGBoost, AdaBoost, CatBoost and Multi-Layer Perceptron, followed by feature selection using the Pearson Correlation Coefficient to yield a dataset with four features that are then used for classification. In our experiments, CatBoost yielded the highest accuracy (92.55%), which is on par with other state-of-the-art results, most of which employ Deep Learning architectures.
The source code is available in the GitHub repository.
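
As an illustration of the correlation-based feature selection step mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a greedy rule in which a stacked feature is dropped when its absolute Pearson correlation with an already-kept feature reaches a threshold; the threshold value, the function names (`pearson`, `select_uncorrelated`) and the toy column names are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention used here: a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)

def select_uncorrelated(columns, threshold=0.9):
    """Greedily keep columns whose |r| with every kept column is below threshold.

    `columns` maps a (hypothetical) feature name to its list of values.
    The selection rule and threshold are assumptions for illustration only.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy stacked features: each column stands for one classifier's predicted score.
stacked = {
    "rf":  [0.1, 0.9, 0.8, 0.2, 0.7],
    "et":  [0.2, 1.0, 0.9, 0.3, 0.8],  # a shifted copy of "rf", so it is redundant
    "mlp": [0.9, 0.4, 0.1, 0.6, 0.5],
}
selected = select_uncorrelated(stacked, threshold=0.9)
print(selected)
```

In this toy example the `et` column is a shifted copy of `rf`, so it is discarded as redundant, while `mlp` is kept because its correlation with `rf` is weaker than the threshold.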

Keywords: IDC; breast cancer; ensemble learning; feature selection; machine learning.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
(a) The classifier is trained on the training set. The trained classifier is then used to predict the outcomes of the validation set and the test set. (b) The second stage, in which the validation predictions of the different classifiers are stacked to generate the new training features, while the test predictions are stacked to generate the new test features. The classifier is then evaluated using these new training and test features. (c) Our model, in which the second stage is modified so that feature selection is performed on the new training and test features before model training and testing.
Figure 2
Pipeline of the proposed model used for breast cancer detection from histology images. We start with the histopathological image, from which we extract 256 SIFT, 256 SURF, 256 ORB and 14 Haralick features. The resulting 782 (= 256 + 256 + 256 + 14) features are combined and then ensembled by stacking. To weed out redundant features, we use Pearson's Correlation Coefficient. This is followed by model training and testing, which classifies the images as IDC(-) or IDC(+).
Figure 3
Sample images of the present dataset. (a,c,e) IDC(+) patches. (b,d,f) IDC(-) patches.
Figure 4
Pearson Correlation between features after stacking.
Figure 5
The Accuracy–Training Set Size Curve for the CatBoost (CB) classifier. The curve plots the test accuracy for different training set sizes, keeping the test set unaltered.
Figure 6
Patches that are misclassified by the proposed method. The first column (a,c,e,g) shows false positive errors. The second column (b,d,f,h) shows false negative errors.
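
The two-stage stacking procedure described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch only: the base learners here are hypothetical single-feature threshold classifiers standing in for the Random Forest, XGBoost and other models named in the abstract, and the function names (`make_threshold_classifier`, `stack_predictions`) and toy data are assumptions, not taken from the paper's code.

```python
def make_threshold_classifier(feature_index):
    """Hypothetical stand-in for a base learner: thresholds one feature at the
    midpoint between the two class means observed during training."""
    def fit(X, y):
        pos = [row[feature_index] for row, label in zip(X, y) if label == 1]
        neg = [row[feature_index] for row, label in zip(X, y) if label == 0]
        pos_mean = sum(pos) / len(pos)
        neg_mean = sum(neg) / len(neg)
        cut = (pos_mean + neg_mean) / 2
        sign = 1 if pos_mean >= neg_mean else -1  # which side of the cut is class 1

        def predict(X_new):
            return [1 if sign * (row[feature_index] - cut) > 0 else 0
                    for row in X_new]
        return predict
    return fit

def stack_predictions(fitters, X_train, y_train, X_val, X_test):
    """Stages (a) and (b): train each base learner on the training set, then
    stack its validation predictions into the new training features and its
    test predictions into the new test features."""
    new_train_cols, new_test_cols = [], []
    for fit in fitters:
        predict = fit(X_train, y_train)
        new_train_cols.append(predict(X_val))
        new_test_cols.append(predict(X_test))
    # Transpose: one row per sample, one column per base learner.
    new_train = [list(row) for row in zip(*new_train_cols)]
    new_test = [list(row) for row in zip(*new_test_cols)]
    return new_train, new_test

# Toy data with two features per sample (values are made up for illustration).
X_train = [[0.1, 5.0], [0.2, 4.0], [0.8, 1.0], [0.9, 2.0]]
y_train = [0, 0, 1, 1]
X_val = [[0.3, 4.5], [0.7, 3.5], [0.6, 1.5]]
X_test = [[0.4, 2.5], [0.95, 0.5]]

fitters = [make_threshold_classifier(0), make_threshold_classifier(1)]
new_train, new_test = stack_predictions(fitters, X_train, y_train, X_val, X_test)
print(new_train, new_test)
```

A second-stage classifier would then be trained on `new_train` with the validation labels and evaluated on `new_test`. In variant (c) of the figure, a feature selection step is applied to these stacked features before that second-stage training.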
