Review
BMC Bioinformatics. 2023 Mar 23;24(1):112.
doi: 10.1186/s12859-023-05235-x.

A review and comparative study of cancer detection using machine learning: SBERT and SimCSE application

Mpho Mokoatle et al. BMC Bioinformatics.

Abstract

Background: Pretrained convolutional neural networks and conventional machine learning methods have been heavily employed to identify various malignancies, using visual, biological, or electronic health record data as the sole input source. Typically, a series of preprocessing and image segmentation steps is first performed to extract region-of-interest features from noisy features. The extracted features are then passed to several machine learning and deep learning methods for the detection of cancer.

Methods: This work reviews the methods that have been applied to develop machine learning algorithms that detect cancer. Although there are more than 100 types of cancer, this study examines only research on the four most common and prevalent cancers worldwide: lung, breast, prostate, and colorectal cancer. Next, using two state-of-the-art sentence transformers, SBERT (2019) and the unsupervised SimCSE (2021), this study proposes a new methodology for detecting cancer. The method requires raw DNA sequences of matched tumor/normal pairs as the only input. The learnt DNA representations retrieved from SBERT and SimCSE are then passed to machine learning algorithms (XGBoost, Random Forest, LightGBM, and CNNs) for classification. As far as we are aware, SBERT and SimCSE transformers have not previously been applied to represent DNA sequences in cancer detection settings.
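As a rough illustration of the first step in such a pipeline, raw DNA has to be turned into text that a sentence encoder can read. The sketch below assumes overlapping k-mers joined by spaces; the abstract does not state how the sequences were tokenized, so the function names, the default `k = 6`, and the labelling convention (1 for tumor, 0 for normal) are hypothetical choices made for this example only.

```python
def kmer_sentence(sequence: str, k: int = 6, stride: int = 1) -> str:
    """Turn a raw DNA string into a space-separated 'sentence' of k-mers.

    Assumption: k-mer tokenization is used here only for illustration;
    it is not confirmed as the tokenization used in the study.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # Slide a window of width k over the sequence, moving `stride` bases each step.
    kmers = [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
    return " ".join(kmers)


def build_corpus(pairs, k: int = 6):
    """Build (sentences, labels) from matched (tumor_seq, normal_seq) pairs.

    Hypothetical labelling: 1 marks a tumor sequence, 0 its matched normal.
    """
    sentences, labels = [], []
    for tumor_seq, normal_seq in pairs:
        sentences.append(kmer_sentence(tumor_seq, k))
        labels.append(1)
        sentences.append(kmer_sentence(normal_seq, k))
        labels.append(0)
    return sentences, labels
```

The resulting strings could then be handed to a sentence encoder (for example the `encode` method of a `SentenceTransformer` model in the sentence-transformers library), and the embeddings to a classifier such as XGBoost. Which checkpoints and hyperparameters were used is not stated in the abstract.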

Results: The XGBoost model was the best-performing classifier, with the highest overall accuracy of 73 ± 0.13% using SBERT embeddings and 75 ± 0.12% using SimCSE embeddings. In light of these findings, it can be concluded that incorporating sentence representations from SimCSE's sentence transformer only marginally improved the performance of machine learning models.

Keywords: Cancer detection; DNA; Machine learning; SentenceBert; SimCSE.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1. Generalized machine learning framework for lung cancer prediction [33]
Fig. 2. Generalized machine learning framework for breast cancer prediction [45]
Fig. 3. Generalized machine learning framework for prostate cancer prediction using 3D CNNs, pooling layers, and a fully connected layer for classification [69]
Fig. 4. Using a deep CNN network to predict colorectal cancer outcome using images [86]
Fig. 5. SBERT architecture with classification objective function (left) and the regression objective function (right) [105]
Fig. 6. Unsupervised SimCSE (a) and supervised SimCSE (b) [110]
Fig. 7. Visualisation of the SBERT documents with k-means clustering
Fig. 8. Visualisation of the SimCSE documents with k-means clustering
Fig. 9. Confusion matrix of the LightGBM model using SBERT representations after SMOTE (dev set)
Fig. 10. Confusion matrix of the XGBoost model using SBERT representations after SMOTE (dev set)
Fig. 11. Confusion matrix of the LightGBM model using SimCSE representations after SMOTE (dev set)
Fig. 12. Confusion matrix of the Random Forest model using SimCSE representations after SMOTE (dev set)

References

    1. Jones PA, Baylin SB. The epigenomics of cancer. Cell. 2007;128(4):683–692. doi: 10.1016/j.cell.2007.01.029.
    2. What Is Cancer? National Cancer Institute. https://www.cancer.gov/about-cancer/understanding/what-is-cancer
    3. Zheng R, Sun K, Zhang S, Zeng H, Zou X, Chen R, Gu X, Wei W, He J. Report of cancer epidemiology in China, 2015. Zhonghua zhong liu za zhi. 2019;41(1):19–28.
    4. Hegde PS, Chen DS. Top 10 challenges in cancer immunotherapy. Immunity. 2020;52(1):17–35. doi: 10.1016/j.immuni.2019.12.011.
    5. Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J. 2015;13:8–17. doi: 10.1016/j.csbj.2014.11.005.