Int J Mol Sci. 2019 Jan 12;20(2):296.
doi: 10.3390/ijms20020296.

High-Throughput Omics and Statistical Learning Integration for the Discovery and Validation of Novel Diagnostic Signatures in Colorectal Cancer

Nguyen Phuoc Long et al. Int J Mol Sci. 2019.

Abstract

Advances in bioinformatics and machine learning have facilitated the discovery and validation of omics-based biomarkers. This study combined multi-platform transcriptomics with statistical learning algorithms to introduce novel signatures for the accurate diagnosis of colorectal cancer (CRC). Three random forests (RF)-based feature selection methods, namely the area under the curve (AUC)-RF, Boruta, and Vita, were used, and the diagnostic performance of the proposed biosignatures was benchmarked using RF, logistic regression, naïve Bayes, and k-nearest neighbors models. All models performed satisfactorily, with RF appearing to be the best. For the RF model, the following were observed: mean accuracy 0.998 (standard deviation (SD) < 0.003), mean specificity 0.999 (SD < 0.003), and mean sensitivity 0.998 (SD < 0.004). Moreover, the proposed biomarker signatures were highly associated with multifaceted hallmarks of cancer. Some biomarkers were enriched in epithelial cell signaling in Helicobacter pylori infection and in inflammatory processes. Overexpression of TGFBI and S100A2 was associated with poor disease-free survival, while down-regulation of NR5A2, SLC4A4, and CD177 was linked to worse overall survival. In conclusion, novel transcriptome signatures to improve diagnostic accuracy in CRC are introduced for further validation in various clinical settings.
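The accuracy, sensitivity, and specificity figures quoted in the abstract are summaries (mean and SD) over repeated evaluations. The sketch below shows one way such summaries could be computed from true and predicted labels. It is an illustration only, not the authors' pipeline: the function names, the label coding (1 = cancer, 0 = non-cancerous), and the toy labels are all assumptions, and the toy numbers are unrelated to the reported results.

```python
from statistics import mean, stdev


def diagnostic_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and specificity for one evaluation run.

    Assumes both classes are present in y_true (otherwise a division
    by zero occurs) and that `positive` marks the cancer class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def summarize_runs(runs):
    """Mean and sample SD of each metric over several (y_true, y_pred) runs."""
    per_run = [diagnostic_metrics(t, p) for t, p in runs]
    summary = {}
    for name in ("accuracy", "sensitivity", "specificity"):
        values = [m[name] for m in per_run]
        summary[name] = (mean(values), stdev(values))
    return summary


# Hypothetical toy labels for two runs; not data from the study.
example_runs = [
    ([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0]),
    ([1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 0, 1]),
]

if __name__ == "__main__":
    for metric, (avg, sd) in summarize_runs(example_runs).items():
        print(f"{metric}: mean={avg:.3f}, SD={sd:.3f}")
```

Reporting the SD alongside the mean, as the abstract does, indicates how stable each metric is across runs rather than only its average level.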

Keywords: biomarker; colorectal cancer; diagnosis; machine learning; transcriptomics; variable selection.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Workflow of the biomarker candidate selection. (a) The process of selecting and validating diagnostic candidates with three different variable selection algorithms. (b) The Venn diagram demonstrating the relationships of selected biomarker candidates among three methods. VarSel: Variable selection; RF: Random Forest; AUCRF: the area under the curve (AUC)-RF; nBayes: naïve Bayes; logit: logistic regression; kNN: k-nearest neighbors.
Figure 2
Data exploration of three sets of biomarker candidates. (a) Principal component analysis of three sets of biomarker candidates between cancer samples and non-cancerous samples. (b) Heatmap analysis of three sets of biomarker candidates between cancer samples and non-cancerous samples. TCGA-READ: normal rectum, TCGA-COAD: normal colon, TCGA-T-READ: rectum adenocarcinoma, TCGA-T-COAD: colon adenocarcinoma, GTEx: normal colon and rectum.
Figure 3
Correlation analysis of biomarker candidates in cancer samples and non-cancerous samples. (a) Correlation network of biomarkers in cancer samples and non-cancerous samples. Blurred edges in the network are those with a correlation strength (in absolute value) below the cut-off value of 0.7. Blue indicates positive correlations, while red indicates negative correlations. (b) Correlation matrix of biomarkers in cancer samples and non-cancerous samples.
Figure 4
Performance metrics of classification models and variable importance scores from three tested signatures. (a) Accuracy, sensitivity, and specificity of various machine learning classification models. (b) Top 10 most important candidates of the random forests models.
Figure 5
Overall survival and disease-free survival analysis of TGFBI, S100A2, NR5A2, SLC4A4, and CD177.
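
The Figure 3 caption describes a correlation network in which edges are classed by sign and by whether the absolute correlation reaches a cut-off of 0.7. A minimal sketch of that thresholding rule follows. The use of the Pearson coefficient is an assumption (the caption does not name the correlation measure), and the function names, gene labels, and expression values are hypothetical placeholders.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_edges(expression, cutoff=0.7):
    """Classify every feature pair by correlation sign and strength.

    `expression` maps a feature name to its values across samples.
    An edge is "strong" when |r| >= cutoff; weaker edges correspond to
    the blurred edges described in the caption.
    """
    edges = []
    for a, b in combinations(sorted(expression), 2):
        r = pearson(expression[a], expression[b])
        edges.append({
            "pair": (a, b),
            "r": r,
            "sign": "positive" if r >= 0 else "negative",
            "strong": abs(r) >= cutoff,
        })
    return edges


# Hypothetical placeholder features and values; not data from the study.
toy_expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
    "gene_d": [1.0, 3.0, 2.0, 3.0],
}

if __name__ == "__main__":
    for edge in correlation_edges(toy_expression):
        print(edge["pair"], round(edge["r"], 3), edge["sign"], edge["strong"])
```

Applying the cut-off to the absolute value keeps strong negative correlations in the network alongside strong positive ones, which matches the caption's two-color scheme.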
