BMC Med Inform Decis Mak. 2024 Jun 20;24(Suppl 4):175.
doi: 10.1186/s12911-024-02578-0.

Robust and consistent biomarker candidates identification by a machine learning approach applied to pancreatic ductal adenocarcinoma metastasis

Tanakamol Mahawan et al. BMC Med Inform Decis Mak.

Abstract

Background: Machine learning (ML) plays a crucial role in biomedical research, yet it still faces limitations in data integration and reproducibility, and robust methods are needed to address them. Pancreatic ductal adenocarcinoma (PDAC), a highly aggressive cancer with low early-detection and survival rates, is used as a case study. PDAC lacks reliable diagnostic biomarkers, and metastatic biomarkers in particular remain an unmet need. In this study, we propose an ML-based approach for discovering disease biomarkers, apply it to the identification of a PDAC metastatic composite biomarker candidate, and demonstrate the advantages of harnessing data resources.

Methods: We used primary tumour RNAseq data from five public repositories, pooling samples to maximise statistical power and integrating the data by correcting for technical variance. Data were split into training and validation sets. The training set underwent variable selection via a 10-fold cross-validation process that combined three algorithms in 100 models per fold. Genes found in at least 80% of models and five folds were considered robust and were used to build a consensus multivariate model. A random forest model was constructed using the selected genes from the training set and tested in the validation set. We also assessed the goodness of prediction by recalibrating a model using only the validation data. The biological context and relevance of the signals were explored through enrichment and pathway analyses using QIAGEN Ingenuity Pathway Analysis and GeneMANIA.
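The consensus rule in the Methods can be illustrated with a minimal sketch. It assumes one plausible reading of the criterion: a gene counts for a fold when at least 80% of that fold's models selected it, and it is robust when this holds in at least five folds. The function name `consensus_genes`, the helper `toy_fold`, the input layout, and the gene labels are hypothetical and are not taken from the authors' GitHub pipeline.

```python
from collections import Counter
from typing import Dict, List, Set


def consensus_genes(
    selections: Dict[int, List[Set[str]]],
    model_fraction: float = 0.8,
    min_folds: int = 5,
) -> List[str]:
    """Return genes selected in >= model_fraction of models in >= min_folds folds.

    `selections` maps a fold index to a list of gene sets, one set per fitted
    model (hypothetical layout, chosen for illustration only).
    """
    folds_passed: Counter = Counter()
    for model_sets in selections.values():
        if not model_sets:
            continue  # an empty fold cannot support any gene
        # Count in how many models of this fold each gene was selected.
        counts = Counter(gene for genes in model_sets for gene in genes)
        threshold = model_fraction * len(model_sets)
        for gene, n_models in counts.items():
            if n_models >= threshold:
                folds_passed[gene] += 1
    return sorted(g for g, k in folds_passed.items() if k >= min_folds)


def toy_fold(n_models: int, hits: Dict[str, int]) -> List[Set[str]]:
    """Build synthetic per-model selections: gene g appears in hits[g] models."""
    return [{g for g, k in hits.items() if i < k} for i in range(n_models)]


# Synthetic data: 10 folds with 100 models each (placeholder gene labels).
selections = {}
for fold in range(10):
    hits = {
        "GENE_A": 100,                     # selected by every model in every fold
        "GENE_B": 90 if fold < 5 else 10,  # passes the 80% rule in 5 folds
        "GENE_C": 85 if fold < 4 else 0,   # passes the 80% rule in only 4 folds
    }
    selections[fold] = toy_fold(100, hits)

robust = consensus_genes(selections)
print(robust)
```

On this synthetic input, GENE_A and GENE_B meet both thresholds and GENE_C falls one fold short. Both thresholds are exposed as parameters so the sensitivity of the selected set to the cut-offs can be checked.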

Results: We developed a pipeline that detects robust signatures for building composite biomarkers. We tested the pipeline in PDAC, exploiting transcriptomics data from different sources, and propose a composite biomarker candidate composed of fifteen consistently selected genes that showed very promising predictive capability. Biological contextualisation revealed links with cancer progression and metastasis, underscoring the potential relevance of these genes. All code is available on GitHub.

Conclusion: This study establishes a robust framework for identifying composite biomarkers across various disease contexts. We demonstrate its potential by proposing a plausible composite biomarker candidate for PDAC metastasis. By reusing data from public repositories, we highlight the sustainability of our research and the wider applications of our pipeline. The preliminary findings shed light on a promising validation and application path.

Keywords: Biomarker identification; Machine Learning; Metastasis; PDAC; Pancreatic cancer.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1. Analysis workflow, including data pre-processing, variable selection, re-sampling technique, and classification model.
Fig. 2. Biological contextualisation using GeneMANIA and QIAGEN IPA: set 1 showed enrichment of immune-related pathways, set 2 highlighted ECM remodelling and metabolic reprogramming, and set 3 was enriched for angiogenesis.
Fig. 3. The complex and multifaceted interplay between immune response, extracellular matrix remodelling, metabolic reprogramming, and angiogenesis in cancer metastasis, including PDAC (biomarker candidates involved in biological processes are highlighted in yellow).
Fig. 4. Proposed in-silico biomarker discovery workflow and checkpoints to consider (methods implemented in the GitHub pipeline are bolded and underlined).
