Machine learning with the TCGA-HNSC dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality
- PMID: 31208324
- PMCID: PMC6580485
- DOI: 10.1186/s12859-019-2929-8
Abstract
Background: In the era of precision oncology and publicly available datasets, the amount of information available for each patient case has increased dramatically. The data range from clinical variables and PET-CT radiomics measures to DNA-variant and RNA expression profiles, and this variety presents several challenges. Large clinical datasets often contain sparsely and/or inconsistently populated fields. The corresponding sequencing profiles can suffer from high dimensionality, where useful inferences are difficult to make without a correspondingly large number of instances. In this paper we report a novel deployment of machine learning techniques to handle data sparsity and high dimensionality, while evaluating potential biomarkers in the form of unsupervised transformations of RNA data. We apply preprocessing, multiple imputation by chained equations (MICE), and sparse principal component analysis (SPCA) to improve the usability of more than 500 patient cases from the TCGA-HNSC dataset, with the aim of enhancing future oncological decision support for Head and Neck Squamous Cell Carcinoma (HNSCC).
Results: Imputation was shown to improve the prognostic ability of sparse clinical treatment variables. SPCA transformation of RNA expression variables reduced runtime for RNA-based models, though changes to classifier performance were not significant. Gene ontology enrichment analysis of the gene sets associated with individual sparse principal components (SPCs) is also reported. Both high- and low-importance SPCs were associated with cell death pathways, though the high-importance gene sets were associated with a wider variety of cancer-related biological processes.
Conclusions: MICE imputation allowed us to impute missing values for clinically informative features, improving their overall importance for predicting two-year recurrence-free survival by incorporating variance from other clinical variables. Dimensionality reduction of RNA expression profiles via SPCA reduced both computation cost and model training/evaluation time without affecting classifier performance, allowing researchers to obtain experimental results much more quickly. SPCA simultaneously provided a convenient avenue for consideration of biological context via gene ontology enrichment analysis.
Keywords: Decision support; Dimensionality reduction; Gene ontology enrichment analysis; Machine learning; Unsupervised transformation; HNSCC; TCGA.
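The two techniques named in the abstract, chained-equation imputation of sparse clinical fields and SPCA of RNA expression, can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes scikit-learn, uses `IterativeImputer` (a MICE-inspired imputer that returns a single imputation rather than multiple ones) and `SparsePCA`, and runs on synthetic data. The array names (`clinical`, `rna`), the matrix sizes, the 20% missingness rate, and the `alpha` value are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
n_patients = 60  # hypothetical cohort size, far smaller than TCGA-HNSC

# Synthetic stand-in for numeric clinical variables; column 1 is made to
# correlate with column 0 so that other variables carry usable signal.
clinical = rng.normal(size=(n_patients, 5))
clinical[:, 1] = 0.8 * clinical[:, 0] + 0.2 * clinical[:, 1]

# Knock out roughly 20% of the entries to mimic sparsely populated fields.
mask = rng.random(clinical.shape) < 0.2
clinical_missing = clinical.copy()
clinical_missing[mask] = np.nan

# Chained-equation style imputation: each feature with missing values is
# regressed on the others, cycling through features for several rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
clinical_imputed = imputer.fit_transform(clinical_missing)

# Synthetic stand-in for an RNA expression matrix (patients x genes),
# standardized per gene before decomposition.
rna = rng.normal(size=(n_patients, 40))
rna = (rna - rna.mean(axis=0)) / rna.std(axis=0)

# Sparse PCA: the L1 penalty (alpha) pushes many gene loadings to zero,
# so each component can be read as a small gene set.
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
rna_spcs = spca.fit_transform(rna)

# Indices of genes with nonzero loading on each sparse component; in a real
# analysis these sets could feed a gene ontology enrichment tool.
genes_per_spc = [np.flatnonzero(component) for component in spca.components_]
```

The reduced matrix `rna_spcs` would replace the full expression matrix as classifier input, which is where a runtime saving of the kind the abstract reports would come from. The per-component gene lists in `genes_per_spc` are what make an enrichment analysis possible.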
Conflict of interest statement
The authors declare that they have no competing interests.
Similar articles
- Screening key lncRNAs with diagnostic and prognostic value for head and neck squamous cell carcinoma based on machine learning and mRNA-lncRNA co-expression network analysis. Cancer Biomark. 2020;27(2):195-206. doi: 10.3233/CBM-190694. PMID: 31815689
- Identification of Potential Biomarkers and Survival Analysis for Head and Neck Squamous Cell Carcinoma Using Bioinformatics Strategy: A Study Based on TCGA and GEO Datasets. Biomed Res Int. 2019 Aug 7;2019:7376034. doi: 10.1155/2019/7376034. eCollection 2019. PMID: 31485443. Free PMC article.
- One-slice CT image based kernelized radiomics model for the prediction of low/mid-grade and high-grade HNSCC. Comput Med Imaging Graph. 2020 Mar;80:101675. doi: 10.1016/j.compmedimag.2019.101675. Epub 2019 Dec 23. PMID: 31945637
- An update on advanced dual-energy CT for head and neck cancer imaging. Expert Rev Anticancer Ther. 2019 Jul;19(7):633-644. doi: 10.1080/14737140.2019.1626234. Epub 2019 Jun 21. PMID: 31177872. Review.
- Gene Expression Signatures for Head and Neck Cancer Patient Stratification: Are Results Ready for Clinical Application? Curr Treat Options Oncol. 2017 May;18(5):32. doi: 10.1007/s11864-017-0472-2. PMID: 28474265. Review.
Cited by
- The Application of Deep Learning in Cancer Prognosis Prediction. Cancers (Basel). 2020 Mar 5;12(3):603. doi: 10.3390/cancers12030603. PMID: 32150991. Free PMC article. Review.
- Improved clinical data imputation via classical and quantum determinantal point processes. Elife. 2024 May 9;12:RP89947. doi: 10.7554/eLife.89947. PMID: 38722146. Free PMC article.
- An up-to-date overview of computational polypharmacology in modern drug discovery. Expert Opin Drug Discov. 2020 Sep;15(9):1025-1044. doi: 10.1080/17460441.2020.1767063. Epub 2020 May 26. PMID: 32452701. Free PMC article. Review.
- Ontologies and Knowledge Graphs in Oncology Research. Cancers (Basel). 2022 Apr 10;14(8):1906. doi: 10.3390/cancers14081906. PMID: 35454813. Free PMC article. Review.
- A Transcriptomic Analysis of Head and Neck Squamous Cell Carcinomas for Prognostic Indications. J Pers Med. 2021 Aug 11;11(8):782. doi: 10.3390/jpm11080782. PMID: 34442426. Free PMC article.