Front Oncol. 2022 Apr 21;12:832567.
doi: 10.3389/fonc.2022.832567. eCollection 2022.

A Machine Learning Method to Trace Cancer Primary Lesion Using Microarray-Based Gene Expression Data

Qingfeng Lu et al. Front Oncol. 2022.

Abstract

Cancer of unknown primary site (CUP) is a heterogeneous group of cancers whose tissue of origin remains unknown after detailed investigation by conventional clinical methods. CUP accounts for roughly 3%-5% of all human malignancies. CUP patients are usually treated with broad-spectrum chemotherapy, which often leads to a poor prognosis. Recent studies suggest that treatment targeting the primary lesion of CUP significantly improves patient prognosis. An efficient method that accurately identifies the tissue of origin of CUP is therefore urgently needed in clinical cancer research. In this work, we developed a novel framework that uses Extreme Gradient Boosting (XGBoost) to trace the primary site of CUP from microarray-based gene expression data. First, we downloaded the microarray-based gene expression profiles of 59,385 genes for 57,08 samples from The Cancer Genome Atlas (TCGA) and 6,364 genes for 3,101 samples from the Gene Expression Omnibus (GEO). Each dataset was divided into training and independent testing data at a ratio of 4:1. Then, from the training data, we selected 200 and 290 genes for the TCGA and GEO datasets, respectively, to train XGBoost models for identifying the primary site of CUP. The overall 5-fold cross-validation accuracies of our method were 96.9% and 95.3% on the TCGA and GEO training datasets, respectively. Meanwhile, the macro-precision on the independent testing data reached 96.75% and 98.8% on TCGA and GEO, respectively. These experimental results demonstrate that the XGBoost framework can reduce the cost of tracing the primary site of a cancer while remaining highly efficient, which might make it useful in clinical practice.

Keywords: XGBoost; cancer of the unknown primary site; gene expression; gene selection; human malignancies.
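
The abstract reports a 4:1 split into training and independent testing data and a macro-precision score on the held-out part. The following is a minimal sketch of those two steps only, written with the Python standard library. It is not the authors' code: the function names `stratified_split` and `macro_precision`, the per-class (stratified) way of splitting, the random seed, and the toy labels are all assumptions made for illustration, and the gene selection and XGBoost training steps are omitted.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), holding out about test_fraction of each class.

    test_fraction=0.2 corresponds to a 4:1 train/test ratio. Splitting per
    class is an assumption; the abstract only states the ratio.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        # Hold out at least one sample per class so every class is tested.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def macro_precision(y_true, y_pred):
    """Unweighted mean over classes of precision = TP / (TP + FP).

    A class that is never predicted contributes a precision of 0.
    """
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        predicted = sum(1 for p in y_pred if p == c)
        per_class.append(tp / predicted if predicted else 0.0)
    return sum(per_class) / len(per_class)


# Toy tissue-of-origin labels (hypothetical, not taken from TCGA or GEO).
labels = ["lung"] * 10 + ["breast"] * 5 + ["colon"] * 5
train_idx, test_idx = stratified_split(labels)

# Toy predictions for a three-class problem.
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
score = macro_precision(y_true, y_pred)
```

Because macro-precision averages the per-class values without weighting by class size, a rare cancer type counts as much as a common one, which is relevant when the number of samples per cancer type is uneven.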

Conflict of interest statement

The authors GT and QYL are employed by Genesis (Beijing) Co. Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
A computational framework to detect the primary site of cancer with an unknown primary lesion.
Figure 2
The number of samples for each cancer type. (A) T dataset. (B) G dataset.
Figure 3
Performance of the model with top x genes in 5-fold cross-validation. (A) T dataset and (B) G dataset.
Figure 4
Expression of selected genes in individual cancer types. (A) T dataset. (B) G dataset.
Figure 5
Comparison of machine learning models for independent testing on the (A) T and (B) G datasets.
Figure 6
ROC curves and AUC of the XGBoost model for each cancer type on the test datasets. ROC, receiver operating characteristic; AUC, area under the receiver operating characteristic curve; XGBoost, Extreme Gradient Boosting.
Figure 7
GO and KEGG enrichment analyses of the 200 genes on the T dataset (A) and 390 genes on the G dataset (B). GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.
Figure 8
Protein–protein interaction network. The MCODE algorithm was then applied to this network to identify neighborhoods where proteins are densely connected. Each MCODE network is assigned a unique color. The GO enrichment analysis was applied to each MCODE network to assign “meaning” to the network component. GO, Gene Ontology.
