Front Cell Dev Biol. 2021 May 3:9:619330.
doi: 10.3389/fcell.2021.619330. eCollection 2021.

Evaluating DNA Methylation, Gene Expression, Somatic Mutation, and Their Combinations in Inferring Tumor Tissue-of-Origin

Haiyan Liu et al. Front Cell Dev Biol.

Abstract

Carcinoma of unknown primary (CUP) is a type of metastatic cancer whose primary tumor site cannot be identified. CUP accounts for approximately 5% of cancer cases in the United States and usually carries an unfavorable prognosis, making it a serious threat to public health. Traditional methods for identifying the tissue-of-origin (TOO) of CUP, such as immunohistochemistry, can resolve only around 20% of CUP patients. In recent years, a growing number of studies have suggested that the problem can be addressed by integrating machine learning techniques with large biomedical datasets covering multiple types of biomarkers, including epigenetic (such as DNA methylation), genetic, and gene expression profiles. Different biomarkers play different roles in cancer research: for example, genomic mutations in a patient's tumor can point to specific anticancer drugs for treatment, while DNA methylation and copy number variation can reveal tumor tissue of origin and molecular classification. However, there has been no systematic comparison of which biomarker is better at identifying the cancer type and site of origin. It may also be possible to further improve inference accuracy by integrating multiple types of biomarkers. In this study, we used primary tumor data rather than metastatic tumor data. Although the use of primary tumors may introduce some bias into our classification model, their tissues of origin are known. Moreover, previous studies have suggested that CUP prediction models built from primary tumors can efficiently predict the TOO of metastatic cancers (Lal et al., 2013; Brachtel et al., 2016). We systematically compared the performance of three types of biomarkers (DNA methylation, gene expression profile, and somatic mutation), as well as their combinations, in inferring the TOO of CUP patients.
First, we downloaded the gene expression profile, somatic mutation, and DNA methylation data of 7,224 tumor samples across 21 common cancer types from The Cancer Genome Atlas (TCGA) and generated seven different feature matrices from the individual biomarkers and their combinations. Second, we performed feature selection using the Pearson correlation method. The selected features for each matrix were used to build an XGBoost multi-class classification model to infer cancer TOO; XGBoost is an algorithm that has proven effective in several previous studies. The performance of each biomarker and combination was compared using 10-fold cross-validation. Our results showed that TOO tracing accuracy was highest using gene expression profiles, followed by DNA methylation, while somatic mutation performed the worst. We also found that simply combining multiple biomarkers did little to improve prediction accuracy.
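
The first two steps described above (building the seven feature matrices and ranking features by Pearson correlation) can be illustrated with a small, dependency-free sketch. It is not the authors' code. The abstract does not say how the correlation is computed against a 21-class label, so the one-vs-rest indicator and the use of the absolute correlation are assumptions, as are the function names and the toy data. Model training with XGBoost and the 10-fold cross-validation are left out.

```python
from itertools import combinations
from math import sqrt

# Abbreviations follow the Figure 3 caption: exp, meth, snp.
BIOMARKERS = ("exp", "meth", "snp")


def biomarker_combinations(names=BIOMARKERS):
    """Return every non-empty subset of the biomarkers (2**3 - 1 = 7)."""
    return [c for r in range(1, len(names) + 1) for c in combinations(names, r)]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def top_features_per_class(rows, labels, k):
    """For each class, keep the k feature columns with the largest |r|.

    ASSUMPTION: each class is scored against a one-vs-rest 0/1 indicator.
    """
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    selected = {}
    for cls in sorted(set(labels)):
        indicator = [1.0 if lab == cls else 0.0 for lab in labels]
        scores = [abs(pearson(col, indicator)) for col in columns]
        ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
        selected[cls] = ranked[:k]
    return selected


# Toy data (hypothetical): 4 samples, 3 features, 2 tissue labels.
rows = [
    [5.0, 0.1, 1.0],
    [4.8, 0.2, 0.0],
    [0.3, 3.9, 1.0],
    [0.2, 4.1, 0.0],
]
labels = ["lung", "lung", "liver", "liver"]

combos = biomarker_combinations()  # seven feature-matrix definitions
selected = top_features_per_class(rows, labels, k=2)
```

In this toy example the third feature is uncorrelated with either label, so it is never selected. In the study the same idea would be applied with k = 14 genes per cancer type, as stated in the figure captions.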

Keywords: DNA methylation; gene expression; multi-classifier XGBoost; Pearson correlation algorithm; somatic mutation; tumor tissue-of-origin.

Conflict of interest statement

BW, GT, and JY were employed by the company Genesis Beijing Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Flow diagram of cancer tissue-of-origin prediction and performance evaluation. The seven feature matrices are: gene expression; somatic mutation; DNA methylation; gene expression plus somatic mutation; gene expression plus DNA methylation; DNA methylation plus somatic mutation; and all three biomarkers combined.
FIGURE 2
Classification accuracy using gene expression, somatic mutation, DNA methylation, and the combination of all three biomarkers on each gene set.
FIGURE 3
Classification precision, recall, and F1 score for each biomarker combination on the 14 top-ranked genes for each cancer type. exp, gene expression profiling; meth, DNA methylation; snp, somatic mutation.
FIGURE 4
Precision of the XGBoost classifier using gene expression data on the 14 top-ranked genes for each cancer type. Precisions from the 10 cross-validation runs were averaged.
FIGURE 5
GO and KEGG analysis. (A) Significantly enriched GO cellular components, biological processes, and molecular functions for the 14 top-ranked genes of each cancer type in the gene expression data. (B) Significantly enriched KEGG pathways for the 14 top-ranked genes of each cancer type in the gene expression data. The dot plots show the number of signature genes identified by enrichment analysis for each cellular component, biological process, molecular function, and KEGG pathway. Dot size represents the number of genes enriched in a given term or pathway, and dot color represents the adjusted enrichment p-value. (C) tSNE visualization of all samples for the 21 tumor types. The x- and y-axes represent the first and second tSNE dimensions, respectively.
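
The per-cancer-type precision, recall, and F1 score shown in Figures 3 and 4 follow the usual one-vs-rest definitions. The sketch below shows how they can be computed from true and predicted labels and then averaged across cross-validation runs. The function names and toy labels are hypothetical, and whether the authors used these exact conventions (for example, returning 0.0 for a class with no predictions) is an assumption.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)} from one-vs-rest counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but the sample belongs elsewhere
            fn[truth] += 1  # a sample of this class was missed
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        predicted = tp[cls] + fp[cls]
        actual = tp[cls] + fn[cls]
        # ASSUMPTION: undefined ratios are reported as 0.0.
        precision = tp[cls] / predicted if predicted else 0.0
        recall = tp[cls] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[cls] = (precision, recall, f1)
    return metrics


def mean_over_runs(values):
    """Average one metric across cross-validation runs."""
    return sum(values) / len(values)


# Toy labels (hypothetical): five samples, two tissue types.
y_true = ["breast", "breast", "lung", "lung", "lung"]
y_pred = ["breast", "lung", "lung", "lung", "breast"]
metrics = per_class_metrics(y_true, y_pred)
```

Here each class has one false positive and one false negative, so precision and recall coincide within each class: 1/2 for breast and 2/3 for lung.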

References

Alhassan Mohammed H., Saboor-Yaraghi A. A., Vahedi H., Panahi G., Hemmasi G., Yekaninejad M. S., et al. (2018). Immunotherapeutic Effects of β-D mannuronic acid on IL-4, GATA3, IL-17 and RORC gene expression in the pbmc of patients with inflammatory bowel diseases. Iran J. Allergy Asthma Immunol. 17, 308–317. doi: 10.18502/ijaai.v17i4.90
Bender R. A., Erlander M. G. (2009). Molecular classification of unknown primary cancer. Semin Oncol. 36, 38–43. doi: 10.1053/j.seminoncol.2008.10.002
Brachtel E. F., Operaña T. N., Sullivan P. S., Kerr S. E., Schnabel C. A. (2016). Molecular classification of cancer with the 92-gene assay in cytology and limited tissue samples. Oncotarget 7, 27220–27231. doi: 10.18632/oncotarget.8449
Christophi G. P., Rong R., Holtzapple P. G., Massa P. T., Landas S. K. (2012). Immune markers and differential signaling networks in ulcerative colitis and Crohn’s disease. Inflamm. Bowel Dis. 18, 2342–2356. doi: 10.1002/ibd.22957
Conway A. M., Mitchell C., Kilgour E., Brady G., Dive C., Cook N. (2019). Molecular characterisation and liquid biomarkers in Carcinoma of Unknown Primary (CUP): taking the ‘U’ out of ‘CUP’. Br. J. Cancer 120, 141–153. doi: 10.1038/s41416-018-0332-2