Sci Rep. 2021 Jun 25;11(1):13323.
doi: 10.1038/s41598-021-92725-8.

Lung adenocarcinoma and lung squamous cell carcinoma cancer classification, biomarker identification, and gene expression analysis using overlapping feature selection methods

Joe W Chen et al. Sci Rep.

Abstract

Lung cancer is one of the deadliest cancers in the world. Two of the most common subtypes, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), have drastically different biological signatures, yet they are often treated similarly and classified together as non-small cell lung cancer (NSCLC). LUAD and LUSC biomarkers are scarce, and their distinct biological mechanisms have yet to be elucidated. To detect biologically relevant markers, many studies have attempted to improve traditional machine learning algorithms or develop novel algorithms for biomarker discovery. However, few have used overlapping machine learning or feature selection methods for cancer classification, biomarker identification, or gene expression analysis. This study proposes to use overlapping traditional feature selection or feature reduction techniques for cancer classification and biomarker discovery. The genes selected by the overlapping method were then verified using random forest. The classification statistics of the overlapping method were compared to those of the traditional feature selection methods. The identified biomarkers were validated in an external dataset using AUC and ROC analysis. Gene expression analysis was then performed to further investigate biological differences between LUAD and LUSC. Overall, our method achieved classification results comparable to, if not better than, the traditional algorithms. It also identified multiple known biomarkers, and five potentially novel biomarkers with high discriminating values between LUAD and LUSC. Many of the biomarkers also exhibit significant prognostic potential, particularly in LUAD. Our study also unraveled distinct biological pathways between LUAD and LUSC.
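The overlap idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the method names, gene names, expression values, and the `min_methods` threshold are all hypothetical, and the abstract does not state how many methods must agree. The sketch keeps genes chosen by several selection methods, then scores each kept gene by its AUC for separating two subtypes, computed as the probability that a randomly chosen sample from one group has a higher value than a randomly chosen sample from the other.

```python
from collections import Counter


def overlapping_features(selections, min_methods):
    """Return features picked by at least `min_methods` selection methods.

    `selections` maps a method name to the list of features it selected.
    The threshold is an assumption of this sketch, not a value from the paper.
    """
    counts = Counter(f for chosen in selections.values() for f in set(chosen))
    return sorted(f for f, n in counts.items() if n >= min_methods)


def auc(scores_pos, scores_neg):
    """Area under the ROC curve for one feature.

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical selections from three unnamed feature selection methods.
selections = {
    "method_a": ["GENE1", "GENE2", "GENE3"],
    "method_b": ["GENE2", "GENE3", "GENE4"],
    "method_c": ["GENE3", "GENE2", "GENE5"],
}
consensus = overlapping_features(selections, min_methods=3)

# Hypothetical normalized expression values per subtype.
expr_lusc = {"GENE2": [5.1, 6.0, 5.7], "GENE3": [2.0, 2.2, 1.9]}
expr_luad = {"GENE2": [1.2, 2.0, 1.5], "GENE3": [2.1, 1.8, 2.3]}

# Score each consensus gene by how well it separates the two groups.
gene_auc = {g: auc(expr_lusc[g], expr_luad[g]) for g in consensus}
print(consensus)
print(gene_auc)
```

In this toy data, a gene whose values are always higher in one subtype scores an AUC of 1.0, while a gene with interleaved values scores close to 0.5, which indicates little discriminating value.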

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
An overview of the experimental design. The scheme summarizes the selection methods and the numbers of overlapping genes they yield.
Figure 2
Venn diagram of the genes selected by PCA, mRMR, DGE, Lasso, and XGBoost, showing the overlap between the algorithms.
Figure 3
Heatmaps of the 131 genes selected for gene expression analysis (A) and the 17 genes selected as biomarker candidates (B). The x-axis represents the samples and the y-axis represents the genes.
Figure 4
Dot plots of the normalized gene expression distributions for the 17 biomarker candidates. The x-axis represents the NSCLC subtypes and the y-axis represents the normalized gene expression values.
Figure 5
ROC and AUC analysis of the discriminating potential of the upregulated (A,B) and downregulated (C) genes in the TCGA dataset. The x-axis is sensitivity, or true positive rate (TPR). The y-axis is 1 − Specificity, or false positive rate (FPR). A higher AUC indicates higher discriminating potential for the gene.
Figure 6
ROC and AUC validation of the 17 candidate biomarkers in the GSE28582 microarray dataset. (A,B) show the upregulated genes and (C) shows the downregulated genes. The x-axis represents sensitivity, or true positive rate (TPR). The y-axis is 1 − Specificity, or false positive rate (FPR). A higher AUC indicates higher discriminating potential for the gene.
Figure 7
The keratinization pathway is upregulated in LUSC. It is the most upregulated pathway according to Reactome analysis, with a p-value of 3.33E−15 and an FDR of 1.95E−12. The boxes partially highlighted in brown indicate the number of genes identified in the analysis that are associated with each box.
Figure 8
The peptide elongation pathway is downregulated in LUSC compared with LUAD. It is the most downregulated pathway according to Reactome analysis, with a p-value of 9.72E−6 and an FDR of 0.00157. The boxes partially highlighted in brown indicate the number of genes identified in the analysis that are associated with each box.
