Front Physiol. 2022 Jun 23:13:905523.
doi: 10.3389/fphys.2022.905523. eCollection 2022.

ESRRG, ATP4A, and ATP4B as Diagnostic Biomarkers for Gastric Cancer: A Bioinformatic Analysis Based on Machine Learning

Qiu Chen et al. Front Physiol. 2022.

Abstract

This study combined multiple bioinformatics methods and machine learning techniques to identify potential hub genes with diagnostic value in gastric cancer. Candidate biomarkers were screened across multiple databases of gastric cancer-related genes, and gene expression files were obtained from the NCBI Gene Expression Omnibus (GEO) database. Three hub genes (ESRRG, ATP4A, and ATP4B) were identified through a combination of weighted gene co-expression network analysis (WGCNA), gene-gene interaction network analysis, and a supervised feature selection method. GEPIA2 was used to verify the differences in hub-gene expression between normal and cancer tissues at the RNA-seq level in the Genotype-Tissue Expression (GTEx) and The Cancer Genome Atlas (TCGA) databases. The objectivity of the potential hub genes was also verified by immunohistochemistry in the Human Protein Atlas (HPA) database and by a transcription factor-hub gene regulatory network. Machine learning (ML) methods, including data pre-processing, model selection and cross-validation, and performance evaluation, were applied to the hub-gene expression profiles in five GEO datasets and verified on a GEO external validation (EV) dataset. Six supervised learning models (support vector machine, random forest, k-nearest neighbors, neural network, decision tree, and eXtreme Gradient Boosting) and one semi-supervised learning model (label spreading) were established to evaluate the diagnostic value of the biomarkers. Among the six supervised models, the support vector machine (SVM) algorithm was the most effective according to the calculated performance metrics, with area under the curve (AUC) scores of 0.93 and 0.99 on the test and external validation datasets, respectively. Furthermore, the semi-supervised model could also learn and predict sample types, achieving an AUC score of 0.986 on the EV dataset even when only 10% of the samples in the five GEO datasets were labeled.
In conclusion, three hub genes (ATP4A, ATP4B, and ESRRG) closely related to gastric cancer were identified, and an ML diagnostic model for gastric cancer was built on them.
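
The supervised part of the workflow described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and uses synthetic three-feature data as a stand-in for the three hub-gene expression values; the parameter grid, split ratio, and random seeds are illustrative assumptions, not the authors' settings. It tunes an SVM by cross-validated grid search scored with the Matthews correlation coefficient (MCC), then reports AUC on a held-out test split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for hub-gene expression (3 features); NOT the GEO data.
X, y = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    class_sep=2.0, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Pre-processing and classifier in one pipeline so scaling is fit per CV fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical grid; the study's actual hyperparameter grid is not given here.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1, 1.0]}

# Select hyperparameters by cross-validated MCC.
search = GridSearchCV(
    pipeline, param_grid, scoring=make_scorer(matthews_corrcoef), cv=5
)
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out test split.
test_scores = search.decision_function(X_test)
test_auc = roc_auc_score(y_test, test_scores)
test_mcc = matthews_corrcoef(y_test, search.predict(X_test))

print("best params:", search.best_params_)
print(f"test AUC = {test_auc:.3f}, test MCC = {test_mcc:.3f}")
```

In practice, an external validation set would be scored with the same fitted `search` object, without any refitting on that set.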

Keywords: WGCNA; bioinformatics; diagnostic model; gastric cancer; machine learning.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Flowchart of this study.
FIGURE 2
Progress of the weighted gene co-expression network analysis in GSE66229. (A) Cluster dendrogram of 161 samples in GSE66229. (B) Soft thresholds of the best scale-free topological model fitting index (left) and mean connectivity (right) were determined. The red horizontal line represents R² = 0.86. (C) Dendrogram of all genes clustered in GSE66229. Gene clustering into modules is based on a topological overlap matrix. Assigned modules are colored on the bottom with gray denoting unassigned genes.
FIGURE 3
Heatmap of the relationship between module eigengenes and clinical traits of GSE66229. In the WGCNA labeled heatmap for GSE66229, each row represents a color-coded module eigengene, and the three columns represent the clinical traits of overall survival time (OST), overall survival status (OSS), and sample type, respectively. Each cell shows the Pearson correlation coefficient and p-value (in parentheses) for the corresponding module and trait, and the color of each cell represents the value of the correlation.
FIGURE 4
Gene–gene interaction network of the top-ranked 10% of genes in red modules.
FIGURE 5
Validation of three hub gene expressions in the GEPIA2 platform. (A) Validation of three hub gene expressions in the GEPIA2 platform. The red and gray boxes represent cancer and normal tissues in the TCGA and GTEx datasets, respectively. STAD, gastric cancer; p < 0.01 (GEPIA2 website). (B) Immunohistochemical staining of ESRRG, ATP4A, and ATP4B in the Human Protein Atlas (HPA) database. (C) Transcription factor–hub gene regulatory network of the most relevant factor in the Cytoscape plugin “iRegulon”.
FIGURE 6
Performance of the six supervised machine learning models on the test and EV sets. Hyperparameters of all six models were tuned with the GridSearchCV method according to the “MCC” metric, and the six best models were chosen after exploration of the whole grid. Predictions on the test and EV sets were made with the best models. The six models used in this study are, in order, support vector machine (SVM), k-nearest neighbors (KNN), decision tree (DT), random forest (RF), neural network (NN), and eXtreme Gradient Boosting (XGB). (A,B) Scores of accuracy, F1 score, MCC, precision, sensitivity, and specificity for the six models on the test and EV sets, respectively. (C,D) Four terms of the confusion matrix (TP, TN, FP, and FN) for the six models on the test and EV sets, respectively.
FIGURE 7
ROC curves for the predicted probability on the test and EV sets of all six machine learning diagnostic models: (A) SVM, (B) RF, (C) KNN, (D) NN, (E) DT and (F) XGB.
FIGURE 8
Performance of the semi-supervised machine learning model with various ratios of unlabeled data. Semi-supervised machine learning models are built with the label spreading (LS) algorithm. The ratios of randomly unlabeled samples are 50% (LS50), 60% (LS60), 70% (LS70), 80% (LS80), and 90% (LS90). For each ratio, the semi-supervised model is cross-validated 100 times by random permutation. (A,B) Performance of the semi-supervised machine learning models on all unlabeled data and on the EV dataset, respectively, with various ratios of unlabeled samples. Seven metrics are given, namely, accuracy, F1 score, MCC, precision, sensitivity, specificity, and AUC.
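
The label-spreading experiment summarized in Figure 8 can be sketched as follows. This is a minimal illustration, assuming scikit-learn's LabelSpreading and synthetic three-feature data in place of the hub-gene profiles; the kernel, its gamma value, the helper name auc_on_hidden_labels, and the use of 10 rather than 100 random permutations are all assumptions made to keep the example short. For each ratio, a random subset of labels is hidden, the model is fit on the partially labeled data, and AUC is computed on the hidden samples only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in for hub-gene expression (3 features); NOT the GEO data.
X, y = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0,
    class_sep=2.0, random_state=1,
)
X = StandardScaler().fit_transform(X)


def auc_on_hidden_labels(X, y, unlabeled_ratio, rng):
    """Hide a random fraction of labels, fit label spreading, score hidden samples."""
    n_hidden = int(round(unlabeled_ratio * len(y)))
    hidden = rng.permutation(len(y))[:n_hidden]

    y_partial = y.copy()
    y_partial[hidden] = -1  # scikit-learn marks unlabeled samples with -1

    # Kernel and gamma are illustrative choices, not the study's settings.
    model = LabelSpreading(kernel="rbf", gamma=1.0)
    model.fit(X, y_partial)

    # Column 1 holds the propagated score for class 1 of each training sample.
    scores = model.label_distributions_[hidden, 1]
    return roc_auc_score(y[hidden], scores)


rng = np.random.default_rng(0)
mean_auc = {}
for pct in (50, 60, 70, 80, 90):
    # 10 random permutations per ratio here; the caption reports 100.
    aucs = [auc_on_hidden_labels(X, y, pct / 100, rng) for _ in range(10)]
    mean_auc[f"LS{pct}"] = float(np.mean(aucs))

for name, value in mean_auc.items():
    print(f"{name}: mean AUC on unlabeled samples = {value:.3f}")
```

Scoring an external dataset would instead use the fitted model's predict_proba on the new samples, since label_distributions_ only covers the samples seen during fitting.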
