Mol Med Rep. 2018 Mar;17(3):4281-4290.
doi: 10.3892/mmr.2018.8398. Epub 2018 Jan 9.

Feature genes in metastatic breast cancer identified by MetaDE and SVM classifier methods

Youlin Tuo et al. Mol Med Rep. 2018 Mar.

Abstract

The aim of the present study was to investigate the feature genes in metastatic breast cancer samples. A total of 5 expression profiles of metastatic breast cancer samples were downloaded from the Gene Expression Omnibus database and analyzed using the MetaQC and MetaDE packages in R. The feature genes between metastatic and non-metastatic samples were screened under the threshold of P<0.05. Based on the protein-protein interactions (PPIs) in the Biological General Repository for Interaction Datasets, the Human Protein Reference Database and the Biomolecular Interaction Network Database, the PPI network of the feature genes was constructed. The feature genes identified by topological characteristics were then used for support vector machine (SVM) classifier training and verification. The accuracy of the SVM classifier was then evaluated using another independent dataset from The Cancer Genome Atlas database. Finally, function and pathway enrichment analyses for genes in the SVM classifier were performed. A total of 541 feature genes were identified between metastatic and non-metastatic samples. The top 10 genes with the highest betweenness centrality values in the PPI network of feature genes were Nuclear RNA Export Factor 1, cyclin-dependent kinase 2 (CDK2), myelocytomatosis proto-oncogene protein (MYC), Cullin 5, SHC Adaptor Protein 1, Clathrin heavy chain, Nucleolin, WD repeat domain 1, proteasome 26S subunit non-ATPase 2 and telomeric repeat binding factor 2. Cyclin-dependent kinase inhibitor 1A (CDKN1A), E2F transcription factor 1 (E2F1) and MYC interacted with CDK2. The SVM classifier constructed from the top 30 feature genes was able to distinguish metastatic samples from non-metastatic samples [correct rate, specificity, positive predictive value and negative predictive value >0.89; sensitivity >0.84; area under the receiver operating characteristic curve (AUROC) >0.96]. The verification of the SVM classifier in an independent dataset (35 metastatic samples and 143 non-metastatic samples) revealed an accuracy of 94.38% and an AUROC of 0.958. Cell cycle-associated functions and pathways were the most significant terms for the 30 feature genes. An SVM classifier was constructed to assess the possibility of breast cancer metastasis, and it presented high accuracy in several independent datasets. CDK2, CDKN1A, E2F1 and MYC were indicated as the potential feature genes in metastatic breast cancer.
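
The performance measures named in the abstract (correct rate, sensitivity, specificity, positive and negative predictive value) all derive from a confusion matrix. The sketch below shows the arithmetic. An accuracy of 94.38% on 178 samples corresponds to 168 correct calls, but the abstract does not say how the 10 errors split between the two classes, so the counts used here are a hypothetical split chosen only to reproduce that accuracy; they are not the study's reported confusion matrix.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification measures from confusion-matrix counts.

    tp/fn: metastatic samples called metastatic / non-metastatic.
    tn/fp: non-metastatic samples called non-metastatic / metastatic.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,       # "correct rate"
        "sensitivity": tp / (tp + fn),       # recall on metastatic samples
        "specificity": tn / (tn + fp),       # recall on non-metastatic samples
        "ppv": tp / (tp + fp),               # positive predictive value
        "npv": tn / (tn + fn),               # negative predictive value
    }


# Hypothetical counts: 35 metastatic and 143 non-metastatic samples,
# with the 10 misclassifications split 5/5 (an assumption, not from the paper).
metrics = classification_metrics(tp=30, fn=5, tn=138, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With heavily imbalanced classes such as 35 versus 143, accuracy alone can look high even when sensitivity is modest, which is one reason to report the full set of measures.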

Keywords: breast cancer; metastasis; protein-protein interactions; feature gene; support vector machine classifier.

Figures

Figure 1.
Quality control results of the merged datasets from 5 microarray profiles (marked as 1–5) obtained via MetaQC analysis. The first principal component is presented on the x-axis, while the second principal component is shown on the y-axis. QC, quality control; IQC, internal QC; EQC, external QC; AQCg, accuracy QC; AQCp, precision of AQCg; CQCg, consistency QC; CQCp, precision of CQCg.
Figure 2.
Protein-protein interaction network of feature genes. Green nodes are the genes that exhibited higher expression in metastatic samples, while the purple nodes are those that exhibited lower expression in metastatic samples when compared with non-metastatic samples.
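
The abstract ranks nodes of this network by betweenness centrality. As a minimal sketch of that measure, the code below uses Brandes' algorithm for an unweighted, undirected graph. The only edges used are the three CDK2 interactions named in the abstract; the full network is not reproduced here, and the function name and the choice of unnormalized values are assumptions made for illustration, not the authors' implementation.

```python
from collections import deque


def betweenness_centrality(edges):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted, undirected)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in bc.items()}


# Interactions stated in the abstract: CDKN1A, E2F1 and MYC each interact with CDK2.
ppi_edges = [("CDK2", "CDKN1A"), ("CDK2", "E2F1"), ("CDK2", "MYC")]
centrality = betweenness_centrality(ppi_edges)
print(sorted(centrality.items(), key=lambda kv: -kv[1]))
```

In this toy subnetwork every shortest path between two partners passes through CDK2, so CDK2 is the only node with non-zero betweenness.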
Figure 3.
Distribution of node degrees in the protein-protein interaction network of feature genes. The x-axis is the log (degree) value and the y-axis is the number of nodes with that degree.
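
A degree distribution of this kind can be tabulated directly from an edge list. The sketch below counts how many nodes have each degree and pairs each degree with its logarithm. The caption does not state the base of the logarithm, so base 10 is an assumption, and the edge list is limited to the three CDK2 interactions named in the abstract rather than the full network.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return sorted (log10(degree), degree, node_count) rows for an undirected edge list."""
    # Ignore self-loops and duplicate edges so each interaction counts once.
    unique = {frozenset(e) for e in edges if e[0] != e[1]}
    degree = Counter()
    for pair in unique:
        for node in pair:
            degree[node] += 1
    counts = Counter(degree.values())
    return sorted((math.log10(k), k, n) for k, n in counts.items())


# Interactions stated in the abstract (not the full network).
rows = degree_distribution([("CDK2", "CDKN1A"), ("CDK2", "E2F1"), ("CDK2", "MYC")])
for log_k, k, n in rows:
    print(f"log10(degree)={log_k:.3f}  degree={k}  nodes={n}")
```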
Figure 4.
Accuracy and efficacy of the support vector machine classifier. (A) The accuracy and error ratio of the classifier at different gene numbers (top 10 to top 50). (B) The classification efficacy of the classifier constructed using the top 30 genes for samples in the GSE46928 dataset. Non-metastatic samples are marked in black and the metastatic samples are marked in red.
Figure 5.
Clustering heatmap of the top 30 genes and samples in the training dataset. The color gradient from red to green represents the changes in expression level from high to low. The bars represent the samples (orange refers to metastatic samples; purple refers to non-metastatic samples). Met, metastatic samples; Non, non-metastatic samples.
Figure 6.
Classification results on other microarray profiles, including (A) GSE29431, (B) GSE39494, (C) GSE43837 and (D) GSE46826. Non-metastatic samples are marked in black and metastatic samples are marked in red. The receiver operating characteristic curves of the classifier are displayed on the right-hand side. AUC, area under the curve.
Figure 7.
Classification effect of the support vector machine classifier on an independent sample from The Cancer Genome Atlas database. (A) The spot graph of the different samples (non-metastatic samples are marked in black and metastatic samples are marked in red). (B) The receiver operating characteristic curve and (C) the survival curve. AUC, area under the curve.
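
The AUC reported for a receiver operating characteristic curve equals the probability that a randomly chosen metastatic sample receives a higher classifier score than a randomly chosen non-metastatic sample, with ties counted as one half. The sketch below computes it from that definition. The scores are hypothetical values made up for illustration and are not data from the study.

```python
def auroc(scores_pos, scores_neg):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical decision scores (higher means "more likely metastatic").
metastatic_scores = [0.9, 0.8, 0.4]
non_metastatic_scores = [0.7, 0.3, 0.2, 0.1]
auc = auroc(metastatic_scores, non_metastatic_scores)
print(f"AUC = {auc:.4f}")
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred; a rank-based computation would be the usual choice for larger data.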
Figure 8.
Enriched functions of the 30 feature genes. Gene numbers are displayed on the x-axis. The color represents the -log (P-value), ranging from red (high) to blue (low).
