An autoencoder learning method for predicting breast cancer subtypes

Zahra Rostami et al. PLoS One. 2025 Jul 23;20(7):e0327773. doi: 10.1371/journal.pone.0327773. eCollection 2025.

Abstract

The heterogeneity of breast cancer poses several challenges for detection and treatment. With next-generation sequencing, we can now map the transcriptional profile of each patient's breast tissue, which has the potential to identify and characterize cancer subtypes. However, the high dimensionality of this transcriptomic data and the heterogeneity between the molecular profiles of breast cancers pose a barrier to identifying minimal markers and mechanistic consequences. In this study, we develop an autoencoder to identify a reduced set of gene markers that characterize the four major breast cancer subtypes with an accuracy of 82.38%. The reduced feature space created by our model captures the functional characteristics of each breast cancer subtype, highlighting mechanisms that are unique to each subtype as well as those that are shared. Our high prediction accuracy shows that our markers can be valuable for breast cancer subtype detection and have the potential to provide insights into the mechanisms associated with each subtype.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Schematic of our autoencoder-based feature extraction framework.
a) We collect data and categorize it into the four subtypes of breast cancer (BC), namely TNBC, HER2-enriched, luminal A, and luminal B. The gene expression data for each subtype is passed to the autoencoder one subtype at a time. b) For each subtype, we perform autoencoder-based feature inference and select the 3000 genes with the highest feature scores according to our algorithm. c) For each subtype, we run the autoencoder 30 times with a fixed seed and find the overlap of the resulting gene sets (each containing 3000 genes). We repeat this whole step with 16 different random seeds. d) Taking the overlap of the results from all 16 seeds for each subtype, we identify 704, 900, 158, and 865 marker genes for TNBC, HER2-enriched, luminal A, and luminal B cancers, respectively. e) The marker genes are examined via two routes: first, multiclass classification using random forest (RF); second, assessment of the protein-protein interactions (PPI) of the marker genes for each subtype separately. The latter involves MCL clustering of the PPI network derived from the marker genes of each cancer to capture distinct functional modules associated with each BC subtype.
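The overlap procedure in panels b) to d) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names and `toy_scores` are hypothetical, and `toy_scores` is a random stand-in for the autoencoder-derived feature scores, whose computation is not described in this caption. The toy uses 50 genes and a top-10 cutoff instead of the 3000-gene cutoff.

```python
import random


def top_k_genes(scores, k):
    """Return the set of k gene names with the highest feature scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def stable_markers(score_fn, seeds, runs_per_seed, k):
    """Intersect the top-k gene sets over all runs, then over all seeds."""
    per_seed = []
    for seed in seeds:
        run_sets = [
            top_k_genes(score_fn(seed, run), k) for run in range(runs_per_seed)
        ]
        # Genes that survive every run for this seed.
        per_seed.append(set.intersection(*run_sets))
    # Genes that survive every seed.
    return set.intersection(*per_seed)


def toy_scores(seed, run):
    """Hypothetical stand-in for autoencoder feature scores (assumption).

    The first 5 of 50 toy genes carry a strong signal; the rest are noise.
    """
    rng = random.Random(seed * 1000 + run)
    return {
        f"G{i}": (10.0 if i < 5 else 0.0) + rng.random() for i in range(50)
    }


markers = stable_markers(toy_scores, seeds=range(16), runs_per_seed=30, k=10)
print(sorted(markers))
```

In this toy setting the five signal genes always survive the intersection, while noise genes that enter the top 10 in one run are unlikely to persist across all runs and seeds. This is one plausible reading of why the final marker sets in panel d) are much smaller than 3000 genes.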
Fig 2. The loss curve representing the performance of the autoencoder for each BC subtype.
a) TNBC, b) HER2-enriched cancer, c) luminal A cancer, d) luminal B cancer. The x-axis shows the number of epochs. The curves indicate that the model fits the data well, does not overfit, and remains stable throughout training.
Fig 3. Receiver operating characteristic curves showing the performance of the model.
Fig 4. Comparing the functional profiles of the four BC subtypes using the genes derived from our model.
The plot shows the 18 most significant Gene Ontology terms. The node size represents the k/n ratio, where n is the size of the list of genes corresponding to a subtype (this number is shown in parentheses on the x-axis) and k is the number of genes within that list that are annotated to the node. The dot colors indicate the adjusted p-values.
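As a small illustration of the k/n ratio defined in this caption, the following Python sketch computes it for one subtype gene list and one ontology term. The function name and the gene identifiers are hypothetical placeholders, not data from the study.

```python
def gene_ratio(subtype_genes, term_genes):
    """Return k/n: the fraction of a subtype's gene list annotated to a term."""
    subtype = set(subtype_genes)
    n = len(subtype)                   # size of the subtype's gene list
    k = len(subtype & set(term_genes)) # genes in that list annotated to the term
    return k / n


# Hypothetical toy inputs: 4 marker genes, 1 of which is annotated to the term.
ratio = gene_ratio(["G1", "G2", "G3", "G4"], ["G2", "G9"])
print(ratio)
```

Because n differs between subtypes (for example 158 marker genes for luminal A versus 900 for HER2-enriched), the same k yields different node sizes across columns of the plot.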
Fig 5. Functional modules and super-modules associated with different BC subtypes.
The genes associated with TNBC, HER2-enriched, luminal A, and luminal B cancers are shown in green, blue, orange, and pink, respectively. a) Autophagy super-module (CAMKK2, shown in gray, is common to TNBC, HER2-enriched, and luminal B cancers), b) Ciliary trafficking machinery and cilium assembly super-module, c) Chromatin organization and remodeling super-module, d) NF-kappa B signaling module, e) Centrosome localization and biogenesis module.
