[Preprint]. 2025 May 26:2025.05.21.655279.
doi: 10.1101/2025.05.21.655279.

Gene set optimization for cancer transcriptomics using sparse principal component analysis

H Robert Frost. bioRxiv.

Abstract

A common approach for exploring pathway dysregulation in cancer is gene set (pathway) analysis of tumor transcriptomic data. Unfortunately, the effectiveness of cancer gene set testing is limited because most gene set collections model gene activity in normal tissue, which can differ significantly from gene activity within tumors. To address this challenge, we have developed a bioinformatics approach based on sparse principal component analysis (PCA) that optimizes existing gene set collections to reflect the pattern of gene activity in dysplastic tissue. We have used this technique to optimize the Molecular Signatures Database (MSigDB) Hallmark collection for 21 solid human cancers profiled via bulk RNA-seq by The Cancer Genome Atlas (TCGA). Demonstrating the biological utility of our approach, the average survival association of gene set members improves after optimization for nearly all cancer types and Hallmark gene sets.
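To make the idea of pruning a gene set with a sparse principal component concrete, the following is a minimal sketch and not the author's implementation. The abstract does not say which sparse PCA algorithm is used or how many genes are retained, so this example assumes a truncated power iteration (keep only the `k` largest absolute loadings at each step) as a stand-in, and `k`, the function names, and the toy expression data are all hypothetical.

```python
import math
import random


def covariance(rows):
    """Sample covariance matrix of a samples-by-genes table (list of lists)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def sparse_pc(cov, k, iters=200):
    """Approximate the first sparse PC by truncated power iteration.

    Assumption: this is a stand-in for whatever sparse PCA method the
    preprint uses. Each step multiplies by the covariance matrix, zeroes
    all but the k largest absolute loadings, and renormalizes.
    """
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        keep = set(sorted(range(p), key=lambda i: abs(w[i]), reverse=True)[:k])
        w = [w[i] if i in keep else 0.0 for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def optimize_gene_set(rows, genes, k):
    """Keep only the gene set members with nonzero sparse PC loadings."""
    loadings = sparse_pc(covariance(rows), k)
    return [g for g, x in zip(genes, loadings) if x != 0.0]


# Toy data (hypothetical): genes A, B, C share one latent factor, as a
# coherently active pathway would; D and E are low-variance noise.
random.seed(0)
genes = ["A", "B", "C", "D", "E"]
rows = []
for _ in range(50):
    f = random.gauss(0, 1)
    rows.append([2 * f + random.gauss(0, 0.1) for _ in range(3)]
                + [random.gauss(0, 0.1) for _ in range(2)])

kept = optimize_gene_set(rows, genes, k=3)
print(kept)
```

In this sketch the retained members are the genes that load on the dominant axis of variation in the expression data, which mirrors the abstract's goal of restricting a gene set to the members that are active in the profiled tissue.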

Conflict of interest statement

The authors have no conflicts of interest to declare.

Figures

Figure 1: Proportion of Hallmark gene set annotations retained after optimization on the bulk RNA-seq data for each TCGA cohort using 1 to 3 sparse PCs.
Figure 2: Proportions of the Hallmark gene sets by outcome class after optimization on the bulk RNA-seq data for each TCGA cohort using different numbers of PCs.
Figure 3: Proportions of the Hallmark gene sets by outcome class after optimization on the bulk RNA-seq data from distinct TCGA cohorts.
