Comparative Study
Cancer Res. 2025 May 15;85(10):1769-1783.
doi: 10.1158/0008-5472.CAN-24-1607.

The Lung Cancer Autochthonous Model Gene Expression Database Enables Cross-Study Comparisons of the Transcriptomic Landscapes Across Mouse Models

Ling Cai et al. Cancer Res. 2025.

Abstract

Lung cancer, the leading cause of cancer mortality, exhibits diverse histologic subtypes and genetic complexities. Numerous preclinical mouse models have been developed to study lung cancer, but data from these models are disparate, siloed, and difficult to compare in a centralized fashion. In this study, we established the Lung Cancer Autochthonous Model Gene Expression Database (LCAMGDB), an extensive repository of 1,354 samples from 77 transcriptomic datasets covering 974 samples from genetically engineered mouse models (GEMM), 368 samples from carcinogen-induced models, and 12 samples from a spontaneous model. Meticulous curation and collaboration with data depositors produced a robust and comprehensive database, enhancing the fidelity of the genetic landscape it depicts. The LCAMGDB aligned 859 tumors from GEMMs with human lung cancer mutations, enabling comparative analysis and revealing a pressing need to broaden the diversity of genetic aberrations modeled in the GEMMs. To accompany this resource, a web application was developed that offers researchers intuitive tools for in-depth gene expression analysis. With standardized reprocessing of gene expression data, the LCAMGDB serves as a powerful platform for cross-study comparison and lays the groundwork for future research, aiming to bridge the gap between mouse models and human lung cancer for improved translational relevance.

Significance: The Lung Cancer Autochthonous Model Gene Expression Database (LCAMGDB) provides a comprehensive and accessible resource for the research community to investigate lung cancer biology in mouse models.

Conflict of interest statement

The authors declare no potential conflicts of interest.

Figures

Figure 1. Overview of Sample Characteristics and Distribution in LCAMGDB.
a. Characteristics of individual datasets shown as pie charts. Each column represents a dataset, and each row corresponds to a specific attribute, with color-coding denoting the category. Attributes include Model Type, Age, Sex, Sample Type, Histology, and Primary or Metastasis status. Dark gray denotes missing data. b-c. Distribution of sample size (b) and gene feature number (c) across all datasets, shown as bar plots. d-i. Distribution of samples by Model Type (d), Age (e), Sex (f), Sample Type (g), Tissue Type (h), and Histology (i). Note that “Lung” under Tissue Type or Histology can include normal wild-type lungs as well as chemically treated lungs from toxicology studies and genetically modified non-wild-type lungs.
Figure 2. Summary of GEMM genotypes in LCAMGDB.
a. Landscape of genetic modifications in LCAMGDB GEMM tumors by dataset and histology. b. Sample count by standardized genotype in GEMM tumors with Kras mutation alone. Small boxes within the bars represent samples from different datasets. c. Sample count in GEMM tumors by single gene alteration. The value printed at the top of each bar is the total number of tumors with the specified gene altered; the number of unique standardized genotypes involving that gene is given in parentheses. d. Count of GEMMs by the number of altered genes. Bars represent the number of GEMMs with one to four manipulated genes, irrespective of manipulation method or mutation. e. GEMM tumor alterations in LCAMGDB vs. human lung cancer genetic aberrations in the AACR GENIE v15 database by gene. Selected human oncogenes and tumor suppressor genes not represented in LCAMGDB are highlighted in gray.
Figure 3. Data reprocessing by platform.
a. Hierarchical relationship of transcriptomic profiling technologies and platforms. 85% (1,152 samples) of the LCAMGDB gene expression data was reprocessed. b. Platforms with multiple studies reprocessed through a standardized workflow. Each box within the bars represents a single dataset. c. In principal component analysis using the 1,000 most variable genes from reprocessed RNA-seq data, the top two principal components account for 62% of the total variance. d-f. Distribution of 563 RNA-seq samples by source dataset (d), histology (e), and primary/metastasis status (f).
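The gene selection step mentioned in panel c can be illustrated with a minimal Python sketch. It assumes expression values are already normalized and stored as a mapping from gene symbol to per-sample values; the function names `top_variable_genes` and `variance_explained` and the toy data are hypothetical, and the caption does not state the normalization or PCA implementation actually used.

```python
from statistics import variance


def top_variable_genes(expression, n_top=1000):
    """Return the n_top genes with the largest sample variance.

    expression: dict mapping gene symbol -> list of values, one per sample.
    """
    variances = {gene: variance(values) for gene, values in expression.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_top]


def variance_explained(eigenvalues, n_components=2):
    """Fraction of total variance carried by the largest n_components
    eigenvalues of the covariance matrix (as reported for a PCA)."""
    top = sorted(eigenvalues, reverse=True)[:n_components]
    return sum(top) / sum(eigenvalues)


# Toy data for illustration only (not taken from LCAMGDB).
toy_expression = {
    "GeneA": [1.0, 1.1, 0.9, 1.0],
    "GeneB": [0.0, 5.0, 10.0, 2.0],
    "GeneC": [3.0, 3.5, 2.5, 3.0],
}
selected = top_variable_genes(toy_expression, n_top=2)
fraction = variance_explained([5.0, 1.2, 2.5, 1.3], n_components=2)
```

The selected genes would then be passed to a PCA routine; the fraction returned by `variance_explained` corresponds to the kind of figure quoted in the caption (62% for the top two components).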
Figure 4. Interactive visualization of gene expression across multiple datasets.
This figure shows how the web application lets users interrogate the expression of a selected gene, Cd274 (PD-L1), across a range of datasets. The "Depositor-processed" option uses the data as originally processed in the deposited datasets, which optimizes within-dataset comparisons. Users can tailor the analysis by applying filters via the dropdown menu. After the user selects the appropriate parameters and clicks "Submit", the application generates dot plots ordered by the statistical significance of the expression differences, as assessed by one-way ANOVA. Displayed here are the top 6 datasets from the full results, giving users a snapshot of the gene expression landscape within the application’s repository. Bars in each plot denote the group median.
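The ordering described above, datasets ranked by one-way ANOVA significance, can be sketched in standard-library Python as follows. This is an illustrative sketch, not the web application's code: the function names, the nested-dictionary input layout, and the toy values are all assumptions, and the F-distribution tail is computed here with a textbook continued-fraction evaluation of the incomplete beta function.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))  # even step
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        aa = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))  # odd step
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def one_way_anova(groups):
    """Return (F, p) for a list of groups; assumes nonzero within-group variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df1, df2 = k - 1, n_total - k
    f_stat = (ss_between / df1) / (ss_within / df2)
    # Upper tail of the F distribution expressed through the incomplete beta.
    p_value = _betainc(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f_stat))
    return f_stat, p_value


def rank_datasets(expression_by_dataset):
    """Order datasets by ascending ANOVA p-value (most significant first)."""
    results = []
    for name, groups in expression_by_dataset.items():
        f_stat, p_value = one_way_anova(list(groups.values()))
        results.append((name, f_stat, p_value))
    return sorted(results, key=lambda r: r[2])


# Toy values for illustration only (hypothetical dataset and group names).
toy = {
    "dataset_A": {"g1": [1, 2, 3], "g2": [2, 3, 4], "g3": [3, 4, 5]},
    "dataset_B": {"control": [1, 2, 3], "treated": [11, 12, 13]},
}
ranked = rank_datasets(toy)
```

Ranking on the p-value rather than the F statistic matters here because datasets differ in group count and sample size, so their F statistics are not directly comparable.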
Figure 5. Gene expression comparison by treatment and primary/metastasis status.
a. Expression of Cd19, revealing B cell infiltration in various treatment contexts. Data points are categorized by treatment condition under each genotype. b. Expression of Ezh2 in primary and metastatic tumor samples. The color gradient signifies the spectrum of metastatic progression stages. P-values from one-way ANOVA are indicated, and results are ordered by statistical significance. For conciseness, only the top 4 datasets out of 10 for Cd19 (a) and the top 2 out of 5 for Ezh2 (b) are included in the snapshots. Bars in each plot denote the group median.
Figure 6. Expression of Cd19 in reprocessed RNA-seq data.
Each dot represents a unique sample, colored according to histology, and ordered by the median expression of Cd19. The inputs are filtered to display primary tumors only. The group median is shown as a bar for each row.
Figure 7. Interactive visualization of gene expression in reprocessed data merged by platform.
a. PCA plot of reprocessed RNA-seq samples, color-coded by the expression of Ascl1, a neuroendocrine lineage transcription factor highly expressed in SCLC. The interactive tooltip identifies an outlier sample with elevated Ascl1 levels as an ADC sample from an Rb1/p53-deficient model featuring Fgfr1 activation. b. PCA plot colored by histology. Details are shown for an SCLC sample located near the NSCLC samples. This outlier sample has Ascl1 knocked out, which explains the loss of neuroendocrine gene expression that renders its transcriptomic profile more similar to NSCLC.
Figure 8. Transcriptome-wide group comparison of Myc-overexpressed tumors versus wild-type lung samples.
a. User interface for defining Group 1 (Myc-overexpressed tumors) and Group 2 (wild-type lung samples) using depositor-processed data. The panel allows users to refine input datasets and samples based on variables such as genotype, histology, and treatment, with a sample tree for finalizing sample selection. b-e. Example outputs from the downloadable Excel file, including the group stratification (b), the gene-level comparison from a two-sample t-test (c), and pathway enrichment analysis by hypergeometric test against the REACTOME pathway library for genes higher in lung (Group 2) (d) or higher in tumors (e).
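The hypergeometric enrichment test named in panels d-e can be written compactly. The sketch below is a generic one-sided over-representation test in standard-library Python; the function name, argument names, and toy numbers are hypothetical, and the application's choice of gene universe and any multiple-testing correction are not described in the caption.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_universe: number of genes tested
    n_pathway:  genes in the universe that belong to the pathway
    n_selected: genes in the list of interest (e.g. higher in tumors)
    n_overlap:  selected genes that belong to the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers for illustration only: 10 genes tested, 4 in the pathway,
# 3 genes selected, 2 of which fall in the pathway.
p_toy = hypergeom_enrichment_p(n_universe=10, n_pathway=4, n_selected=3, n_overlap=2)
```

Exact integer arithmetic via `math.comb` avoids rounding problems in the tail sum, at the cost of speed for very large gene universes.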
