Nat Commun. 2025 Aug 9;16(1):7365.
doi: 10.1038/s41467-025-62459-6.

Non-coding genetic elements of lung cancer identified using whole genome sequencing in 13,722 Chinese

Dan Zhou et al. Nat Commun.

Abstract

A substantial portion of lung cancer-associated genetic elements in East Asian populations remains unidentified, underscoring the need for large-scale genome-wide studies, particularly on non-coding regulation. We conducted a whole genome sequencing (WGS)-based genome-wide scan in 13,722 Chinese individuals to identify regulatory elements associated with lung cancer. We verified common-variant-based loci by meta-analysis across the available East Asian studies. Integrating a genome-transcriptome reference panel of lung tissue in 297 Chinese, we bridged the variant-lung cancer associations, highlighting genes including TP63 and DCBLD1. Implementing the STAAR pipeline for rare variant aggregate analysis, we identified and replicated novel genes, including PARPBP, PLA2G4C, and RITA1 in the context of non-coding regulation. Adapting a deep learning-based approach, potential upstream regulators such as TP53, MYC, ZEB1, and NFKB1 were revealed for the lung cancer-associated genes. These findings offered crucial insights into the non-coding regulation for the etiology of lung cancer, providing additional potential targets for intervention.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Study overview.
A genome-wide scan was performed to probe lung cancer-associated elements across the whole allele frequency spectrum. A genome-wide association study was carried out to identify trait-associated common variants. Informed by the SNP-expression regulation in lung tissues, the variant-level signal was projected to the gene level by a transcriptome-wide association study. Following the STAAR pipeline, lung cancer-associated genes were revealed by rare variant-based gene-centric and fixed/dynamic sliding window analyses. Fine-mapping and enrichment analyses were implemented to provide additional insights into the etiology and genetic architecture.
Fig. 2. Whole genome sequencing variant calling, quality control, and principal component analysis.
The blue and red dots denote the 50 CHB/CHS samples before and after refinement (a: SNP, b: INDEL). Refinement largely increased the variant concordance (y-axis) between the officially released genotype and the genotype called by the current pipeline, especially for samples with low sequencing depths (x-axis). c, d show the association (linear regression) between the number of variants (x-axis) for each sample and the sequencing depth (y-axis) for relatively common variants (P = 2.78 × 10^−12) and rare variants (P = 1.00 × 10^−599), respectively. Principal component analysis shows the genetic background of the study samples by pooling the participants from the 1000 Genomes Project phase 3 in (e). In (f), PC1 and PC2 are visualized for lung cancer cases and controls in the discovery stage and the replication stage, respectively.
Fig. 3. Common variant-based genome-wide and transcriptome-wide association analyses for East Asian samples.
The meta-analyzed genome-wide association results (generalized linear mixed-effect model) are displayed in a Manhattan plot and a QQ plot in (a) and (b), respectively. The Manhattan plot for the transcriptome-wide association study (linear regression model) projects the SNP-level signal to the gene level by assuming gene expression regulation as the mediator. Each tested gene-tissue/cell type pair is marked as a dot, or as a hollow diamond if its significance passed the multiple testing adjustment. Genes with top-ranked associations are labeled, followed by the source on which the gene expression model was trained (c). The raw p-values are presented in (a–c). For each gene, only the tissue or cell type with the lowest p-value is labeled. The correlation between predicted and observed gene expression of TP63 in lung tissue samples is visualized in (d). Each dot denotes a lung tissue sample.
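As a rough illustration of how per-study association results can be pooled, the sketch below applies a standard inverse-variance-weighted fixed-effect meta-analysis to one variant. This is an assumption for illustration only: the caption does not say which meta-analysis model the authors used, and the function name and inputs are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Pool per-study effect sizes for one variant (illustrative sketch).

    betas: per-study log odds ratios (hypothetical inputs)
    ses:   matching standard errors
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and equal in length")
    # Each study is weighted by the inverse of its sampling variance.
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal null.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


if __name__ == "__main__":
    print(fixed_effect_meta([0.20, 0.15, 0.25], [0.05, 0.08, 0.10]))
```

Studies with smaller standard errors dominate the pooled estimate, which is why adding large cohorts tightens the combined signal.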
Fig. 4. Rare variant-based gene-centric and sliding/dynamic window scan across the genome.
a shows a flowchart for the rare variant-based analyses in the discovery and the replication stages. The -log(P) values of the gene-centric STAAR-O tests in the discovery stage are visualized in (b) (the raw p-values are presented). Each dot denotes a gene-category pair. Hollow triangles denote genes that passed the stratified FDR correction in the discovery stage and remained nominally significant either in the replication stage or in the UK Biobank WES-based results. Categorically matched (between the discovery and the replication stages) replicable genes are labeled with gene names and their categories. c displays the enhancer enrichment analysis results for the lung cancer-associated segments and their related genes identified by fixed and dynamic window-based tests. Two segments near FRMD6 were merged since they share a large proportion of positions. The enrichment test was performed to evaluate the representative levels of a segment overlapping an enhancer, given the background signal in multiple lung and immune-related cell lines.
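The stratified FDR correction mentioned above can be pictured as a false discovery rate adjustment applied separately within each functional category. The sketch below assumes the Benjamini-Hochberg procedure and a simple (gene, category, p-value) record layout. Both are assumptions for illustration, not the authors' documented implementation, and the gene names in the usage example are placeholders.

```python
from collections import defaultdict


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def stratified_bh(records):
    """Apply BH separately within each category.

    records: iterable of (gene, category, p_value) tuples (hypothetical layout)
    Returns {(gene, category): adjusted_p}.
    """
    by_category = defaultdict(list)
    for gene, category, p in records:
        by_category[category].append((gene, p))
    result = {}
    for category, items in by_category.items():
        adjusted = bh_adjust([p for _, p in items])
        for (gene, _), q in zip(items, adjusted):
            result[(gene, category)] = q
    return result


if __name__ == "__main__":
    demo = [("GENE_A", "promoter", 0.01),
            ("GENE_B", "promoter", 0.02),
            ("GENE_C", "enhancer", 0.03)]
    print(stratified_bh(demo))
```

Adjusting within strata keeps a category with many tests from diluting the signal in a category with few tests.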
Fig. 5. Deep learning-based causal variant mapping identifies potential regulators.
For lung cancer-associated genes that showed concordant signals in both the discovery and the replication stages, we performed causal variant mapping to identify potential upstream regulators. For each gene, transcription factors (TFs) whose binding motifs may be affected by a single-nucleotide variant (SNV) were linked to the gene with blue lines. Darker blue lines indicate more instances from different SNVs or from different lung/immune-related cell lines. a–d show the results for EFHD2, ENO1, PLA2G4C, and RITA1, respectively.
Fig. 6. Enrichment analysis for lung cancer-associated genes.
a By incorporating scRNA-seq data, cell type-specific enrichment analysis was performed to prioritize lung cancer-related cell types given the identified genes. The top heatmap shows the enrichment score for each gene-cell type pair. Hierarchical clustering was performed. The bottom row shows the significance of the enrichment estimated by the one-sided permutation test (“Methods”). b Using Gene Set Enrichment Analysis (GSEA), we revealed the convergence of lung cancer-associated genes identified by the WGS-based study with those from studies of other designs or using other omics data. The heatmap shows the significance (raw p-values) of enrichment between category-specific gene sets identified in the current WGS-based study and the genes reported from the COSMIC CSC gene list (with the prefix “cosmic”), GWAS catalog (with the prefix “GWAS”), differential gene expression analysis between tumor and normal lung tissues (up denotes a higher expression level in tumor samples), TWAS from blood and lung tissues, and somatic copy number variations in lung cancer samples. The WES-based UK Biobank gene-centric results were also included.
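To make the one-sided permutation test in (a) concrete, the sketch below compares the mean score of a gene set in one cell type against randomly drawn gene sets of the same size. The test statistic (mean score), the sampling scheme, and all names are assumptions for illustration. The paper's Methods define the actual procedure.

```python
import random


def permutation_enrichment_p(scores, gene_set, n_perm=1000, seed=0):
    """One-sided permutation p-value for enrichment of a gene set.

    scores:   {gene: score in one cell type} (hypothetical input)
    gene_set: genes of interest; only those present in `scores` are used
    Returns the fraction of random same-size gene sets whose mean score
    is at least the observed mean, with the usual +1 correction.
    """
    rng = random.Random(seed)
    genes = list(scores)
    members = [g for g in genes if g in gene_set]
    if not members:
        raise ValueError("no gene in gene_set has a score")
    k = len(members)
    observed = sum(scores[g] for g in members) / k
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, k)  # random gene set of the same size
        if sum(scores[g] for g in sample) / k >= observed:
            hits += 1
    # The +1 keeps the estimate away from an impossible p-value of zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    toy_scores = {f"g{i}": i for i in range(20)}
    print(permutation_enrichment_p(toy_scores, {"g17", "g18", "g19"}))
```

The test is one-sided because only random sets scoring at least as high as the observed set count against the null, so depletion is never reported as significant.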
