Nature. 2026 Jan;649(8099):1282-1291.
doi: 10.1038/s41586-025-09812-3. Epub 2025 Dec 3.

Whole-genome landscapes of 1,364 breast cancers

Ryul Kim et al. Nature. 2026 Jan.

Abstract

Breast cancer remains a major global health challenge [1]. Here, to comprehensively characterize its genomic landscape and the clinical significance of genomic characteristics, we analysed whole-genome sequences from 1,364 clinically annotated breast cancers, with transcriptome data available for most cases. Our study expands the repertoire of oncogenic alterations and identifies novel driver genes, recurrent gene fusions, structural variants and copy number alterations. Timing analyses on copy number alterations suggest that genomic instability emerges decades before tumour diagnosis, and offer insights into early initiation of tumorigenesis. Pattern-driven genomic features, including mutational signatures [2], homologous recombination deficiency [3], tumour mutational burden and tumour heterogeneity scores [4], were associated with clinical outcomes, highlighting their potential utility as predictive biomarkers for clinical evaluation of treatments such as CDK4/6 and HER2 inhibitors, as well as adjuvant and neoadjuvant chemotherapy. These findings highlight the power of large-scale, clinically annotated whole-genome sequencing in advancing our understanding of how genomic alterations shape patient outcomes.

Conflict of interest statement

Competing interests: Y.S.J. and J.S.L. are co-founders of Inocras, a San Diego-based precision medicine company. Y.H.P. has received grants from MSD, AstraZeneca, Pfizer, Gencurix, Roche, Inocras and Novartis, and consulting fees from AstraZeneca, MSD, Pfizer, Eisai, Lilly, Roche, Gilead, Daiichi-Sankyo, Menarini, Everest and Novartis. R.K., J. Lim, B.B.-L.O., E.C.-S., S.L., B.R.L., Y.L., K.J.Y., Y.O.K., I.H.C., J.P., J. Kim, C.C., J.Y.S., H.L., M.K., H.P., I.J., B.Y. and W.-C.L. are employees of Inocras. The other authors declare no competing interests.

Figures

Fig. 1. Driver genes in breast cancer.
Seven driver gene identification algorithms (SMRegions, OncodriveFML, OncodriveCLUSTL, MutPanning, HotMAPS, dNdScv and CBaSE) were applied to identify protein-coding driver genes, generating a combined significance score and identifying 41 driver genes in breast cancer. Top to bottom: the number of affected cases for each gene, ranked by mutation frequency across the cohort; PAM50 subtype for each mutated gene; integrative clusters (IntClust) subtype for each mutated gene; types of point mutations; results from the seven driver gene detection algorithms; a combined significance score, summarizing the overall support for each gene as a driver; genes that were not observed in any external dataset were classified as putative novel drivers, highlighted in blue; and the number of external breast cancer cohorts (out of 14) in which each gene was reported. Del, deletion; ins, insertion; LumA, luminal A; lumB, luminal B; NA, not available; mut, mutation; subs, substitution.
Fig. 2. Mutational signatures in breast cancer.
a, Distribution of mutational signatures across breast cancer subtypes. Top, number of mutations per signature for individual samples. Bottom, heat map showing the percentage of individuals within each subtype for whom a signature contributes more than 10% of total mutations. b, Correlation between signatures based on Pearson’s correlation coefficient. c, HRD score distribution (top) and receiver operating characteristic curve for predicting germline BRCA1 and BRCA2 mutations (bottom). d, Distribution of HRD and HRP cancers across all samples and subtypes; outer tracks indicate germline or somatic homologous recombination pathway mutations. e, Germline and somatic mutations in the homologous recombination pathway in cases with HRD. f, Disease-free survival (DFS) in patients with TNBC according to homologous recombination pathway status after adjuvant anthracycline–cyclophosphamide chemotherapy. Left, Kaplan–Meier curves for HRD versus HRP. Right, multivariate Cox proportional hazards regression (two-sided Wald test, no adjustment for multiple comparisons; n = 89). Error bars indicate 95% confidence intervals (CIs) and the centre dot is defined as the hazard ratio. g, Distribution of germline deletion of APOBEC3A/3B in our cohort (left), a healthy Korean population (middle) and Europeans with breast cancer (right). h, TMB (left), ratio of C>G and C>T mutations in YpTpCpA versus RpTpCpA contexts (middle) and sequence logo plots for C>T and C>G mutations at TC sites (right) in carriers of APOBEC3A/3B germline deletions. nYTCA and nRTCA represent the number of mutations occurring in YpTpCpA and RpTpCpA sequence contexts, respectively. Box plots indicate median (middle line), first and third quartiles (box edges) and 1.5× IQR (whiskers). P values were calculated using a two-sided Student’s t-test. Het, heterozygote; hom, homozygote; WT, wild type. i, APOBEC3A/3B germline status according to APOBEC mutational signature. Bars indicate the proportion of germline status in groups (n = 30) ordered by signature intensity. In all panels, n refers to the number of participants.
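The nYTCA and nRTCA counts in panel h can be illustrated with a short sketch. It assumes each mutation is supplied as a 4-base context string in which the mutated cytosine is the third base (a hypothetical input format, not taken from the study); Y stands for a pyrimidine (C or T) and R for a purine (A or G) under the standard IUPAC codes.

```python
from collections import Counter

PYRIMIDINES = {"C", "T"}  # IUPAC code Y
PURINES = {"A", "G"}      # IUPAC code R


def count_tca_contexts(contexts):
    """Count mutations falling in YpTpCpA and RpTpCpA contexts.

    `contexts` is an iterable of 4-base strings (hypothetical format):
    two bases 5' of the mutated C, the C itself, and one base 3' of it.
    """
    counts = Counter()
    for ctx in contexts:
        ctx = ctx.upper()
        if len(ctx) != 4 or ctx[1:] != "TCA":
            continue  # not a TpCpA site
        if ctx[0] in PYRIMIDINES:
            counts["nYTCA"] += 1
        elif ctx[0] in PURINES:
            counts["nRTCA"] += 1
        # any other symbol (for example N) is ignored
    return counts["nYTCA"], counts["nRTCA"]


def ytca_rtca_ratio(contexts):
    """Return nYTCA / nRTCA, or None when nRTCA is zero."""
    n_ytca, n_rtca = count_tca_contexts(contexts)
    if n_rtca == 0:
        return None
    return n_ytca / n_rtca
```

For example, the contexts TTCA, CTCA and ATCA give nYTCA = 2 and nRTCA = 1.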
Fig. 3. Whole-genome structural variation landscape.
a, Genome-wide structural variation (SV) and copy number variation (CNV) landscape. Top, interchromosomal and intrachromosomal structural variant interactions across 5-Mb windows. Frequent events between chr. 8 and chr. 11 (top left) and between chr. 17 and chr. 20 (top right) are highlighted. Middle, number of samples with structural variants or super-enhancers (SEs) per window. Bottom, recurrent CNVs identified by GISTIC. b, Distribution of distal structural variant breakpoints (BPs) within 1 Mb of ERBB2 (grey) and breast cancer-specific super-enhancers (red). c, Structural variant breakpoint density within 1 Mb of ERBB2. Peaks at −99.0 kb (5′) and +91.7 kb (3′) indicate clustering near the gene. d, Effect of extragenic structural variations (<1 Mb from gene) on RNA expression. Genes with significant upregulation (log2(fold change) ≥ 1, q < 0.10) are highlighted. Dot size reflects the number of affected samples. e, CCF of extragenic structural variants affecting genes highlighted in d. f, Permutation test (10,000 iterations) shows that structural variant breakpoints (from d) are closer to super-enhancers than expected by chance (two-sided; P = 0.001). g, Left, genes that are downregulated by intragenic structural variants (q < 0.10). Well-known tumour suppressors (RB1, PTEN and RUNX1) are shown in bold. Right, PAM50 subtype distribution. h, Recurrent gene fusions (occurring in at least four participants). Number of patients and whole-transcriptome sequencing (WTS) support (left), and CCF (middle) and PAM50 (right) subtype for each fusion. i, NRG1 fusions in 17 patients in our cohort. Protein domain architecture of the N-terminal fusion partner and C-terminal NRG1 are shown. CCF values indicate the clonality of each fusion event. 
ADAM CR, ADAM family cysteine-rich domain; Del, deletion; DNA pol A exo 1, exonuclease 1 domain; Dup, duplication; glyco tran 10 N, glycosyltransferase family 10 domain; glyco transf 64, glycotransferase 64 domain; guanylate cyc 2, guanylate cyclase 2 domain; Inv, inversion; I-set, immunoglobulin-like domain; KH1, KH-type RNA-binding domain 1; Tra, translocation.
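The permutation test in panel f can be sketched as follows. The caption does not state the null model, so this example assumes breakpoints are redrawn uniformly along a single region of fixed length and that the test statistic is the mean distance to the nearest super-enhancer. The function names, the coordinate representation and the two-sided definition (deviation from the null mean) are all illustrative assumptions.

```python
import bisect
import random


def nearest_distance(pos, sorted_sites):
    """Distance from `pos` to the closest entry of a sorted, non-empty list."""
    i = bisect.bisect_left(sorted_sites, pos)
    candidates = []
    if i < len(sorted_sites):
        candidates.append(sorted_sites[i] - pos)
    if i > 0:
        candidates.append(pos - sorted_sites[i - 1])
    return min(candidates)


def permutation_test(breakpoints, enhancers, region_length,
                     n_iter=10_000, seed=0):
    """Return (observed mean distance, two-sided empirical P value).

    Null model (assumed): breakpoints placed uniformly at random in
    [0, region_length). Two-sidedness is measured as the absolute
    deviation from the mean of the null distribution.
    """
    rng = random.Random(seed)
    sites = sorted(enhancers)

    def mean_dist(points):
        return sum(nearest_distance(p, sites) for p in points) / len(points)

    observed = mean_dist(breakpoints)
    null = [
        mean_dist([rng.randrange(region_length) for _ in breakpoints])
        for _ in range(n_iter)
    ]
    null_mean = sum(null) / len(null)
    extreme = sum(
        abs(x - null_mean) >= abs(observed - null_mean) for x in null
    )
    # Add-one correction keeps the empirical P value above zero.
    p_value = (extreme + 1) / (n_iter + 1)
    return observed, p_value
```

With the add-one correction, the smallest P value attainable from 10,000 iterations is 1/10,001.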
Fig. 4. CNAs in breast cancer with clinical and evolutionary implications.
a–h, Focal CNAs, particularly ERBB2, and their clinical relevance. i–l, CNAs and their temporal dynamics with implications for therapeutic resistance. a, Genome-wide F-score plots highlight recurrent focal amplifications at chromosomes 8, 11 and 17. Orange rings indicate regions that are frequently amplified as ecDNA. b, Copy number profiles at the three major focal peaks. Black lines show mean copy number and grey shading indicates 95% confidence intervals. c, Amplification mechanisms of the oncogenes in b. BFB, breakage–fusion bridge. d, Distribution of PAM50 subtype, ERBB2 amplification type, copy number and expression across HER2 IHC and fluorescence in situ hybridization (ISH) groups. Amp, amplification. e, Clinicogenomic profiles of 75 HER2-positive breast cancers treated with neoadjuvant docetaxel–carboplatin–trastuzumab–pertuzumab (TCHP), stratified by pCR (n = 38) and non-pCR (n = 37). The top 10 most mutated genes are shown. ChT, chromothripsis; PR, progesterone receptor. f, Sankey plots showing associations of HER2 IHC 3+, ERBB2 copy number ≥33 and ecDNA status with pCR (two-sided chi-square test). g, Performance metrics comparing HER2 IHC 3+, ERBB2 copy number ≥33 and chromothripsis for predicting pCR. CN, copy number; LR, likelihood ratio; OR, odds ratio. h, Chromothripsis frequency in pCR versus non-pCR groups (two-sided chi-square test). i, Prognostic effect of 9p23 amplification in basal-like breast cancer. n = 252. Top, Kaplan–Meier overall survival. Bottom, multivariate Cox regression. Two-sided Wald test without adjustment for multiple comparisons. Error bars indicate 95% confidence intervals and the centre dot is defined as the hazard ratio. j,k, Circos plots and CNA timing curves for two cases with HRD: a triple-negative tumour (j) and a hormone receptor-positive tumour with a germline BRCA1 1-bp deletion (k). l, Integrative Genomics Viewer snapshot for patient 703, showing a somatic 8-bp deletion overlapping a germline 1-bp deletion in BRCA1 exon 14, which is likely to restore the reading frame. In all panels, n refers to the number of participants.
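The likelihood ratio (LR) and odds ratio (OR) named in panel g are standard quantities derived from a 2×2 table of marker status against pCR. The sketch below shows those textbook definitions; the function name is hypothetical and any counts passed to it are the caller's own, since the caption does not list the underlying table.

```python
def binary_marker_metrics(tp, fp, fn, tn):
    """Standard 2x2 metrics for a binary marker predicting a binary outcome.

    tp: marker-positive with the outcome (for example, pCR)
    fp: marker-positive without the outcome
    fn: marker-negative with the outcome
    tn: marker-negative without the outcome
    Assumes no cell combination makes a denominator zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
        "odds_ratio": (tp * tn) / (fp * fn),
    }
```

The odds ratio equals the positive likelihood ratio divided by the negative likelihood ratio, which gives a quick consistency check on reported values.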
Fig. 5. WGS-based biomarkers for first-line palliative treatment of metastatic breast cancer.
a–c, Swimmer plot (a), Kaplan–Meier curves (b) and multivariate Cox proportional hazard analysis (c) comparing PFS according to MATH score in patients with HER2-positive breast cancer who received first-line anti-HER2 treatment as palliative treatment. a, Patients are stratified into two groups: MATH < 40 (n = 27) and MATH ≥ 40 (n = 18). Bars represent individual patients, with progression events indicated in yellow. Bottom, clinicopathologic and genomic features, including treatment regimen, ER and PR expression status, ERBB2 transcript expression as copy number (CN) gain and TPM, Ki67 index, tumour ploidy, HRD score and MATH score. L, lapatinib; Pro, prospective; retro, retrospective; T, taxane; TP, taxane plus platinum. d, Genomic and molecular characteristics of patients with hormone receptor-positive breast cancer (n = 57) treated with first-line CDK4/6 inhibitors (CDK4/6i), stratified by progression status. Top panel displays patient recruitment type, CDK4/6i regimen, PAM50 subtype, Ki67 expression, ploidy, MATH score, TMB and HRD score. Bottom, somatic mutations in recurrently altered cancer-related genes. e, Kaplan–Meier survival curves comparing PFS between individuals with high versus low TMB. f, Kaplan–Meier survival curves comparing PFS between HRD and HRP cases. g, Multivariate Cox proportional hazards analysis. The model accounts for potential correlations between TMB and HRD by incorporating a HRD:TMB interaction term. Multivariate Cox proportional hazards analysis was performed using a two-sided Wald test. Error bars indicate 95% confidence intervals, with the centre dot defined as the hazard ratio. AI, aromatase inhibitor; GNRH, gonadotrophin-releasing hormone; SERD, selective oestrogen receptor degrader. In all panels, n refers to the number of participants included in the analysis.
Extended Data Fig. 1. Comprehensive overview of the genomic and molecular characteristics of 1,364 breast cancer samples.
Columns are ordered to visualize mutual exclusivity between alterations across samples, highlighting potential co-occurrence and exclusivity patterns among key genomic events. Rows represent driver genes identified by seven different driver gene detection algorithms: SMRegions, OncodriveFML, OncodriveCLUSTL, MutPanning, HotMAPS, dNdScv, and CBaSE, providing a robust and integrative approach to detecting candidate driver mutations. Abbreviations: TMB, tumor mutational burden; ER, estrogen receptor; PR, progesterone receptor; HRD, homologous recombination deficiency; MATH, mutant-allele tumor heterogeneity; SBS, single base substitution; ID, indel; SV, structural variation.
Extended Data Fig. 2. Rare but putatively novel driver genes identified in this study.
Each lollipop plot represents the distribution of mutations along the protein-coding sequence of each gene: (a) BCL11B, (b) RREB1, (c) RAF1, and (d) SPECC1. The x-axis corresponds to the amino acid position, while the y-axis indicates the count of samples in which a given mutation was identified. Circles denote mutation types and their frequency (number of samples in which the mutation was observed). Colored rectangles on the coding sequence represent distinct functional domains of each protein.
Extended Data Fig. 3. Genomic instability associated with TP53 mutation.
a–c, Distribution of homologous recombination deficient (HRD) and proficient (HRP) tumors (a), ploidy (b), and mutant allele tumor heterogeneity (MATH) scores (c) according to TP53 mutation status (TP53wt, wild-type; TP53mut, mutant). P-values were calculated using a two-sided chi-square test (a) and two-sided Student’s t-test (b, c). d–f, Kaplan–Meier survival curves stratified by TP53 mutation status and MATH score in the CUBRICS cohort with multivariate Cox regression analysis (d,e), and in the METABRIC cohort (f). In CUBRICS, MATHhigh was defined as MATH ≥ 40 and MATHlow as MATH < 40, whereas in METABRIC, MATHhigh and MATHlow were defined by the upper and lower quartiles, respectively. Both cohorts showed consistent associations between MATH, TP53 mutation status, and overall survival. Error bars in panel e represent 95% confidence intervals, with the centre defined as the hazard ratio. Notes: Box plots show median (line), first and third quartiles (box edges), and 1.5× the interquartile range (whiskers). In all panels, “n” indicates the number of patients included in the analysis.
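The MATH score used for the MATH ≥ 40 cut-off can be sketched from its published definition (Mroz and Rocco, reference 4): 100 times the scaled median absolute deviation (MAD) of the variant allele fractions, divided by their median. How variant allele fractions were selected and filtered in this study is not described in the caption, so the input list here is an assumption, and `is_math_high` is a hypothetical helper.

```python
from statistics import median

MAD_SCALE = 1.4826  # makes the MAD comparable to a standard deviation


def math_score(vafs):
    """MATH = 100 * (scaled MAD of VAFs) / (median VAF)."""
    med = median(vafs)
    mad = MAD_SCALE * median(abs(v - med) for v in vafs)
    return 100.0 * mad / med


def is_math_high(score, threshold=40.0):
    """Apply the MATH >= 40 cut-off stated for the CUBRICS cohort."""
    return score >= threshold
```

For variant allele fractions of 0.1, 0.2, 0.3, 0.4 and 0.5, the median is 0.3 and the unscaled MAD is 0.1, giving a MATH score of about 49.4, which falls in the MATHhigh group.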
Extended Data Fig. 4. Impact of APOBEC3A/3B germline deletion on mutational signatures and APOBEC gene expression.
a, Detection of APOBEC3A/3B germline deletion. Germline deletion status was determined using the depth ratio method, where R1 (d0/d1) represents the ratio of sequencing depth between the upstream 30 kbp region and the deletion region, and R2 (d0/d2) represents the ratio between the downstream 30 kbp region and the deletion region. b, Distribution of R1 and R2 values. The density plot shows distinct clustering of samples based on APOBEC3A/3B germline deletion status (wild-type, heterozygous deletion, and homozygous deletion), confirming that these metrics effectively differentiate deletion groups. c, RNA expression differences in APOBEC family genes by germline deletion status. Violin plots display the transcriptional impact of APOBEC3A/3B germline deletion on APOBEC gene expression levels, highlighting significant differences where applicable. Sample sizes: homozygous deletion (n = 121), heterozygous deletion (n = 539), wild-type (n = 549). d, Differences in mutational signatures by germline deletion status. Violin plots illustrate the number of mutations assigned to each COSMIC mutational signature across samples stratified by APOBEC3A/3B germline deletion status. Sample sizes: deletion (n = 736), wild-type (n = 628). Note: P-values were calculated using a two-sided Student’s t-test without adjustment for multiple comparisons. Box plots indicate median (middle line), first and third quartiles (edges).
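The depth ratio method in panel a can be sketched as below. The caption's notation is ambiguous, so this example assumes d0 is the mean depth over the deletion region, d1 the mean depth over the upstream 30 kbp flank and d2 the mean depth over the downstream 30 kbp flank. The cut-offs of 0.25 and 0.75 are illustrative placeholders, not values from the study, which separated the groups by clustering in panel b.

```python
def depth_ratios(d0, d1, d2):
    """Return (R1, R2) = (d0/d1, d0/d2).

    Assumed meaning: d0 = depth in the deletion region,
    d1 = depth in the upstream flank, d2 = depth in the downstream flank.
    """
    return d0 / d1, d0 / d2


def classify_deletion(d0, d1, d2, hom_cutoff=0.25, het_cutoff=0.75):
    """Call germline deletion status from the mean of R1 and R2.

    Under the assumed definition, ratios near 1 suggest wild type,
    near 0.5 a heterozygous deletion and near 0 a homozygous deletion.
    Both cut-offs are hypothetical.
    """
    r1, r2 = depth_ratios(d0, d1, d2)
    mean_ratio = (r1 + r2) / 2
    if mean_ratio < hom_cutoff:
        return "homozygous deletion"
    if mean_ratio < het_cutoff:
        return "heterozygous deletion"
    return "wild-type"
```

Using both flanks guards against a local depth artefact on one side of the deletion being mistaken for a copy number change.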
Extended Data Fig. 5. Transcriptomic impact of structural variations spanning chr8:35-40 Mb and chr11:65-70 Mb in luminal B breast cancer.
a, Volcano plot displaying differentially expressed genes between luminal B breast cancer cases with structural variations (SVs) spanning chr8:35-40 Mb and chr11:65-70 Mb (right) and those without these SVs (left). The x-axis represents log2 fold-change in gene expression, while the y-axis represents the -log10(q-value). b, Gene Set Enrichment Analysis results showing pathways enriched in luminal B breast cancer cases with these SVs. Positive normalized enrichment scores (NES) indicate upregulated pathways, including TNF-α signaling via NF-κB, TGF-β signaling, and epithelial-mesenchymal transition, which are known to contribute to tumor progression and metastasis. Conversely, downregulated pathways include oxidative phosphorylation, glycolysis, and fatty acid metabolism, suggesting a metabolic shift in tumors harboring these SVs.
Extended Data Fig. 6. Structural characterization of recurrent fusions identified in our cohort.
Illustrated are fusions involving TTC6/MIPOL1, BCL2L14/ETV6, PRKCA/CEP112, ESR1/CCDC170, AGO2/PTK2, GALNT17/AUTS2, and BRD4/NOTCH3.
Extended Data Fig. 7. Structural characterization of recurrent fusions identified in our cohort.
Illustrated are fusions involving TRAPPC9/PTK2, UHRF1BP1L/ANKS1B, FBXL20/IKZF3, PRKCA/RGS9, DLG2/TENM4, IL34/SF3B3, AGO2/TRAPPC9, IMMP2L/DOCK4, SLC39A11/SDK2, KAT6A/ANK1, and IKZF3/ERBB2.
Extended Data Fig. 8. Structural variations (SVs) affecting COSMIC cancer gene census genes.
The left panel shows the number of cases with SVs in each gene, categorized by PAM50 molecular subtype. The right panel displays the percentage distribution of different SV types. The data highlight genes recurrently affected by SVs in breast cancer, with tumor suppressor genes (e.g., PTEN, RB1, and RUNX1) and oncogenes (e.g., NRG1, ERBB4, and ESR1) prominently impacted. These results suggest that SVs may contribute to the dysregulation of key cancer-related genes across different breast cancer subtypes.
Extended Data Fig. 9. Focal amplification of key oncogenes and predictive role of ERBB2 focal amplification in neoadjuvant anti-HER2 therapy response.
a, Segment length (Mbp) and copy number (CN) gain of ERBB2, CCND1, ZNF703, and FGFR1, classified as focal amplification (≤3 Mbp, relative CN gain >3 compared to surrounding regions, absolute CN ≥ 7), broad amplification (CN gain ≥1 without focal), or none. Vertical and horizontal lines mark thresholds for focal amplification. b, Relationship between log10(CN) and log10(RNA expression; TPM) for the four oncogenes. Density shading (red = highest) and yellow dots (individual tumors) shown. c, Correlation matrix of RNA expression: lower = density plots, diagonal = TPM distribution, upper = Pearson correlation with significance (*P < 0.05, **P < 0.01, ***P < 0.001). d, Genome-wide structural plot (Yilong plot) showing focal ERBB2 amplification (CN = 68) in a HER2 IHC 0 tumor; structural analysis confirms extrachromosomal DNA. e, HER2 IHC 3+ case without focal ERBB2 amplification, with ERBB2 TPM of 43.5, lower than typical IHC 3+ tumors. f–h, Association of ERBB2 focal amplification with pathologic complete response (pCR) to neoadjuvant trastuzumab + pertuzumab (TransNEO cohort). f, Amplicon width vs. major allele CN in 168 patients. g, CN profiles in eight treated patients: pCR cases (brown) show higher focal amplification; non-pCR cases (grey) lack it. h, Sankey plot: all without focal amplification failed pCR; 75% (3/4) with focal amplification responded. i–k, ERBB2 CN as predictor of pCR in 75 HER2-positive breast cancers treated with neoadjuvant TCHP. i, ROC curve (AUC = 0.819). j, CN distribution in responders vs. non-responders. k, Sankey plot showing pCR vs. non-pCR outcomes stratified by ERBB2 copy number (≥33) and chromothripsis status in the CUBRICS cohort. “n” indicates the number of patients in each analysis.
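The amplification classes in panel a follow explicit thresholds, which can be written as a small rule. The caption does not say what baseline the "CN gain ≥1" criterion for broad amplification is measured against, so this sketch takes it as a separate input; the function and parameter names are hypothetical.

```python
def classify_amplification(segment_length_mbp, relative_cn_gain,
                           absolute_cn, cn_gain):
    """Classify a segment as 'focal', 'broad' or 'none'.

    Focal: length <= 3 Mbp, CN gain > 3 relative to surrounding
    regions, and absolute CN >= 7 (thresholds from the caption).
    Broad: CN gain >= 1 without meeting the focal criteria.
    `cn_gain` is the gain against the baseline used for the broad
    criterion, which is an assumption about the caption's meaning.
    """
    is_focal = (
        segment_length_mbp <= 3
        and relative_cn_gain > 3
        and absolute_cn >= 7
    )
    if is_focal:
        return "focal"
    if cn_gain >= 1:
        return "broad"
    return "none"
```

Note that the relative gain threshold is strict (greater than 3) while the length and absolute copy number thresholds are inclusive, so a 3 Mbp segment at copy number 7 with a relative gain of exactly 3 is not called focal.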
Extended Data Fig. 10. Long-segmental copy number changes in breast cancer.
Relative molecular timing of recurrent copy number amplifications (CNAs). From top to bottom: number of patients harboring each CNA; percentage of PAM50; integrative clusters (IntClust) molecular subtypes among samples with the respective CNA; box plots indicating median (middle line), first and third quartiles (edges) and 1.5x the interquartile range (whiskers); hazard ratio estimates with 95% confidence intervals; violin plots displaying the distribution of relative CNA timing (black dots represent the mean timing); and the G-score from GISTIC algorithm.
Extended Data Fig. 11. Prognostic impact of 9p23 amplification and copy number amplification timing in breast cancer.
a,b, Impact of 9p23 amplification on overall survival (OS) in TNBC within the METABRIC cohort. a, Kaplan-Meier survival curves comparing OS between patients with (n = 102) and without (n = 228) 9p23 amplification, demonstrating a poorer prognosis in those with amplification. b, Multivariate Cox regression analysis for OS, displaying hazard ratios and 95% confidence intervals (CI) for 9p23 amplification, tumor size (>50 mm vs. ≤50 mm), and lymph node metastasis status. A two-sided Wald test was performed without adjustment for multiple comparisons. Error bars indicate 95% confidence intervals, with the centre defined as the hazard ratio. c, Differential gene expression analysis based on 9p23 amplification status in basal-like breast cancer. Significantly upregulated genes in 9p23-amplified samples are shown in red, while those upregulated in non-amplified samples are shown in yellow. The x-axis represents log2 fold-change in expression, and the y-axis shows the -log10(q-value), indicating statistical significance. d, Copy number amplification timing in homologous recombination-proficient (HRP) and homologous recombination-deficient (HRD) samples. Each row represents a sample, and the x-axis represents the relative timing of amplification events. The color intensity indicates the number of amplification segments at a given time point, with darker shades representing a higher number of amplification events. e, Distribution of amplification duration in HRP and HRD samples. The violin plot compares the duration of amplification events, defined as the difference between the earliest and latest amplification times within a sample. Box plots indicate median (middle line), first and third quartiles (edges). The p-value was estimated by a two-sided Student’s t-test. Note: In all panels, “n” refers to the number of patients included in the analysis.
Extended Data Fig. 12. Gene set enrichment analysis (GSEA) of RNA expression data in homologous recombination-deficient (HRD) vs. homologous recombination-proficient (HRP) hormone receptor-positive breast cancer patients treated with CDK4/6 inhibitors as first-line palliative treatment.
The x-axis represents the normalized enrichment score (NES), indicating whether a pathway is upregulated (positive NES) or downregulated (negative NES) in HRD patients (n = 13) compared to HRP patients (n = 44). The y-axis lists the significantly enriched pathways. The size of each dot corresponds to -log10(q-value), with larger dots indicating stronger statistical significance. Pathways on the right (positive NES) are enriched in HRD tumors, whereas pathways on the left (negative NES) are enriched in HRP tumors.

References

    1. Siegel, R. L., Miller, K. D., Wagle, N. S. & Jemal, A. Cancer statistics, 2023. CA Cancer J. Clin. 73, 17–48 (2023).
    2. Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
    3. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).
    4. Mroz, E. A. & Rocco, J. W. MATH, a novel measure of intratumor genetic heterogeneity, is high in poor-outcome classes of head and neck squamous cell carcinoma. Oral Oncol. 49, 211–215 (2013).
    5. Jallah, J. K., Dweh, T. J., Anjankar, A. & Palma, O. A review of the advancements in targeted therapies for breast cancer. Cureus 15, e47847 (2023).