Nature. 2025 Oct;646(8083):190-197.
doi: 10.1038/s41586-025-09398-w. Epub 2025 Aug 13.

Clone copy number diversity is linked to survival in lung cancer

Piotr Pawlik et al. Nature. 2025 Oct.

Abstract

Both single nucleotide variants (SNVs) and somatic copy number alterations (SCNAs) accumulate in cancer cells during tumour development, fuelling clonal evolution. However, accurate estimation of clone-specific copy numbers from bulk DNA-sequencing data is challenging. Here we present allele-specific phylogenetic analysis of copy number alterations (ALPACA), a method to infer SNV and SCNA coevolution by leveraging phylogenetic trees reconstructed from multi-sample bulk tumour sequencing data using SNV frequencies. ALPACA estimates the SCNA evolution of simulated tumours with a higher accuracy than current state-of-the-art methods (refs. 1–4). ALPACA uncovers loss-of-heterozygosity and amplification events in minor clones that may be missed using standard approaches and reveals the temporal order of somatic alterations. Analysing clone-specific copy numbers in TRACERx421 lung tumours (refs. 5,6), we find evidence of increased chromosomal instability in metastasis-seeding clones and enrichment for losses affecting tumour suppressor genes and amplification affecting CCND1. Furthermore, we identify increased SCNA rates in both tumours with polyclonal metastatic dissemination and tumours with extrathoracic metastases, and an association between higher clone copy number diversity and reduced disease-free survival in patients with lung cancer.

Conflict of interest statement

Competing interests: A. Hackshaw has received fees for being a member of independent data monitoring committees for Roche-sponsored clinical trials and academic projects coordinated by Roche. N.M. holds patents related to determining human leukocyte antigen (HLA) LOH (PCT/GB2018/052004), determination of B cell fraction in mixed samples (PCT/EP2024/062999), determination of lymphocyte abundance in mixed samples (PCT/EP2022/070694), identifying responders to cancer treatment (PCT/GB2018/051912), targeting neoantigens (PCT/EP2016/059401), identifying patient response to immune checkpoint blockade (PCT/EP2016/071471) and predicting survival rates of patients with cancer (PCT/GB2020/050221) and has a patent pending in determining HLA disruption (PCT/EP2023/059039). C.S. acknowledges grant support from AstraZeneca, Boehringer-Ingelheim, Bristol Myers Squibb, Pfizer, Invitae (previously Archer Dx; collaboration in minimal residual disease sequencing technologies), Ono Pharmaceutical and Personalis; is chief investigator for the AZ MeRmaiD 1 and 2 clinical trials and is the Steering Committee chair; is co-chief investigator of the NHS Galleri trial financed by GRAIL and a paid member of GRAIL’s Scientific Advisory Board (SAB); receives consultant fees from Achilles (and is also an SAB member), Bicycle Therapeutics (SAB member and chair of Clinical Advisory Group), Genentech, Medicxi, China Innovation Centre of Roche (formerly Roche Innovation Centre, Shanghai), Metabomed (until July 2022), Relay Therapeutics (SAB member), Saga Diagnostics (SAB member) and the Sarah Cannon Research Institute; has received honoraria from Amgen, AstraZeneca, Bristol Myers Squibb, GlaxoSmithKline, Illumina, MSD, Novartis and Pfizer; has previously held stock or options in GRAIL, has stock or options in Bicycle Therapeutics and Relay Therapeutics at present, and has stock and is co-founder of Achilles Therapeutics; declares patent applications for methods for lung cancer detection (PCT/US2017/028013, US20190106751A1), methods for targeting neoantigens (PCT/EP2016/059401), methods for identifying patient response to immune checkpoint blockade (PCT/EP2016/071471), methods for identifying patients who respond to cancer treatment (PCT/GB2018/051912), methods for determining HLA LOH (PCT/GB2018/052004), methods for predicting survival rates of patients with cancer (PCT/GB2020/050221), methods and systems for tumour monitoring (PCT/EP2022/077987), and methods for analysis of HLA allele transcriptional deregulation (PCT/EP2023/059039); and is an inventor on a European patent application (PCT/GB2017/053289) relating to assay technology to detect tumour recurrence (this patent has been licensed to a commercial entity, and under their terms of employment, C.S. is due a revenue share of any revenue generated from such licence(s)). A.M.F. is a co-inventor on a patent application to determine methods and systems for tumour monitoring (PCT/EP2022/077987). J.F. is a member of the SAB at DAiNA.

Figures

Fig. 1. Overview of ALPACA algorithm.
a, Input: observed fractional copy numbers F̂ (where F denotes the average copy number of all cancer cells in the sample) for each genomic segment (Seg.), tumour phylogeny and clone proportions, U. The fraction of the octagonal cloneMap covered by each colour represents the clone proportion in that sample. VAF, variant allele frequency. b, The model aims to infer integer clone copy numbers, C, by optimizing two objectives: minimizing the number of samples with a predicted fractional copy number outside the confidence intervals of F̂; and minimizing the distance between the predicted and observed fractional copy numbers. The linear optimization is subject to evolutionary constraints, including persistent LOH on each tree path. c, The optimal solution is selected on the basis of a model selection procedure. d, ALPACA outputs clone- and allele-specific integer copy-number profiles for each genomic segment.
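The relationship described in panels a and b can be made concrete with a toy example. The sketch below is not ALPACA's implementation (which the caption describes as a linear optimization); it is a brute-force search over one allele of one segment. It assumes that the predicted fractional copy number of a sample is the proportion-weighted sum of the integer clone copy numbers, that the two objectives are ranked lexicographically, and that persistent LOH means a clone with zero copies passes zero copies to its children. All function names, the tree and the numbers are hypothetical.

```python
from itertools import product


def predicted_fractional_cn(clone_cn, sample_props):
    """Proportion-weighted mean copy number of one sample (assumed mixture model)."""
    return sum(c * u for c, u in zip(clone_cn, sample_props))


def respects_persistent_loh(clone_cn, parent):
    """Assumed constraint: a child of a clone with 0 copies must also have 0 copies."""
    return all(clone_cn[child] == 0
               for child, par in parent.items() if clone_cn[par] == 0)


def fit_segment(observed, intervals, props, parent, max_cn=4):
    """Brute-force search for integer clone copy numbers of one segment and allele.

    observed  : observed fractional copy number per sample
    intervals : (low, high) confidence interval per sample
    props     : props[s][k] = proportion of clone k in sample s
    parent    : child clone index -> parent clone index (tree edges)
    """
    n_clones = len(props[0])
    best_key, best_cn = None, None
    for cn in product(range(max_cn + 1), repeat=n_clones):
        if not respects_persistent_loh(cn, parent):
            continue
        predicted = [predicted_fractional_cn(cn, u) for u in props]
        # Objective 1: number of samples predicted outside their interval.
        n_outside = sum(not (lo <= p <= hi)
                        for p, (lo, hi) in zip(predicted, intervals))
        # Objective 2: total absolute distance to the observed values.
        distance = sum(abs(p - f) for p, f in zip(predicted, observed))
        key = (n_outside, distance)  # lexicographic ranking is an assumption
        if best_key is None or key < best_key:
            best_key, best_cn = key, cn
    return best_cn, best_key


# Hypothetical tumour: trunk clone 0 with two child clones, sequenced in two samples.
parent = {1: 0, 2: 0}
props = [[0.5, 0.5, 0.0],   # sample A
         [0.2, 0.0, 0.8]]   # sample B
observed = [1.5, 0.2]
intervals = [(1.4, 1.6), (0.1, 0.3)]

best_cn, best_key = fit_segment(observed, intervals, props, parent)
print(best_cn, best_key)
```

In this toy case the two samples constrain the three clones tightly enough that a single integer assignment reproduces both observed values exactly. Exhaustive enumeration grows exponentially with the number of clones, which is one reason a real method would need an optimization solver rather than this search.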
Fig. 2. Benchmarking ALPACA.
a, Accuracy comparison between ALPACA, HATCHet and cloneHD on MASCoTE simulations (n = 64 simulations). b, True (0–4, left) and predicted (0–4, right) copy numbers in simulated dataset. Values to the right of each flow represent the number and fraction of true copy numbers per clone- and allele-specific segment predicted correctly. c, Comparison of SCNA inference between CONIPHER + ALPACA and HATCHet2 + MEDICC2 (n = 150 simulations). d, Comparison of SCNA inference between CONIPHER + ALPACA and HATCHet2 + MEDICC2 applied to experimental WES dataset (n = 707 genomic segments). e, Heat maps showing reconstructed total copy-number profiles of the experimental dataset in Fig. 2d using single cell, ALPACA and HATCHet2. All tests in this figure are two-sided paired Wilcoxon. In data represented using box plots, the box represents the interquartile range (IQR) with the median line. Whiskers denote the lowest and highest values within 1.5 times the IQR from the first and third quartiles, respectively.
Fig. 3. ALPACA provides additional clone-level SCNA resolution.
a,b, ALPACA input and output for Tx421-PM cases CRUK0048 (a) and CRUK0022 (b). First two panels show input clone proportions in samples and bulk fractional copy numbers. The third panel shows the tumour phylogenetic tree annotated with ALPACA-predicted clone copy-number output. The green shaded clones are the primary-unique clones; the purple shaded clones are seeding and metastatic clones. The seeding clone node has a darker black outline on the tree. Mb, megabases.
Fig. 4. Timing of SNV and SCNA events in LUAD and LUSC.
a, Method for timing somatic alterations using a Bradley–Terry model. amp., amplification. b,c, Bradley–Terry relative ranking estimate for the most frequent events in Tx421-P LUAD (n = 225; b) and LUSC (n = 126; c) cohorts. Plots include the following annotations: the distribution of the mean and maximum (max.) phylogenetic cancer cell fraction (PhyloCCF) of the mutation clusters corresponding to the tree nodes that each event was assigned to, and the number of edges each event was assigned to, coloured by whether the edge was truncal or subclonal.
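For readers unfamiliar with the Bradley–Terry model named in this caption, the following is a minimal sketch of how relative rankings can be estimated from pairwise comparisons. It uses the standard iterative (minorization-maximization) update for Bradley–Terry strengths and is not taken from the paper's code. The input counts are hypothetical, and reading "event i won against event j" as "event i was placed before event j on a tree" is an assumption made for illustration.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from a square matrix of pairwise wins.

    wins[i][j] is the number of comparisons in which item i beat item j.
    Returns strengths normalised to sum to 1.
    """
    n = len(wins)
    strength = [1.0 / n] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (number of comparisons) / (sum of strengths).
            denom = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        norm = sum(updated)
        strength = [s / norm for s in updated]
    return strength


# Hypothetical counts: how often each event was ordered before each other event.
events = ["event_A", "event_B", "event_C"]
wins = [
    [0, 8, 9],  # event_A before event_B 8 times, before event_C 9 times
    [2, 0, 7],  # event_B before event_A 2 times, before event_C 7 times
    [1, 3, 0],  # event_C before event_A once, before event_B 3 times
]

strength = fit_bradley_terry(wins)
ranking = sorted(zip(events, strength), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

Under the assumed interpretation, a larger strength indicates an event that tends to occur earlier than the others. The update requires that every event has at least one win and one loss; otherwise its strength drifts to 0 or 1.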
Fig. 5. SCNA patterns in metastasis.
a, Schematic example of a phylogenetic tree showing clone classes accompanied by (1) single-allele ALPACA output for four genomic segments, (2) segment-specific copy number changes on each edge and (3) aggregation of segment-specific copy number changes to interval events per edge. CN, copy number. b, Cumulative number of SNVs on each edge versus the cumulative number of SCNAs for each phylogenetic tree in the Tx421-PM cohort. Numbers above trees represent the Cancer Research UK identifier. Patients are split by histology and branches are coloured on the basis of the clone classifications. c, Box plot comparing the number of SCNAs in seeding (n = 121) versus non-seeding (n = 434) primary clones (two-sided Wilcoxon test). The box represents the interquartile range (IQR) with the median line. Whiskers denote the lowest and highest values within 1.5 times the IQR from the first and third quartiles, respectively.
Fig. 6. CCD predicts poor patient survival.
a, CCD comparison (two-sided Wilcoxon) between tumours of patients with (n = 117 tumours) and without (n = 126 tumours, 3-yr follow-up period) metastatic disease. b, CCD comparison between tumours with only subclonal (n = 73 tumours) or truncal (n = 44 tumours) seeding. c,d, Kaplan–Meier curves showing difference in DFS between patients with tumours with greater or less than the median value of CCD (median CCD = 21.95; c) and with high, mid or low tertile values of CCD (d). The number of patients at risk in each group is indicated below each time point. CI, confidence interval. e, A multivariable Cox proportional hazards model including covariates age, stage, pack-years, histology, sex, adjuvant treatment status, CCD (increase per standard deviation) and fraction of the aberrant genome with subclonal SCNAs (SCNA-ITH (sample), increase per standard deviation). The measure of centre represents the hazard ratio estimate and its 95% confidence intervals are indicated in parentheses and represented by the error bars. All survival analyses were performed on n = 387 patients. pTNM, pathological tumour–node–metastasis; AIC, Akaike information criterion. *P < 0.05; **P < 0.01; ***P < 0.001. In data represented using box plots, the box represents the interquartile range (IQR) with the median line. Whiskers denote the lowest and highest values within 1.5 times the IQR from the first and third quartiles, respectively.
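The median split and Kaplan–Meier curves in panels c and d follow a standard recipe, sketched below in plain Python. This is an illustration only: the CCD values, follow-up times and event flags are made up, the function names are hypothetical, and placing tumours with a value equal to the median in the low group is an assumption.

```python
from statistics import median


def split_by_median(values):
    """Return (high, low) index lists; values equal to the median go to 'low'."""
    cutoff = median(values)
    high = [i for i, v in enumerate(values) if v > cutoff]
    low = [i for i, v in enumerate(values) if v <= cutoff]
    return high, low


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) at each event time.

    times  : follow-up time per patient
    events : 1 if the event (relapse or death) was observed, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e)
        n_leaving = sum(1 for ti in times if ti == t)
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving  # events and censored patients both leave the risk set
    return curve


# Hypothetical cohort of six patients.
ccd = [10.0, 30.0, 25.0, 15.0, 40.0, 12.0]
times = [5, 2, 3, 6, 1, 4]      # years of follow-up
events = [0, 1, 1, 0, 1, 1]     # 1 = event observed, 0 = censored

high, low = split_by_median(ccd)
km_high = kaplan_meier([times[i] for i in high], [events[i] for i in high])
km_low = kaplan_meier([times[i] for i in low], [events[i] for i in low])
print("high CCD:", km_high)
print("low CCD:", km_low)
```

A comparison between the two curves would normally be followed by a log-rank test or a Cox model, as in panel e; those steps are omitted here.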
Extended Data Fig. 1. ALPACA assumptions.
a–c, Variation of clustering consistency with varied SNV and CNA mutation rates, assuming neutral growth (a), with selection on CNAs only (b) and with selection on both CNAs and SNVs (c). For all pairs of mutation rates, consistency values are averaged across 10 simulations. d, Clustering concordance of mutations from TRACERx primary cohort cases between the full set of mutations and subsets. Each box plot represents one subset (n = 310), where 5%, 15%, 25%, or 50% of the genome with extensive SCNA has been removed.
Extended Data Fig. 2. CPU runtime per tumour in TRACERx421 cohort.
Scatter plots showing the total CPU runtime (in hours) in TRACERx421 cohort per tumour against number of clones (upper left), number of segments (upper right), ploidy (bottom left) and purity (bottom right). Spearman’s Rho and P values are annotated on each plot. Each point represents one tumour (n = 395).
Extended Data Fig. 3. Benchmarking ALPACA.
a, Accuracy comparison between ALPACA and HATCHet2 using MASCoTE simulations (n = 8 simulated tumours). Each data point represents one simulated tumour (Wilcoxon signed-rank test). b, Overview of the simulated cohort. c, Comparison of results between ALPACA and TUSV-ext applied to the simulated dataset (n = 88 simulated tumours) using total variation distance (Wilcoxon signed-rank test). d, Comparison of results between ALPACA and TUSV-ext applied to the simulated dataset (n = 88 simulated tumours) using the mean Hamming distance between the true copy number profile of each clone and the most similar (based on Hamming distance) copy number profile predicted by each model (Wilcoxon signed-rank test). e, Clone proportions in each sample of the analysed single cell breast tumour case. f, Sankey plot showing the predicted vs true copy number states in the single cell breast tumour experiment. Values in each flow represent the number and fraction of true copy number states (0–4, left) per clone- and allele-specific segment predicted correctly (0–5+ or 4, right). g, ALPACA output for the single cell breast tumour case. Top: allele A, bottom: allele B. Each row represents one clone. The first column contains a phylogenetic tree. The heatmap in the middle shows the true and predicted copy number states. Each clone track is divided into sections: upper (showing predicted copy numbers) and lower (showing true copy numbers). True copy number states for the trunk were not available. The heatmap (right) represents clone proportions in each sample (B, C, D and E); pie charts represent the fraction of correctly (blue) and incorrectly (red) solved segments per clone. Since the true data for the trunk were not available, pie charts show ‘NA’ for this clone.
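The matching metric described for panel d can be sketched in a few lines. This is an illustrative reading of the caption, not the benchmarking code: each true clone profile is compared with every predicted profile, the smallest Hamming distance is kept, and those minima are averaged. Using the raw count of differing segments (rather than a fraction) is an assumption, and the profiles below are hypothetical.

```python
def hamming(profile_a, profile_b):
    """Number of segments at which two equal-length copy number profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same segments")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def mean_min_hamming(true_profiles, predicted_profiles):
    """Mean, over true clones, of the distance to the closest predicted profile."""
    minima = [min(hamming(true, pred) for pred in predicted_profiles)
              for true in true_profiles]
    return sum(minima) / len(minima)


# Hypothetical copy numbers over four segments.
true_profiles = [
    [1, 1, 2, 2],
    [1, 0, 2, 3],
]
predicted_profiles = [
    [1, 1, 2, 2],
    [1, 0, 2, 2],
    [2, 2, 2, 2],
]

score = mean_min_hamming(true_profiles, predicted_profiles)
print(score)  # 0.5
```

Because each true clone is matched to its closest prediction independently, the metric does not require the two methods to output the same number of clones.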
Extended Data Fig. 4. ALPACA vs simple model.
a, ALPACA output for simulated case SIM127, allele A. Each row represents one clone. The first column contains a phylogenetic tree. Heatmap in the middle shows the true (top part of each track) and predicted (bottom part of each track) copy number states for each segment of each clone. Heatmap on the right hand side represents clone proportions in each sample, accompanied by pie charts representing the fraction of correctly (blue) and incorrectly (red) solved segments per clone. b, As in a but the solution was obtained with the simple model.
Extended Data Fig. 5. Example output.
a,b, ALPACA solution for alleles A (a) and B (b) for tumour case CRUK0628. Each row represents one clone. The first column contains a phylogenetic tree with the TP53 driver mutation annotated in the clone where it occurred. The heatmap in the middle shows predicted copy number states for each segment of each clone. The heatmap on the right-hand side represents clone proportions in each sample. c, Cross-genome plot for case CRUK0628 with tree (left-hand side) and clone proportions heatmap (right-hand side) as in a and b. Tracks in the middle show copy numbers for both alleles, highlighting the difference between child and parent clones. d, Left: correlation between fraction of arm-level LOH called by ALPACA and CharmTSG scores (Pearson R = 0.391, P = 0.014). Centre: correlation between fraction of arm-level LOH called by ALPACA, excluding LOH events called by sample-level analysis (i.e. where fractional copy number was below 0.5), and CharmTSG scores (Pearson R = 0.348, P = 0.03). Right: correlation between fraction of arm-level LOH called by sample-level analysis (i.e. where fractional copy number was below 0.5) and CharmTSG scores (Pearson R = 0.354, P = 0.027).
Extended Data Fig. 6. ALPACA enables evolutionary ordering of subclonal SNV and SCNA events in NSCLC.
a, Comparison of the proportions of subclonal occurrences of the most frequent events from the Tx421-P cohort in expanded vs non-expanded subclones, against a background calculated as the total number of subclonal occurrences of each event type (chromosome arm amplification, chromosome arm LOH, gene amplification and driver SNV) in expanded vs non-expanded subclones. Significant P values are annotated on the plot (Fisher’s exact test). b–e, Pairwise event ordering and Bradley–Terry relative ranking estimate for the most frequent events in the Tx421-PM LUAD (n = 65; b,d) and LUSC (n = 38; c,e) cohorts. The most frequent arm-level SCNAs, focal SCNAs, driver SNVs and metastatic seeding are included. Plots include the following annotations: the distribution of the mean and maximum phylogenetic cancer cell fraction (PhyloCCF) of the mutation clusters corresponding to the tree nodes that each event was assigned to, and the number of edges each event was assigned to, coloured by whether the edge was truncal or subclonal.
Extended Data Fig. 7. SCNA patterns in metastasis 1.
a, Normalised number of SNVs versus normalised number of SCNAs (Pearson’s correlation) for every subclonal edge on the phylogenetic tree in the TRACERx421 primary (left) and primary-metastatic (right) cohorts. b, Estimated coefficients and P value from a mixed-effects linear regression predicting the dependent variable, number of SCNAs on each tree edge (#SCNAs), from the fixed explanatory variables number of SNVs (#SNVs), metastatic clone class and histology, with tumour ID as a random effect. c, The total number of SCNA events (quantified as number of gain breakpoints + loss breakpoints separating the parent and child node), number of gains and number of losses (Methods) are compared between clones of different classes in the metastatic transition, for tumours in which the MRCA was non-seeding (n = 76 tumours, with the following number of clones per class: Non-seeding MRCA n = 76, Shared n = 56, Primary n = 697, Metastatic n = 242, Seeding n = 121). Significant P values (Wilcoxon test) are marked in red. d, Estimated coefficients and P value from a mixed-effects logistic regression predicting the binary dependent variable, whether a clone is metastasis-unique, from the fixed explanatory variables number of SNVs, number of gains, number of losses and histology, with tumour ID as a random effect. e, Fraction of seeding (purple) and non-seeding (green) clones with gain (top) or loss (bottom) at each genomic locus. f, Box plot comparing the ratio of the number of SCNAs to the number of SNVs for seeding clones (n = 121) versus non-seeding clones (n = 434), for tumours where the MRCA is not seeding (n = 76, two-sided Wilcoxon).
Extended Data Fig. 8. SCNA patterns in metastasis 2.
a, Left: Example CRUK trees with nodes labelled by their clone class. The box around case CRUK0083 indicates an example of a tumour case that has (1) subclonal seeding only and (2) non-seeding tree paths present. Right: The number of tumours for which both seeding and non-seeding tree paths are present is shown in a pie chart, as a subset of the full TRACERx421 primary-metastasis cohort. b, Chromosome 11 plot showing the number of tumours with gain of > 1 copy occurring on tree paths between the MRCA and the seeding clones (purple) and occurring on tree paths between the MRCA and non-seeding clones (green). c, Panels show the proportion of tumours affected by different SCNA events in seeding vs non-seeding paths (purple and green, respectively) compared to background of all genes of these types with 5 such events (Methods). Oncogenes are considered for gains, TSGs are considered for losses. The background counts of such events in all genes are shown as the top entry in each panel. P values (Fisher’s exact test) are annotated only for genes with P < 0.05. d, Left: Odds of occurrence of each of the event types on seeding paths vs non-seeding paths for oncogenes and TSGs compared to background of all genes. Right: Comparison of %SCNA events affecting all genes (with 5 such events) in seeding paths, split by SCNA event type and gene driver class. LOH (0|any state) describes the clone copy number state of 0 copies in one allele, and any copy number state in the other. LOH (0|1 state) describes the clone copy number state of 0 in one allele and 1 in the other (two-sided Wilcoxon). For the purpose of discovery, results in c and d are presented without multiple testing correction.
Extended Data Fig. 9. Modes of dissemination.
a, Box plot comparing SCNA-ITH (sample-level analysis, left) and mean number of SCNAs across all tree edges (clone-level analysis, right) for tumours in the TRACERx421 primary cohort between tumours exhibiting monoclonal (n = 86) and polyclonal (n = 40) dissemination patterns. Each point represents one tumour. b, Box plots comparing number of SCNA events per edge split between monoclonal and polyclonal seeding tumours, grouped by tree level. The total numbers of clones in monoclonal and polyclonal seeding tumours are n = 1170 and n = 677, respectively. c, Box plot comparison of SCNA count per clone in patients with exclusively extrathoracic (n = 20) and exclusively intrathoracic (n = 36) seeding. d, Box plot comparison of number of SCNAs, number of gains and number of losses per clone in patients with exclusively extrathoracic (n = 20) and exclusively intrathoracic (n = 36) seeding. All tests in this figure are two-sided Wilcoxon.
Extended Data Fig. 10. Clone copy number diversity predicts poor patient survival.
a, Diagram describing the calculation of the clone copy number diversity metric (CCD). b, Inferred clone copy number from allele A from example cases CRUK0717, which was classified as having high clone copy number diversity, and CRUK0003, which was classified as having low clone copy number diversity. c, Kaplan-Meier curve showing the difference in DFS between patients harbouring tumours with greater (red) or less (blue) than the median value of SCNA-ITH (sample) (median = 0.497). The number of patients at risk in each group is indicated below each timepoint. d, Kaplan-Meier curve showing the difference in DFS between tumours with high, mid, or low values of SCNA-ITH (sample). e, Kaplan-Meier curve showing the difference in DFS between patients diagnosed with lung invasive adenocarcinoma harbouring tumours with greater or less than the median value of CCD (median = 20.2). f, Kaplan-Meier curve showing the difference in DFS between patients diagnosed with lung squamous cell carcinoma harbouring tumours with greater or less than the median value of CCD (median = 23.5). g, Correlation between SCNA-ITH (sample) and CCD for tumours within each TNM stage. h, Z-normalised density of SCNA-ITH (sample) and CCD values across the patient cohort, by TNM tumour stage.

References

    1. Fu, X., Lei, H., Tao, Y. & Schwartz, R. Reconstructing tumor clonal lineage trees incorporating single-nucleotide variants, copy number alterations and structural variations. Bioinformatics 38, i125–i133 (2022).
    2. Zaccaria, S. & Raphael, B. J. Accurate quantification of copy-number aberrations and whole-genome duplications in multi-sample tumor sequencing data. Nat. Commun. 11, 4301 (2020).
    3. Myers, M. A. et al. HATCHet2: clone- and haplotype-specific copy number inference from bulk tumor sequencing data. Genome Biol. 25, 130 (2024).
    4. Kaufmann, T. L. et al. MEDICC2: whole-genome doubling aware copy-number phylogenies for cancer evolution. Genome Biol. 23, 241 (2022).
    5. Frankell, A. M. et al. The evolution of lung cancer and impact of subclonal selection in TRACERx. Nature 616, 525–533 (2023).