Comparative Study

Genome Res. 2020 Sep;30(9):1274-1290.

doi: 10.1101/gr.256701.119. Epub 2020 Sep 4.

Reconstruction of clone- and haplotype-specific cancer genome karyotypes from bulk tumor samples

Sergey Aganezov¹, Benjamin J Raphael¹

PMID: 32887685
PMCID: PMC7545144
DOI: 10.1101/gr.256701.119

Sergey Aganezov et al. Genome Res. 2020 Sep.

Affiliation

¹ Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA.

Abstract

Many cancer genomes are extensively rearranged with aberrant chromosomal karyotypes. Deriving these karyotypes from high-throughput DNA sequencing of bulk tumor samples is complicated because most tumors are a heterogeneous mixture of normal cells and subpopulations of cancer cells, or clones, that harbor distinct somatic mutations. We introduce a new algorithm, Reconstructing Cancer Karyotypes (RCK), to reconstruct haplotype-specific karyotypes of one or more rearranged cancer genomes from DNA sequencing data from a bulk tumor sample. RCK leverages evolutionary constraints on the somatic mutational process in cancer to reduce ambiguity in the deconvolution of admixed sequencing data into multiple haplotype-specific cancer karyotypes. RCK models mixtures containing an arbitrary number of derived genomes and allows the incorporation of information both from short-read and long-read DNA sequencing technologies. We compare RCK to existing approaches on 17 primary and metastatic prostate cancer samples. We find that RCK infers cancer karyotypes that better explain the DNA sequencing data and conform to a reasonable evolutionary model. RCK's reconstructions of clone- and haplotype-specific karyotypes will aid further studies of the role of intra-tumor heterogeneity in cancer development and response to treatment. RCK is freely available as open source software.

Figures

**Figure 1.**
Overview of the RCK algorithm. The inputs to RCK (white dotted boxes) are clone- and allele-specific copy numbers (*top left*) and novel adjacencies (*top right*) from bulk tumor samples, derived with existing tools from alignments of DNA sequencing reads (*top*). The RCK algorithm (blue shaded elements) builds a *diploid interval adjacency graph* integrating copy number and novel adjacency information (for details, see Methods). RCK then solves a mixed-integer linear program (MILP) to find an optimal assignment of segment copy numbers and novel adjacencies to alleles and clones, subject to copy number balance on segment ends and to evolutionary constraints from a generalized infinite sites model. Constraints on groups of novel adjacencies from third-generation sequencing technologies may optionally be included. The outputs of RCK are clone- and haplotype-specific cancer genome karyotypes.
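The copy number balance condition mentioned in the caption can be illustrated with a small check. This is a minimal sketch, not RCK's MILP: it assumes that balance means the copy number of a segment equals the total copy number of adjacencies at each of its ends, with a surplus allowed only at telomeres. The function names and the toy genome (a tandem duplication of segment 2) are hypothetical.

```python
from collections import defaultdict

def balance_excess(segment_cn, adjacency_cn):
    """Per extremity: segment copy number minus total incident adjacency copy number."""
    incident = defaultdict(int)
    for (u, v), cn in adjacency_cn.items():
        incident[u] += cn
        incident[v] += cn  # a self-loop {u, u} contributes twice to u
    return {
        (seg, end): cn - incident[(seg, end)]
        for seg, cn in segment_cn.items()
        for end in ("t", "h")
    }

def is_balanced(segment_cn, adjacency_cn, telomeres):
    """Assumed rule: zero excess everywhere, positive excess only at telomeres."""
    excess = balance_excess(segment_cn, adjacency_cn)
    return all(e == 0 or (e > 0 and ext in telomeres) for ext, e in excess.items())

# Toy haploid genome 1 2 2 3 (tandem duplication of segment 2).
segment_cn = {1: 1, 2: 2, 3: 1}
adjacency_cn = {
    ((1, "h"), (2, "t")): 1,  # reference adjacency
    ((2, "h"), (3, "t")): 1,  # reference adjacency
    ((2, "h"), (2, "t")): 1,  # novel adjacency created by the duplication
}
telomeres = {(1, "t"), (3, "h")}
print(is_balanced(segment_cn, adjacency_cn, telomeres))
```

Dropping the novel adjacency leaves both ends of segment 2 with an unexplained surplus copy, so the same check fails.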

**Figure 2.**
Results of RCK on simulated bulk tumor samples with two clones. (A) False negative rate (FNR) and false positive rate (FPR) of novel adjacencies used by RCK using adjacency utilization parameter P = 0.9 (RCK-0.9) and P = 0.75 (RCK-0.75). (B) Length-weighted segment copy number distances between input copy numbers ($\tilde{\bar{C}}'$) and karyotypes inferred by RCK.
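The precise definition of the length-weighted segment copy number distance is given in the paper's Methods and is not reproduced on this page. The sketch below assumes one plausible form: the sum over segments of segment length times the absolute difference in copy number, added over both alleles. The function name and all values are illustrative.

```python
def length_weighted_distance(lengths, cn_x, cn_y):
    """Assumed form: sum_j len_j * (|a_x - a_y| + |b_x - b_y|)."""
    total = 0
    for seg, length in lengths.items():
        (ax, bx), (ay, by) = cn_x[seg], cn_y[seg]
        total += length * (abs(ax - ay) + abs(bx - by))
    return total

# Hypothetical segment lengths (bp) and allele-specific copy numbers (A, B).
lengths = {1: 1000, 2: 500, 3: 2000}
input_cn = {1: (1, 1), 2: (2, 1), 3: (1, 0)}
inferred_cn = {1: (1, 1), 2: (3, 1), 3: (1, 1)}
print(length_weighted_distance(lengths, input_cn, inferred_cn))
```

Weighting by length means a one-copy disagreement on a long segment counts for more than the same disagreement on a short one.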

**Figure 3.**
Comparison of RCK and ReMixT on heterogeneous prostate cancer samples. (A) Length-weighted segment copy number distances between segment copy numbers from HATCHet and segment copy numbers output by ReMixT and RCK. (B) Fractions of novel adjacencies from input that are inferred to be present by ReMixT or RCK for each sample in the heterogeneous group. RCK used segment copy numbers from HATCHet as input and novel adjacency utilization parameter P = 1.0, 0.9, 0.75, and 0.5.

**Figure 4.**
ReMixT karyotypes from heterogeneous prostate cancer samples have numerous violations of the generalized infinite sites constraints. In A, C, and E, solid edges represent segment edges, black-dashed edges represent reference adjacency edges, and red dashed edges represent novel adjacency edges. Integer values indicate copy numbers of corresponding segment and adjacency edges. (A) An intra-genome violation of the homologous-extremity-exclusivity constraint. To achieve copy number balance, both homologous vertices $2_{A}^{h}$ and $2_{B}^{h}$ from genome G_i must be involved in novel adjacencies. (B) Number of novel adjacencies that violate the intra-genome homologous-extremity-exclusivity constraint in each cancer karyotype inferred by ReMixT in each sample. (C) An inter-genome violation of the homologous-extremity-exclusivity constraint. To achieve copy number balance, both homologous vertices $2_{A}^{h}$ and $2_{B}^{h}$ (in different genomes) must be involved in novel adjacencies. (D) The fraction x/y, where x is the number of novel adjacencies that violate the inter-genome homologous-extremity-exclusivity constraint (on at least one of the extremities involved in a novel adjacency) in ReMixT karyotypes, and y is the total number of novel adjacencies reported by ReMixT as being present in both genomes. (E) A violation of the intra-genome homologous-reciprocal-extremity-exclusivity constraint. To achieve copy number balance, both homologous-reciprocal vertices $2_{A}^{h}$ and $3_{B}^{t}$ must be involved in novel adjacencies. Inter-genome violations of the homologous-reciprocal-extremity-exclusivity constraint are also possible (Supplemental Fig. S17). (F) Fraction x/y, where x is the number of reciprocal locations with violations of either intra- or inter-genome (or both) homologous-reciprocal-extremity-exclusivity constraint in ReMixT karyotypes; and y is the total number of reciprocal locations that both have novel adjacencies in ReMixT karyotypes.
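The intra-genome case in panel A can be detected with a simple scan. This is a minimal sketch under the reading given in the caption: a violation occurs when both homologous copies of the same extremity (for example, $2_{A}^{h}$ and $2_{B}^{h}$) take part in novel adjacencies within one genome. The tuple representation `(segment, end, haplotype)` and the function name are hypothetical, and the formal constraint is defined in the paper's Methods.

```python
def hee_violations(novel_adjacencies):
    """Return (segment, end) locations used by novel adjacencies on both haplotypes."""
    used = {ext for adjacency in novel_adjacencies for ext in adjacency}
    other = {"A": "B", "B": "A"}
    return sorted({
        (seg, end)
        for (seg, end, hap) in used
        if (seg, end, other[hap]) in used
    })

# Hypothetical karyotype: 2^h is used on haplotype A and on haplotype B.
genome = [
    ((2, "h", "A"), (5, "t", "A")),
    ((2, "h", "B"), (7, "h", "A")),
    ((3, "t", "A"), (9, "t", "B")),
]
print(hee_violations(genome))
```

The inter-genome case in panel C could be checked the same way by pooling the extremities used across all genomes of a sample before the scan.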

**Figure 5.**
Evidence of complex k-break (k ≥ 3) rearrangements in metastatic prostate cancer. (A) Two complex rearrangements across two genomes in a heterogeneous sample. A 5-break rearrangement that produced four novel adjacencies {a, b, c, d} involving five reference adjacencies (X, R, L, O, and M), with novel adjacency a not present in genome G₂. A 3-break rearrangement that produced three novel adjacencies {e, f, j} involving three reference adjacencies (Y, Z, and T), with novel adjacency j not present in G₁. (B, *top*) A complex 5-break rearrangement on Chromosome 10 in the karyotype inferred by RCK on sample A31a. Only the four novel adjacencies, five reference adjacencies, and incident segments involved in the rearrangement are shown. Copy numbers ≤1 are omitted for clarity, and absent segments/adjacencies are shown as faded. (*Bottom*) The locations of the corresponding double-stranded DNA breakages for the 5-break on Chromosome 10, indicated as x|y for each reference adjacency {(x)^h, (y)^t}. Three reference adjacencies lie in/near genes: reference adjacency 102,756,[799|800] falls within the promoter region for gene *LZTS2*; reference adjacency 114,208,50[2|3] falls inside gene *VTI1A*; and reference adjacency 114,062,94[6|7] falls inside gene *TECTB*. (C) Number of complex k-break (k ≥ 3) rearrangements reported in RCK-reconstructed karyotypes using HATCHet and Battenberg copy number inputs with novel adjacency utilization parameter P = 0.9. Values of 0 are omitted for clarity.

**Figure 6.**
Segments, extremities, and copy number profiles for genomes. (A) A diploid reference genome R containing two pairs of homologous chromosomes: A Chromosomes are dark blue and dark green, and the homologous B Chromosomes are light blue and light green. Chromosomes are partitioned into consecutive segments labeled 1 through 12. (B, *top*) Reference genome R is a collection of concatenations of segments; the “flat” end of segment j corresponds to the tail extremity j^t, whereas the “pointy” end of each segment j corresponds to the head extremity j^h. Dashed lines correspond to reference adjacencies between adjacent extremities. The set $T(R) = \{1_{A}^{t}, 1_{B}^{t}, 5_{A}^{h}, 5_{B}^{h}, 6_{A}^{t}, 6_{B}^{t}, 12_{A}^{h}, 12_{B}^{h}\}$ of extremities is the telomere set. (*Bottom*) The diploid segment copy number profile C_R = (a, b) for the genome R with colors (dark/light blue/green) corresponding to A/B labeled segments. (C, *top*) A derived genome G obtained via multiple large-scale rearrangements from the reference genome R. Red dashed lines correspond to novel adjacencies, for example, $\{3_{A}^{h}, 7_{B}^{h}\}$. (*Bottom*) The diploid segment copy number profile C_G = (a, b) for the genome G with colors (dark/light blue/green) corresponding to A/B labeled segments. The set $T(G)$ of telomeres in the derived genome G is identical to the set $T(R)$ of telomeres in the reference genome R.

**Figure 7.**
Ambiguity and errors in inferring segment copy number (SCN) profiles for a heterogeneous sample S = (G₁, G₂) under different assumptions about the sample composition. (A) A two-genome proper sample S = (G₁, G₂): each genome G_i ∈ S is depicted as collections of adjacent blocks (*top*), and the corresponding sequences of signed blocks (*bottom*). (B) The copy number profile c = [c₁, c₂, c₃, c₄] inferred under the assumption that the sample is homogeneous (i.e., comprised of a single derived genome) and the reference genome is *haploid* (i.e., each segment has only a single haplotype in the reference). Each value c_j is the weighted average of the sums of haplotype-specific (or allele-specific) copy numbers $a_{i,j} + b_{i,j} = \hat{c}_{i,j} + \check{c}_{i,j}$ over the genomes G_i ∈ S. (C) Allele-specific copy number profiles $\hat{c} = [\hat{c}_{1}, \hat{c}_{2}, \hat{c}_{3}, \hat{c}_{4}]$ and $\check{c} = [\check{c}_{1}, \check{c}_{2}, \check{c}_{3}, \check{c}_{4}]$ inferred under the assumption that the sample is homogeneous and the reference genome is *diploid* (i.e., each segment has two haplotypes labeled A and B). Here, the entries $\hat{c}_{j}$ and $\check{c}_{j}$ for segment j are averages $(\hat{c}_{1,j} + \hat{c}_{2,j}) / 2$ and $(\check{c}_{1,j} + \check{c}_{2,j}) / 2$ of genome- and allele-specific copy number values. Note that the vectors $\hat{c}$ and $\check{c}$ do not preserve the true A/B label of each allele: dark blue are true counts of allele A and light blue are true counts of allele B. Here, segments 2 and 4 are *flipped*. (D) Genome-specific copy number profiles $c_{1} = [c_{1,1}, c_{1,2}, c_{1,3}, c_{1,4}]$ and $c_{2} = [c_{2,1}, c_{2,2}, c_{2,3}, c_{2,4}]$ inferred under the assumption that the sample is heterogeneous, but the reference genome is haploid. Here, the entry $c_{i,j}$ for a segment j and genome G_i is the sum $\hat{c}_{i,j} + \check{c}_{i,j}$ of allele-specific copy number values in a genome G_i. (E) Allele- and genome-specific copy number matrices $\tilde{C} = (\hat{C} = [\hat{c}_{1}, \hat{c}_{2}, \dots, \hat{c}_{n}]^{T}, \check{C} = [\check{c}_{1}, \check{c}_{2}, \dots, \check{c}_{n}]^{T})$ inferred under the assumption that the sample is heterogeneous and the reference genome is diploid. Segments 2 and 4 are flipped alleles: $(\check{c}_{1,2}, \hat{c}_{2,2}) = (a_{1,2}, b_{2,2})$ and $(\check{c}_{1,4}, \hat{c}_{2,4}) = (a_{1,4}, b_{2,4})$.
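The collapse described in panel B can be reproduced numerically. The sketch below computes, for each segment j, the weighted average over genomes of $a_{i,j} + b_{i,j}$. The copy number values and the equal weights are hypothetical, since the figure's actual values are not part of this page.

```python
def mixed_total_profile(a, b, weights):
    """c_j = sum_i w_i * (a[i][j] + b[i][j]) for genome weights w_i summing to 1."""
    n_segments = len(a[0])
    return [
        sum(w * (a[i][j] + b[i][j]) for i, w in enumerate(weights))
        for j in range(n_segments)
    ]

# Hypothetical allele-specific copy numbers for genomes G1 and G2, four segments.
a = [[1, 2, 1, 0],   # allele A in G1
     [1, 1, 1, 1]]   # allele A in G2
b = [[1, 1, 1, 1],   # allele B in G1
     [1, 0, 2, 1]]   # allele B in G2
weights = [0.5, 0.5]  # assumed equal genome fractions
print(mixed_total_profile(a, b, weights))
```

Segments 1 and 2 end up with the same averaged total even though the two genomes differ on segment 2, which is the kind of ambiguity the caption describes.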

**Figure 8.**
A DIAG $D(R, \tilde{A}_{N}) = (V, E)$ constructed on a set $\{1, 2, \dots, 12\}$ of segments, and a set $A(R) \cup H(\tilde{A}_{N})$ of adjacencies. The set $A(R)$ corresponds to reference adjacencies in a diploid reference R shown in Figure 6B, and the set $\tilde{A}_{N} = \{\{3^{h}, 7^{h}\}, \{2^{h}, 9^{h}\}, \{4^{t}, 8^{t}\}, \{4^{h}, 4^{h}\}, \{5^{t}, 8^{h}\}, \{3^{t}, 10^{t}\}, \{6^{h}, 11^{t}\}\}$ represents unlabeled novel adjacencies that were measured from a derived genome G shown in Figure 6C. Squares indicate telomere vertices $T(G) = T(R) \subseteq V$, and circles are non-telomere vertices. Solid edges correspond to segment edges in E_S, with dark blue/green edges corresponding to segments labeled A, and light blue/green edges corresponding to segments labeled B. Black-dashed edges are reference adjacency edges E_R, and red-dotted edges are novel adjacency edges E_N.
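As a rough illustration of how the novel adjacency edges E_N might arise from the unlabeled set, the sketch below assumes that $H(\tilde{A}_{N})$ contains every haplotype labeling (A or B on each extremity) of each unlabeled novel adjacency. That reading of H is an assumption. The formal definition is in the paper's Methods, and the representation here is hypothetical.

```python
from itertools import product

def labeled_realizations(adjacency):
    """All A/B labelings of an unlabeled adjacency {(seg, end), (seg, end)}."""
    (s1, e1), (s2, e2) = adjacency
    return {
        frozenset([(s1, e1, h1), (s2, e2, h2)])
        for h1, h2 in product("AB", repeat=2)
    }

# Unlabeled novel adjacencies listed in the caption.
unlabeled = [
    ((3, "h"), (7, "h")), ((2, "h"), (9, "h")), ((4, "t"), (8, "t")),
    ((4, "h"), (4, "h")), ((5, "t"), (8, "h")), ((3, "t"), (10, "t")),
    ((6, "h"), (11, "t")),
]

novel_edges = set()
for adjacency in unlabeled:
    novel_edges |= labeled_realizations(adjacency)
print(len(novel_edges))
```

Under this assumption an ordinary adjacency yields four labeled candidates, while the self-adjacency $\{4^{h}, 4^{h}\}$ yields three, because the A/B and B/A labelings coincide.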

**Figure 9.**
Derivation of extremities and novel adjacencies for input to RCK and ReMixT. (A) An example of derivation of coordinates that resembles a reciprocal signature in measured unlabeled novel adjacencies on a chromosome a. Positions p₁ = (a, 100, +) and p₂ = (a, 107, −) have reciprocal signature (i.e., $|coord_{1} - coord_{2}| = 7 < 50$ and $str_{1} = - \neq str_{2} = +$). Updated pair $\{p'_{1} = (a, 103, +), p'_{2} = (a, 104, -)\}$ of coordinates constitutes a reciprocal location. (B) An example of partitioning of a set $F = \{f_{1}, f_{2}, f_{3}, f_{4}\}$ of fragments from allele-specific copy number calls into a set S = {s₁, s₂, s₃, s₄, s₅, s₆, s₇, s₈} of segments. Extremities of segments in S correspond to either preprocessed coordinates of unlabeled novel adjacencies (e.g., $s_{1}^{h} = p'_{1}$, $s_{2}^{t} = p'_{2}$) or to the extremities of fragments in $F$ (e.g., $s_{3}^{h} = f_{2}^{h}$, $s_{4}^{t} = f_{3}^{t}$).
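The coordinate update in panel A can be sketched as follows. The 50 bp threshold and the opposite-strand test come from the caption. The midpoint rule (the + position moves to the floor of the midpoint, the − position to the next base) is inferred from the single worked example and is an assumption, as is the function name. The sketch also does not check which of the two positions lies upstream.

```python
def reciprocal_location(p1, p2, max_dist=50):
    """Snap two nearby opposite-strand positions to adjacent coordinates, or return None."""
    chr1, coord1, str1 = p1
    chr2, coord2, str2 = p2
    if chr1 != chr2 or str1 == str2 or abs(coord1 - coord2) >= max_dist:
        return None  # no reciprocal signature
    mid = (coord1 + coord2) // 2
    new_coord = {"+": mid, "-": mid + 1}  # assumed midpoint rule
    return (chr1, new_coord[str1], str1), (chr2, new_coord[str2], str2)

print(reciprocal_location(("a", 100, "+"), ("a", 107, "-")))
```

After the update the two positions sit on neighboring bases, so they can serve as the head and tail extremities of consecutive segments, as in panel B.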
