Comparative Study
Genome Res. 2020 Sep;30(9):1274-1290.
doi: 10.1101/gr.256701.119. Epub 2020 Sep 4.

Reconstruction of clone- and haplotype-specific cancer genome karyotypes from bulk tumor samples

Sergey Aganezov et al. Genome Res. 2020 Sep.

Abstract

Many cancer genomes are extensively rearranged with aberrant chromosomal karyotypes. Deriving these karyotypes from high-throughput DNA sequencing of bulk tumor samples is complicated because most tumors are a heterogeneous mixture of normal cells and subpopulations of cancer cells, or clones, that harbor distinct somatic mutations. We introduce a new algorithm, Reconstructing Cancer Karyotypes (RCK), to reconstruct haplotype-specific karyotypes of one or more rearranged cancer genomes from DNA sequencing data from a bulk tumor sample. RCK leverages evolutionary constraints on the somatic mutational process in cancer to reduce ambiguity in the deconvolution of admixed sequencing data into multiple haplotype-specific cancer karyotypes. RCK models mixtures containing an arbitrary number of derived genomes and allows the incorporation of information both from short-read and long-read DNA sequencing technologies. We compare RCK to existing approaches on 17 primary and metastatic prostate cancer samples. We find that RCK infers cancer karyotypes that better explain the DNA sequencing data and conform to a reasonable evolutionary model. RCK's reconstructions of clone- and haplotype-specific karyotypes will aid further studies of the role of intra-tumor heterogeneity in cancer development and response to treatment. RCK is freely available as open source software.

Figures

Figure 1.
Overview of the RCK algorithm. The inputs to RCK (white dotted boxes) are clone- and allele-specific copy numbers (top left) and novel adjacencies (top right) from bulk tumor samples that are derived from alignments of DNA sequencing reads (top) using existing tools. The RCK algorithm (blue shaded elements) builds a diploid interval adjacency graph integrating copy number and novel adjacency information (for details, see Methods). RCK then solves a mixed-integer linear program (MILP) to find an optimal assignment of segment copy numbers and novel adjacencies to alleles and clones, subject to copy number balance on segment ends and to evolutionary constraints from a generalized infinite sites model. Constraints on groups of novel adjacencies from third-generation sequencing technologies may optionally be included. The outputs of RCK are clone- and haplotype-specific cancer genome karyotypes.
Figure 2.
Results of RCK on simulated bulk tumor samples with two clones. (A) False negative rate (FNR) and false positive rate (FPR) of novel adjacencies used by RCK using adjacency utilization parameter P = 0.9 (RCK-0.9) and P = 0.75 (RCK-0.75). (B) Length-weighted segment copy number distances between input copy numbers ($\tilde{\bar{C}}$) and karyotypes inferred by RCK.
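The caption above does not spell out the length-weighted segment copy number distance. As a rough illustration only, the sketch below assumes it is the sum, over clones, alleles, and segments, of the absolute copy number difference multiplied by segment length; the function name, the data layout, and the absence of any normalization are all assumptions rather than the definition used by RCK.

```python
def length_weighted_scn_distance(lengths, profile_x, profile_y):
    """Sum of |copy number difference| * segment length (assumed definition).

    lengths:   list of segment lengths, one per segment.
    profile_x: list over clones of (allele_A, allele_B) per-segment copy numbers.
    profile_y: same layout as profile_x, for the profile being compared.
    """
    total = 0
    for (a_x, b_x), (a_y, b_y) in zip(profile_x, profile_y):
        for length, ax, bx, ay, by in zip(lengths, a_x, b_x, a_y, b_y):
            total += length * (abs(ax - ay) + abs(bx - by))
    return total


# Hypothetical toy data: one clone, two segments of length 100 and 50.
lengths = [100, 50]
input_cn = [([1, 1], [1, 1])]
inferred_cn = [([1, 2], [1, 0])]
print(length_weighted_scn_distance(lengths, input_cn, inferred_cn))  # 100
```

Under this assumed definition, only the second segment differs (by one copy on each allele), so the distance is 50 × 2 = 100.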
Figure 3.
Comparison of RCK and ReMixT on heterogeneous prostate cancer samples. (A) Length-weighted segment copy number distances between segment copy numbers from HATCHet and segment copy numbers output by ReMixT and RCK. (B) Fractions of novel adjacencies from input that are inferred to be present by ReMixT or RCK for each sample in the heterogeneous group. RCK used segment copy numbers from HATCHet as input and novel adjacency utilization parameter P = 1.0, 0.9, 0.75, or 0.5.
Figure 4.
ReMixT karyotypes from heterogeneous prostate cancer samples have numerous violations of the generalized infinite sites constraints. In A, C, and E, solid edges represent segment edges, black-dashed edges represent reference adjacency edges, and red-dashed edges represent novel adjacency edges. Integer values indicate copy numbers of corresponding segment and adjacency edges. (A) An intra-genome violation of the homologous-extremity-exclusivity constraint. To achieve copy number balance, both homologous vertices $2_A^h$ and $2_B^h$ from genome $G_i$ must be involved in novel adjacencies. (B) Number of novel adjacencies that violate the intra-genome homologous-extremity-exclusivity constraint in each cancer karyotype inferred by ReMixT in each sample. (C) An inter-genome violation of the homologous-extremity-exclusivity constraint. To achieve copy number balance, both homologous vertices $2_A^h$ and $2_B^h$ (in different genomes) must be involved in novel adjacencies. (D) The fraction x/y, where x is the number of novel adjacencies that violate the inter-genome homologous-extremity-exclusivity constraint (on at least one of the extremities involved in a novel adjacency) in ReMixT karyotypes, and y is the total number of novel adjacencies reported by ReMixT as being present in both genomes. (E) A violation of the intra-genome homologous-reciprocal-extremity-exclusivity constraint. To achieve copy number balance, both homologous-reciprocal vertices $2_A^h$ and $3_B^t$ must be involved in novel adjacencies. Inter-genome violations of the homologous-reciprocal-extremity-exclusivity constraint are also possible (Supplemental Fig. S17). (F) The fraction x/y, where x is the number of reciprocal locations with violations of the intra-genome or inter-genome (or both) homologous-reciprocal-extremity-exclusivity constraint in ReMixT karyotypes, and y is the total number of reciprocal locations that both have novel adjacencies in ReMixT karyotypes.
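The intra-genome homologous-extremity-exclusivity check described in panel A can be sketched as follows. This is a minimal illustration, not ReMixT or RCK code: the tuple encoding of an extremity as (segment, haplotype, end) and the function name are assumptions, and the input is taken to be the haplotype-labeled novel adjacencies present in a single genome.

```python
from collections import defaultdict


def homologous_extremity_violations(novel_adjacencies):
    """Return (segment, end) pairs whose A and B homologs both occur in novel adjacencies.

    novel_adjacencies: iterable of pairs of extremities, each extremity encoded
    as (segment, haplotype, end), e.g. (2, "A", "h") for the head of segment 2
    on haplotype A (a hypothetical encoding).
    """
    haplotypes_used = defaultdict(set)
    for adjacency in novel_adjacencies:
        for segment, haplotype, end in adjacency:
            haplotypes_used[(segment, end)].add(haplotype)
    return sorted(key for key, haps in haplotypes_used.items() if len(haps) > 1)


# Both homologous extremities of the head of segment 2 are used, as in panel A.
violating = [((2, "A", "h"), (5, "A", "t")), ((2, "B", "h"), (7, "B", "h"))]
print(homologous_extremity_violations(violating))  # [(2, 'h')]
```

The inter-genome variant in panel C could be checked the same way by pooling the novel adjacencies of all genomes in the sample before calling the function.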
Figure 5.
Evidence of complex k-break (k ≥ 3) rearrangements in metastatic prostate cancer. (A) Two complex rearrangements across two genomes in a heterogeneous sample. A 5-break rearrangement that produced four novel adjacencies {a, b, c, d} involving five reference adjacencies (X, R, L, O, and M), with novel adjacency a not present in genome $G_2$. A 3-break rearrangement that produced three novel adjacencies {e, f, j} involving three reference adjacencies (Y, Z, and T), with novel adjacency j not present in genome $G_1$. (B, top) A complex 5-break rearrangement on Chromosome 10 in the karyotype inferred by RCK on sample A31a. Only the four novel adjacencies, five reference adjacencies, and incident segments involved in the rearrangement are shown. Copy numbers ≤1 are omitted for clarity, and absent segments/adjacencies are shown as faded. (Bottom) The locations of the corresponding double-stranded DNA breakages for the 5-break on Chromosome 10, indicated as x|y for each reference adjacency $\{(x)^h, (y)^t\}$. Three reference adjacencies lie in or near genes: reference adjacency 102,756,[799|800] falls within the promoter region for gene LZTS2; reference adjacency 114,208,50[2|3] falls inside gene VTI1A; and reference adjacency 114,062,94[6|7] falls inside gene TECTB. (C) Number of complex k-break (k ≥ 3) rearrangements reported in RCK-reconstructed karyotypes using HATCHet and Battenberg copy number inputs with novel adjacency utilization parameter P = 0.9. Values of 0 are omitted for clarity.
Figure 6.
Segments, extremities, and copy number profiles for genomes. (A) A diploid reference genome R containing two pairs of homologous chromosomes: A Chromosomes are dark blue and dark green, and the homologous B Chromosomes are light blue and light green. Chromosomes are partitioned into consecutive segments labeled 1 through 12. (B, top) Reference genome R is a collection of concatenations of segments; the “flat” end of each segment j corresponds to the tail extremity $j^t$, whereas the “pointy” end of each segment j corresponds to the head extremity $j^h$. Dashed lines correspond to reference adjacencies between adjacent extremities. The set $T(R) = \{1_A^t, 1_B^t, 5_A^h, 5_B^h, 6_A^t, 6_B^t, 12_A^h, 12_B^h\}$ of extremities is the telomere set. (Bottom) The diploid segment copy number profile $C_R = (a, b)$ for the genome R with colors (dark/light blue/green) corresponding to A/B labeled segments. (C, top) A derived genome G obtained via multiple large-scale rearrangements from the reference genome R. Red dashed lines correspond to novel adjacencies, for example, $\{3_A^h, 7_B^h\}$. (Bottom) The diploid segment copy number profile $C_G = (a, b)$ for the genome G with colors (dark/light blue/green) corresponding to A/B labeled segments. The set T(G) of telomeres in the derived genome G is identical to the set T(R) of telomeres in the reference genome R.
Figure 7.
Ambiguity and errors in inferring segment copy number (SCN) profiles for a heterogeneous sample $S = (G_1, G_2)$ under different assumptions about the sample composition. (A) A two-genome proper sample $S = (G_1, G_2)$: each genome $G_i \in S$ is depicted as collections of adjacent blocks (top), and the corresponding sequences of signed blocks (bottom). (B) The copy number profile $c = [c_1, c_2, c_3, c_4]$ inferred under the assumption that the sample is homogeneous (i.e., comprised of a single derived genome) and the reference genome is haploid (i.e., each segment has only a single haplotype in the reference). Each value $c_j$ is the weighted average of the sums of haplotype-specific (or allele-specific) copy numbers $a_{i,j} + b_{i,j} = \hat{c}_{i,j} + \check{c}_{i,j}$ over the genomes $G_i \in S$. (C) Allele-specific copy number profiles $\hat{c} = [\hat{c}_1, \hat{c}_2, \hat{c}_3, \hat{c}_4]$ and $\check{c} = [\check{c}_1, \check{c}_2, \check{c}_3, \check{c}_4]$ inferred under the assumption that the sample is homogeneous and the reference genome is diploid (i.e., each segment has two haplotypes labeled A and B). Here, the entries $\hat{c}_j$ and $\check{c}_j$ for segment j are averages $(\hat{c}_{1,j} + \hat{c}_{2,j})/2$ and $(\check{c}_{1,j} + \check{c}_{2,j})/2$ of genome- and allele-specific copy number values. Note that the vectors $\hat{c}$ and $\check{c}$ do not preserve the true A/B label of each allele: dark blue are true counts of allele A and light blue are true counts of allele B. Here, segments 2 and 4 are flipped. (D) Genome-specific copy number profiles $c_1 = [c_{1,1}, c_{1,2}, c_{1,3}, c_{1,4}]$ and $c_2 = [c_{2,1}, c_{2,2}, c_{2,3}, c_{2,4}]$ inferred under the assumption that the sample is heterogeneous, but the reference genome is haploid. Here, the entry $c_{i,j}$ for a segment j and genome $G_i$ is the sum $\hat{c}_{i,j} + \check{c}_{i,j}$ of allele-specific copy number values in a genome $G_i$. (E) Allele- and genome-specific copy number matrices $\tilde{C} = (\hat{C} = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_n]^T, \check{C} = [\check{c}_1, \check{c}_2, \ldots, \check{c}_n]^T)$ inferred under the assumption that the sample is heterogeneous and the reference genome is diploid. Segments 2 and 4 are flipped alleles: $(\check{c}_{1,2}, \hat{c}_{2,2}) = (a_{1,2}, b_{2,2})$ and $(\check{c}_{1,4}, \hat{c}_{2,4}) = (a_{1,4}, b_{2,4})$.
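The averaging in panel B can be illustrated with a short sketch. The copy numbers below are invented toy values (they are not read from the figure), and the mixture weights and the function name are assumptions; the point is only that two genomes with different allele-specific profiles can yield a flat total profile once the sample is treated as homogeneous and haploid.

```python
def mixed_total_profile(genomes, fractions):
    """Weighted average, per segment, of total copy number a[j] + b[j] over genomes.

    genomes:   list of (a, b) pairs of per-segment allele-specific copy numbers.
    fractions: assumed mixture weight of each genome (should sum to 1).
    """
    n_segments = len(genomes[0][0])
    return [
        sum(f * (a[j] + b[j]) for (a, b), f in zip(genomes, fractions))
        for j in range(n_segments)
    ]


# Hypothetical two-genome sample with four segments.
g1 = ([1, 2, 1, 0], [1, 1, 1, 1])
g2 = ([1, 1, 1, 2], [1, 0, 1, 1])
print(mixed_total_profile([g1, g2], [0.5, 0.5]))  # [2.0, 2.0, 2.0, 2.0]
```

In this toy case the differences in segments 2 and 4 cancel out in the mixture, which is the kind of ambiguity the caption describes.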
Figure 8.
A diploid interval adjacency graph (DIAG) $D(R, \tilde{A}_N) = (V, E)$ constructed on a set $\{1, 2, \ldots, 12\}$ of segments, and a set $A(R) \cup H(\tilde{A}_N)$ of adjacencies. The set A(R) corresponds to reference adjacencies in a diploid reference R shown in Figure 6B, and the set $\tilde{A}_N = \{\{3^h, 7^h\}, \{2^h, 9^h\}, \{4^t, 8^t\}, \{4^h, 4^h\}, \{5^t, 8^h\}, \{3^t, 10^t\}, \{6^h, 11^t\}\}$ represents unlabeled novel adjacencies that were measured from a derived genome G shown in Figure 6C. Squares indicate telomere vertices $T(G) = T(R) \subseteq V$, and circles are non-telomere vertices. Solid edges correspond to segment edges in $E_S$, with dark blue/green edges corresponding to segments labeled A, and light blue/green edges corresponding to segments labeled B. Black-dashed edges are reference adjacency edges $E_R$, and red-dotted edges are novel adjacency edges $E_N$.
Figure 9.
Derivation of extremities and novel adjacencies for input to RCK and ReMixT. (A) An example of derivation of coordinates that resembles a reciprocal signature in measured unlabeled novel adjacencies on a chromosome a. Positions $p_1 = (a, 100, +)$ and $p_2 = (a, 107, -)$ have a reciprocal signature (i.e., $|coord_1 - coord_2| = 7 < 50$ and $str_1 = -str_2 = +$). The updated pair $\{p_1 = (a, 103, +), p_2 = (a, 104, -)\}$ of coordinates constitutes a reciprocal location. (B) An example of partitioning of a set $F = \{f_1, f_2, f_3, f_4\}$ of fragments from allele-specific copy number calls into a set $S = \{s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8\}$ of segments. Extremities of segments in S correspond to either preprocessed coordinates of unlabeled novel adjacencies (e.g., $s_1^h = p_1$, $s_2^t = p_2$) or to the extremities of fragments in F (e.g., $s_3^h = f_2^h$, $s_4^t = f_3^t$).
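The reciprocal-signature example in panel A can be reproduced with a small sketch. The 50 bp threshold and the example coordinates come from the caption; the requirement that the two strands be opposite is inferred from the example positions, and the midpoint rule used to derive the updated coordinates is an assumption that happens to reproduce the caption's numbers, not a documented preprocessing step. The `Position` type and function names are hypothetical.

```python
from typing import NamedTuple


class Position(NamedTuple):
    chrom: str
    coord: int
    strand: str  # "+" or "-"


def has_reciprocal_signature(p1, p2, max_distance=50):
    """Same chromosome, opposite strands, and coordinates closer than max_distance."""
    return (
        p1.chrom == p2.chrom
        and p1.strand != p2.strand
        and abs(p1.coord - p2.coord) < max_distance
    )


def to_reciprocal_location(p1, p2):
    """Move both positions to adjacent coordinates around their midpoint (assumed rule)."""
    left, right = sorted((p1, p2), key=lambda p: p.coord)
    mid = (left.coord + right.coord) // 2
    return left._replace(coord=mid), right._replace(coord=mid + 1)


p1 = Position("a", 100, "+")
p2 = Position("a", 107, "-")
if has_reciprocal_signature(p1, p2):
    print(to_reciprocal_location(p1, p2))
```

For the caption's positions this yields coordinates 103 (strand +) and 104 (strand -), matching the updated pair in panel A.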
