Nat Cancer. 2023 Nov;4(11):1561-1574.

doi: 10.1038/s43018-023-00643-7. Epub 2023 Oct 2.

Centuries of genome instability and evolution in soft-shell clam, Mya arenaria, bivalve transmissible neoplasia

Samuel F M Hart¹ ², Marisa A Yonemitsu¹ ², Rachael M Giersch¹, Fiona E S Garrett¹, Brian F Beal³ ⁴, Gloria Arriagada⁵ ⁶, Brian W Davis⁷ ⁸, Elaine A Ostrander⁹, Stephen P Goff¹⁰ ¹¹, Michael J Metzger¹² ¹³

Affiliations

¹ Pacific Northwest Research Institute, Seattle, WA, USA.
² Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA.
³ Division of Environmental and Biological Sciences, University of Maine at Machias, Machias, ME, USA.
⁴ Downeast Institute, Beals, ME, USA.
⁵ Instituto de Ciencias Biomedicas, Facultad de Medicina y Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago, Chile.
⁶ FONDAP Center for Genome Regulation, Santiago, Chile.
⁷ Department of Veterinary Integrative Biosciences, Texas A&M University School of Veterinary Medicine, College Station, TX, USA.
⁸ Department of Small Animal Clinical Sciences, Texas A&M University School of Veterinary Medicine, College Station, TX, USA.
⁹ Cancer Genetics and Comparative Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
¹⁰ Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA.
¹¹ Department of Microbiology and Immunology, Columbia University, New York, NY, USA.
¹² Pacific Northwest Research Institute, Seattle, WA, USA. metzgerm@pnri.org.
¹³ Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA. metzgerm@pnri.org.

PMID: 37783804
PMCID: PMC10663159
DOI: 10.1038/s43018-023-00643-7

Samuel F M Hart et al. Nat Cancer. 2023 Nov.

Abstract

Transmissible cancers are infectious parasitic clones that metastasize to new hosts, living past the death of the founder animal in which the cancer initiated. We investigated the evolutionary history of a cancer lineage that has spread through the soft-shell clam (Mya arenaria) population by assembling a chromosome-scale soft-shell clam reference genome and characterizing somatic mutations in transmissible cancer. We observe high mutation density, widespread copy-number gain, structural rearrangement, loss of heterozygosity, variable telomere lengths, mitochondrial genome expansion and transposable element activity, all indicative of an unstable cancer genome. We also discover a previously unreported mutational signature associated with overexpression of an error-prone polymerase and use this to estimate the lineage to be >200 years old. Our study reveals the ability of an invertebrate cancer lineage to survive for centuries while its genome continues to structurally mutate, likely contributing to the evolution of this lineage as a parasitic cancer.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. MarBTN distribution and sequencing.**
a, Locations of samples sequenced (circles) and disseminated neoplasia observations (indicated by x) along the east coast of North America. Circles are colored for healthy clams (black) and MarBTN sampled from the PEI (red) or USA (blue) coast. b,c, Image of healthy clam used to assemble reference genome (MELC-2E11) (b) and hemolymph of the same clam (c), with hemocytes extending pseudopodia. The healthy reference clam (open black circle from a) was included in WGS analysis. d, Hemolymph from a clam infected with MarBTN (FFM-22F10), with distinct rounded morphology and lack of pseudopodia of cancer cells (representative of similar images from n = 8 MarBTN samples in this study). Scale bars, 10 mm (clam) and 50 µm (hemolymph). e, Phylogeny of cancer samples built from pairwise differences of SNVs not found in healthy clams, excluding regions that show evidence of LOH. Numbers along branches indicate the number of SNVs unique to and shared by individuals in that clade. All nodes have 100 of 100 bootstrap support.

**Fig. 2. Unique mutational signature found in somatic mutations dates cancer to >200 years old.**
a, Trinucleotide context of SNVs found in healthy clams (top) and high-confidence somatic mutations in PEI (middle) or USA (bottom) sub-lineages, corrected for mutational opportunities in the clam genome. The trinucleotide order is the same as in b. b, De novo extracted mutational biases for SigS. c,d, Sig5′ (c) and SigS (d) attributed mutations per Mb (signature fitting estimates with fitting error) across USA MarBTN samples (n = 5) by sampling date. Results of linear regression with 95% CI (gray) overlaid. SNVs found in healthy clams, PEI MarBTN samples or LOH regions are excluded. e, Fraction of SNVs attributed to SigS from healthy clams (black), variants found in all MarBTN samples (gray) and high-confidence somatic mutations (colored). Variants found in all MarBTN samples are divided by whether they are found in healthy clams and whether they are homozygous (hmz) or heterozygous (htz). Dashed lines display SigS fraction estimates for likely somatic mutations and likely founder variants. f, Age estimate of the most recent common ancestor (MRCA) of the USA and PEI sub-lineages using Sig5′ and SigS and of the BTN origin from SigS mutations. g, dN:dS ratios (ratio of 1 indicates neutrality) for SNVs found in healthy clams (black), SNVs found in all MarBTN samples (gray) and high-confidence somatic mutations (colored) (n = 20,075,227, 7,676,209, 2,596,657, 320,715, 331,167 and 651,882 as shown from left to right). Error bars in all plots display 95% CI.

**Fig. 3. Widespread copy number gain and structural mutation.**
a, Copy number calls across clam genome, rounded to the nearest integer (black) and unrounded (gray) in 100-kB segments. The healthy clam is a representative individual and the MarBTN sub-lineages are averages of each individual sample from that sub-lineage, which were in close agreement. b, Summary of copy number states across entire genomes for two non-reference healthy clams and MarBTN sub-lineages. Gray lines display copy number summaries for individual samples within each sub-lineage, which are in close agreement. c, Number of SVs in each sample. The reference clam was excluded as one haplotype from that animal was used to build the reference genome and thus does not contain SVs. Values were normalized to the average number of SVs in non-reference healthy clams for each SV type (numbers below SV type labels). P values are from two-sided unequal variance t-test between MarBTN samples (n = 8) and non-reference healthy clams (n = 2). Exact P values are 1.9 × 10⁻⁵, 2.9 × 10⁻², 1.0 × 10⁻⁵ and 8.0 × 10⁻¹¹, respectively. Labels follow DELLY abbreviations of SV types: BND, translocations; DEL, deletions; DUP, tandem duplications; INV, inversions. Bars indicate means and error bars indicate s.d. d, Size distribution of tandem duplications in each non-reference sample. Dashed line indicates 11 kB. e, Telomere length estimated by TelSeq for each sample. f, Tandem duplicate copies of the mitochondrial D-loop region per sample. Healthy clams are black, MarBTN from PEI are red and MarBTN samples from USA are blue.

**Fig. 4. Somatic expansions of Steamer and other TEs.**
a, Phylogeny of all samples built from pairwise differences of Steamer insertion sites, colored by healthy (black), USA MarBTN (blue) and PEI MarBTN (red). Numbers along branches indicate the number of insertions unique to and shared by individuals in that clade; numbers on nodes indicate bootstrap support, with bootstrap values below 75 not shown. b, Logo plot of insertion bias relative to the 5-bp target site duplication (TSD) of all Steamer insertions, normalized by nucleotide content of the genome. c, Steamer insertion probability in annotated genome regions, normalized by read mapping rates and relative to full genome. Displayed for insertions found in all MarBTN samples but no healthy clams, and for insertions unique to each sub-lineage but shared by all individuals in that sub-lineage. Dashed line indicates expectation given random insertions. d, Volcano plot comparing copy number of all repeat elements in MarBTN and healthy clam samples by two-sided unequal variance t-test. Dashed lines correspond to significance threshold (P = 0.05, Bonferroni-corrected) and fivefold differences. Elements annotated as DNA transposons are marked in gray.

**Fig. 5. Expression indicates hemocyte origin and possible mutagenic pathways in MarBTN.**
a, Principal-component analysis of normalized expression across all genes, with PC1 separating MarBTN and hemocytes from all other tissues. b, Volcano plot of expression of polymerase genes (n = 28 genes) for MarBTN (n = 5 isolates) compared to hemocytes (n = 5 clams). c, Normalized expression, in reads per gene, of TP53, HSPA9 (mortalin) and BRCA1 for MarBTN (n = 5 isolates), hemocytes (n = 5 clams) and non-hemocyte tissues (n = 15: 5 tissues for three clams). Error bars display standard deviation. Differential expression comparison results from Wald test are displayed as *P < 0.05; **P < 1×10⁻⁵; NS, not significant. Exact P values, adjusted for multiple comparisons, are 5.5 × 10⁻¹, 6.8 × 10⁻¹, 8.4 × 10⁻⁸, 5.0 × 10⁻⁷, 3.3 × 10⁻² and 1.6 × 10⁻⁹, respectively.

**Extended Data Fig. 1. Minimal host DNA is found in cancer hemolymph samples.**
**(a)** Hemolymph images for the four clams in this study sampled 2018–21. The other seven clams sampled 2010–14 were reported in past studies by Arriagada & Metzger et al. (2014) and Metzger et al. (2015). Scale bars are 50 µm. Fraction of cancer cells detected by MarBTN-specific qPCR, as reported by Giersch et al. (2022), are included in the lower left of each image. Note that while this assay is highly sensitive for the detection of low levels of MarBTN infection in animals, the fraction is a ratio of two qPCR values and minor variation in qPCR values can lead to large variation in the fraction when it is close to 100% cancer. **(b)** We identified SNVs in mitochondrial DNA in each individual sample and used the median VAF of those SNVs to estimate the purity of the sample. Number of loci: 21, 20 and 13 for healthy clams as ordered in figure, 53 (PEI) and 46 (USA) likely somatic for MarBTN samples. **(c)** Since mitochondrial genome copy numbers may differ between host and MarBTN cells, we also identified homozygous nuclear SNVs in regions called as copy number 2 in both sub-lineages and used the median VAF of those SNVs to estimate the purity of the sample (number of loci: 250,000 for non-reference healthy clams, 15,000 MarBTN-specific loci for MarBTN samples). Values for pure samples would be expected to be slightly below one due to mapping/sequencing errors, as evidenced by the healthy clams, which serve as pure sample controls (black, all DNA is from one individual). In cancer samples, deviation below this near-one value is attributed to the presence of contaminating host DNA (DNA is a mixture of two individuals – the cancer and the host). Two MarBTN isolates that were excluded from this study due to high host DNA contamination are included on this plot as contaminated sample controls (gray). Both nuclear and mitochondrial marker calculations yield similar estimates of cancer cell purity of 96% or greater.
MtDNA has the advantage of all loci being ‘homozygous’ and much greater depth than nuclear, giving more resolution as to the exact cancer cell percentage. However, mtDNA copies per cell may vary from sample to sample and between host and cancer. We also extracted DNA from tissue samples for a subset of the USA cancers and estimated the fraction of cancer DNA disseminated into tissue using the same methodology for mitochondrial **(d)** and nuclear **(e)** loci. Tissue samples contain variable, and in some cases quite high, fractions of cancer DNA. This made genome-wide differentiation between host and cancer SNVs difficult in tissue and led us to not include paired tissue DNA in our analyses, instead relying on variant calling thresholds to eliminate host variants from our cancer variant calling pipelines. Box plots display ggplot defaults: median (center), interquartile range (box), and the less extreme of minima/maxima or 1.5 × interquartile range (whiskers).
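The purity estimate described in panels (b) and (c) can be illustrated with a minimal sketch. It assumes that, at loci where the cancer is homozygous for a cancer-specific allele, the variant allele fraction (VAF) approximates the fraction of cancer DNA in the sample. The function name `estimate_purity` and the read counts are hypothetical and are not taken from the study's pipeline.

```python
from statistics import median


def estimate_purity(alt_reads, total_reads):
    """Median VAF across loci; loci with zero coverage are skipped."""
    vafs = [a / t for a, t in zip(alt_reads, total_reads) if t > 0]
    if not vafs:
        raise ValueError("no loci with coverage")
    return median(vafs)


# Hypothetical read counts at five cancer-specific homozygous loci.
alt = [97, 99, 96, 98, 100]
tot = [100, 100, 100, 100, 100]
purity = estimate_purity(alt, tot)
print(f"estimated cancer DNA fraction: {purity:.2f}")
```

Using the median rather than the mean keeps a few mis-mapped loci from dragging the estimate down, which is one plausible reason for the choice stated in the caption.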

**Extended Data Fig. 2. Loss of heterozygosity regions have sub-lineage-specific founder variants.**
**(a)** Comparative sizes of the assembled genome and the fractions called as LOH in the PEI (red) and USA (blue) sub-lineages. **(b)** SNV density of sub-lineage-specific founder variants (variants found in a healthy clam and all individuals of one sub-lineage but none in the other sub-lineage) across the genome and LOH regions called in the other sub-lineage. Density is 36× greater for PEI mutations in USA LOH regions versus non-LOH regions and 20× greater for USA mutations in PEI LOH regions versus non-LOH regions. LOH regions were ignored for somatic mutation analysis to reduce the influence of remaining founder variants in sub-lineage-specific SNVs, which should otherwise consist of somatic mutations. **(c)** We used various thresholds of stringency to call LOH across the genomes of each sub-lineage based on the number of shared SNVs that were homozygous in one sub-lineage but heterozygous in the other across a window of 50 SNVs (x-axis). After calling LOH, we calculated the fraction of likely somatic mutations attributed to signature S in LOH (squares) and non-LOH (circles) (y-axis). Values are shown separately for the BTN subgroups from USA (blue) and PEI (red). Vertical dashed line indicates the threshold used for LOH-calling. Horizontal dashed lines indicate baseline signature S fractions without LOH region removal. **(d)** Plot of the difference between non-LOH and LOH regions as shown in (c) (calculated by subtracting the square from the circle). Black line shows the average difference, which peaks around the threshold used (10). **(e)** Proportion of the genome that is called LOH for each sub-lineage based on calling threshold. Dashed lines indicate the fraction of the genome called as LOH for each sub-lineage for the final threshold used.

**Extended Data Fig. 3. Raw mutational spectra and de novo extracted mutational signatures.**
**(a)** Plots show the mutational probability of SNVs in all trinucleotide contexts that were identified in various samples after filtering. Trinucleotide order is the same as shown in Fig. 2. Healthy clam SNVs (black labels – top) refer to SNVs that were unique to that clam and not found in other clams, resulting in no overlap of SNVs but still very similar spectra. SNVs found in all BTN samples (gray labels – upper middle) are divided into those found in a healthy clam (likely all from the founder clam genome) and those not found in any of the three healthy clams (includes a mixture of founder and early somatic mutations). Likely somatic SNVs found within the USA (blue labels) and PEI (red labels) sub-lineages show those SNVs that are either shared between all samples (Fig. 2a – not shown here), multiple samples (lower middle), or unique to individual samples (bottom). All mutational probabilities are corrected for mutational opportunities in the clam genome, and total mutation counts in each image are shown in the label. **(b)** We performed de novo mutational signature extraction to identify trinucleotide SNV differences between the various samples in this study, yielding four mutational signatures with mutational probabilities corrected for mutational opportunities in the clam genome. Error bars display 95% confidence intervals as determined by the extraction software, sigfit. Signatures sig1’, sig5’ and sig40’ are named after the closest signature in the COSMIC database, as determined by cosine similarity. SigS was named to reflect that it was specific to Somatic mutations in cancer samples.

**Extended Data Fig. 4. Signature fractions across sample groupings.**
Plots showing the fraction of genomic SNVs attributed to **(a)** signature S, **(b)** signature 1’, **(c)** signature 5’, and **(d)** signature 40’ across healthy and cancer samples, divided and filtered as described in Extended Data Fig. 3 and Methods, and diagrammed in Extended Data Fig. 10. ‘All healthy clams’ refers to SNVs found in all 3 healthy clams in our data set, but not in the reference genome. **(e)** Fraction of mutations attributed to signature 1’ across the whole genome (triangles, same data as shown in (b)) is shown compared to the fraction of signature 1’ in coding regions alone (CDS, circles). Note that trinucleotide contexts of mutational opportunities are different in coding regions versus the full genome, which was factored into the signature fitting process. Points indicate fitting estimate, while error bars display 95% confidence intervals of mutation fractions from fitting error of SNVs to the four mutational signatures. Number of total mutations for each SNV set can be found in Extended Data Fig. 3.

**Extended Data Fig. 5. Mutations versus sampling date.**
**(a)** Mutations attributed to each mutational signature versus sampling date for MarBTN samples. SNVs found in healthy clams, all BTN samples, or LOH regions are excluded prior to analysis to remove founder variants. Results from linear regression of USA samples (n = 5) are shown above each plot, including R squared, p value, mutation rate estimate and the corresponding x-intercept (indicating the date the two sub-lineages diverged from one another). PEI samples (n = 3) are included on plots to compare relative mutation counts attributed to each signature but are not included in the linear regression. It is apparent that sig1’ mutation counts are higher in PEI, while sig5’ and sig40’ mutations are higher in USA. SigS mutations in PEI line up well with the USA sample regression, indicating that the sigS mutation rate has stayed stable since the sub-lineages diverged. Points indicate fitting estimate, while error bars indicate 95% confidence interval from signature fitting error. **(b)** Number of translocations and tandem duplications since the divergence of the sub-lineages, copies of the mitochondrial D-loop, and total Steamer insertions per sample, each plotted against sampling date. Linear regression (blue line) and 95% confidence interval (gray) were calculated for the USA samples (n = 5). No regression was statistically significant. No PEI samples (n = 3) fell within 95% confidence intervals of regression lines, indicating the higher mutation counts in USA samples cannot be explained by the later sampling of USA samples.
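The dating logic in panel (a) can be sketched as an ordinary least-squares fit of signature-attributed mutation counts against sampling date, where the slope is the mutation rate and the x-intercept is the date at which the count extrapolates to zero (the inferred divergence date). This is an illustrative sketch only: the helper names `fit_line` and `x_intercept` are hypothetical, and the data points are invented to lie exactly on a line. They are not values from the study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def x_intercept(slope, intercept):
    """Date at which the fitted mutation count reaches zero."""
    return -intercept / slope


# Invented example: five sampling years and mutations per Mb that follow
# y = 0.5 * (year - 1900) exactly. These numbers are NOT from the paper.
years = [2010.0, 2012.0, 2014.0, 2018.0, 2021.0]
muts_per_mb = [0.5 * (y - 1900.0) for y in years]

rate, intercept = fit_line(years, muts_per_mb)
divergence_year = x_intercept(rate, intercept)
print(f"rate = {rate:.3f} mutations/Mb/year, divergence ~ {divergence_year:.0f}")
```

With only five closely spaced sampling dates, the x-intercept is a long extrapolation, so the confidence interval on any real estimate would be wide. The sketch does not compute that interval.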

**Extended Data Fig. 6. Copy number and structural alteration characterization.**
**(a)** We called copy number across the genome in 100-kB chunks for each sample individually. Here we plot pairwise comparisons of the copy number call for each 100-kB chunk between two representative PEI BTN samples (DN08 and DF488) and two representative USA BTN samples (FFM19G1 and NYTC-C9: notably, the two most distantly related USA samples). There is a close correlation (R² > 0.94) within sub-lineages (DN08 vs DF488, FFM19G1 vs NYTC-C9) and a weaker correlation (R² = 0.53–0.56) when comparing between sub-lineages (DN08 or DF488 vs FFM19G1 or NYTC-C9). Copy number differences between samples can be seen here as denser groupings of points around integer values that deviate from equal values along the diagonal. Variant allele frequencies of all high confidence somatic mutations were calculated separately for BTN from **(b)** USA and **(c)** PEI. Violin plots show probability densities of allele frequencies of high confidence somatic mutations, divided into portions of the genome called at each copy number. The peak allele frequency in each case is distributed around the expected value of 1/copy number. In addition to the main, expected peaks for each copy number, in some cases, additional peaks can be seen that indicate somatic mutations prior to copy number gain (for example, VAF of 0.5 in regions with CN4 that could be due to mutation followed by duplication of the region). Some minor peaks also indicate possible errors in copy number calling or allele frequency counting (e.g., VAF of 0.5 in CN3 regions). These errors could be due to lower read mapping in polymorphic regions, errors caused by repeat regions, or regions spanning a CN breakpoint, among other possibilities. **(d)** Distribution of variant allele frequencies for founder germline variants (found in all cancers and at least one healthy sample) in USA (blue) and PEI (red) sub-lineages, restricted to regions that are CN4 in both sub-lineages.
**(e)** A random subset of 100,000 germline variants plotted as a scatter plot. Alleles at 1/4 and 3/4 in the USA sub-lineage are incongruent with a simple CN2 > CN4 duplication. **(f)** Distribution of variant allele frequencies for high confidence somatic mutations, restricted to regions that are CN4 in both sub-lineages, showing a higher proportion of 2/4 mutations (pre-duplication SNVs) in PEI than USA. **(g)** The genome was subdivided into 100-kB segments (as done for copy number analysis), and for all shared CN4 segments the plot shows the fraction of mutations in each 100-kB segment that were at 2/4 frequency compared to the total amount of 2/4 and 1/4 SNVs, corresponding to mutations occurring before or after duplication of the allele, respectively. The USA distribution peaks at 0, indicating that most 100-kB segments duplicated before or shortly after the USA-PEI sub-lineage split, with a low rate of duplications occurring after that time. The distribution for PEI centers around 0.2, indicating that one-fifth of mutations occurred between the USA-PEI sub-lineage split and duplication of the corresponding regions, suggesting a burst of duplications at some point in the PEI sub-lineage. **(h)** Number of called SVs of each type that are unique to each sub-lineage were calculated by removing SVs found in any healthy clams or in any BTN samples from the other sub-lineage. Dots represent individual samples, bars summarize averages for each group, and error bars indicate standard deviation. P-values are from two-sided unpaired unequal variance t-test between PEI BTN samples (n = 3) and USA BTN samples (n = 5). Exact values are 1.8e-3, 6.6e-1, 1.0e-5, 1.9e-2, and 3.6e-1 respectively. Labels follow DELLY abbreviations of SV types: BND = translocations, DEL = deletions, DUP = tandem duplications, INS = small insertions, INV = inversions. Deletion counts were much higher than other SV types, so were divided by 10 in (h) for visualization (‘DEL/10’).
**(i)** Size distribution of tandem duplications in each sample, after removing SVs found in any healthy clams or in any BTN samples from the other sub-lineage. Dashed line indicates 11 kB.
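The per-segment statistic in panel (g) can be sketched as follows, assuming that in a CN4 region a somatic SNV near VAF 2/4 arose before the duplication and one near 1/4 arose after it. The function names and the ±0.1 tolerance window are assumptions made for illustration; the caption does not state how VAFs were binned.

```python
def classify_cn4_vaf(vaf, tolerance=0.1):
    """Label a CN4 somatic SNV 'pre' (VAF near 2/4) or 'post' (near 1/4)."""
    if abs(vaf - 0.5) <= tolerance:
        return "pre"
    if abs(vaf - 0.25) <= tolerance:
        return "post"
    return None  # VAF fits neither expectation; left out of the ratio


def pre_duplication_fraction(vafs, tolerance=0.1):
    """Fraction of 2/4 SNVs among all 2/4 and 1/4 SNVs in one segment."""
    labels = [classify_cn4_vaf(v, tolerance) for v in vafs]
    pre = labels.count("pre")
    post = labels.count("post")
    if pre + post == 0:
        return None
    return pre / (pre + post)


# Hypothetical VAFs for somatic SNVs in a single 100-kB CN4 segment.
segment_vafs = [0.24, 0.26, 0.51, 0.23, 0.27, 0.90]
fraction = pre_duplication_fraction(segment_vafs)
print(f"pre-duplication fraction: {fraction:.2f}")
```

A fraction near 0 means the segment duplicated before most somatic mutations accumulated, while a fraction near 0.2 corresponds to the PEI pattern described in the caption.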

**Extended Data Fig. 7. Mitochondrial mutations in MarBTN.**
**(a)** Neighbor joining tree built from variants called in all samples (170 SNVs) against the previously published *M. arenaria* reference mitogenome (excluding the repeated region). Bootstrap values in support of each clade are included on the preceding branch (bootstraps under 50 are not shown). The phylogenetic relationship generally reflects that built from genomic SNVs (that is, monophyletic MarBTN group with separate USA and PEI sub-lineages). The phylogeny within the USA sub-lineage deviates from that built from the nuclear genome, but only three SNVs are variable within the USA sub-lineage: one SNV unique to NYTC-C9 and two SNVs unique to MELC-A11. This causes the other samples to cluster more often with NYTC-C9 due to only one difference (versus two for MELC-A11), but this relationship is still compatible with the USA branch structure from the nuclear phylogeny. **(b)** Observed SNVs (black) compared with expected counts estimated from nucleotide frequencies of the *M. arenaria* mitogenome and assuming equal mutation probability. This calculation was not collapsed to the usual 6 mutation types due to the imbalance of nucleotides in mitochondrial genomes (unequal frequencies of G/C and A/T). Likely somatic refers to SNVs found in a subset of BTN samples, while All USA and All PEI refer to SNVs found in all individuals from that sub-lineage, but not the other sub-lineage. **(c)** dN/dS ratios, where a ratio of 1 indicates neutrality, were calculated for mitochondrial SNVs found in healthy clams (n = 39), all BTN samples but not healthy clams (n = 13), and likely somatic mutations (n = 50). Error bars indicate 95% confidence intervals as estimated by dndscv and are quite large, due to the low number of mitochondrial SNVs. **(d)** Read depth across the mitochondrial genome for healthy clams (black), PEI MarBTN (red) and USA MarBTN (blue), normalized to mean depth outside the D-loop.
Bars above indicate the D-loop region (12,164–12,870 bp, black) and the region used to estimate duplicated region copy number (12,300–12,500 bp, gray), as shown in Fig. 3f. **(e)** Schematic (not to scale) of the *M. arenaria* control region in the previously published mitogenome with a single D-loop copy (top) versus the proposed mitochondrial genome with three D-loop copies and G-rich insertions (middle), with accompanying PCR results (bottom). Primer pair combinations are listed along the top of the gel and expected sizes are listed along the bottom; molecular weights are in bp. Amplicon sizes from primers spanning the D-loop (67 with 62/71) support a single copy of the D-loop. However, we suspect this is a result of recombination and selection for the smaller product and loss of the G-rich insertions. Inverse PCR with outward-facing primers (65 with 72/72) indicates a tandem duplication allowing outward-facing primers to amplify. The inverse primer pair spanning the G-rich insertion (65 with 72) has a dim band at the expected size, but two brighter bands at smaller sizes. PCR was run once.

**Extended Data Fig. 8. Transposable element activity in MarBTN.**
**(a)** We conducted a BLASTP search for the 729 cancer-associated genes in the COSMIC database and found hits in 5,430 of the 38,609 predicted M. arenaria genes (14%). If there is no selection for insertion near these genes, we would expect 14% of Steamer insertions associated with a M. arenaria gene to intersect with these genes. We counted the number of Steamer insertions in genes (‘gene’) and in the 2 kB upstream of genes (‘upstream’) for early Steamer insertions in the lineage trunk (‘all MarBTN’) and after the divergence of the sub-lineages (‘USA or PEI’). We plotted these counts (black) against those expected by chance (gray). Counts match expected closely for late insertions (in only the USA or PEI sub-lineage – right side of plot), either upstream of genes or within them, but were higher than expected for early insertions. We further divided upstream insertions by whether the Steamer insertion was in the same strand/direction as the gene or opposite, to compare with counts regardless of directionality (‘both’). The early insertion bias to insert upstream of COSMIC genes can be fully explained by a bias to insert in the opposite strand (yellow star), here with 9/23 (39%) of the genes being cancer associated (would expect 3/23: Chi-squared test, Bonferroni-corrected p value = 0.004). **(b)** Volcano plots showing estimated copy number of each TE, comparing copy number from PEI MarBTN with USA MarBTN for all TE types (left), LTR elements (middle), and DNA transposons (right), compared by two-sided unequal variance t-test. TEs more highly amplified in PEI MarBTN are to the right and TEs amplified more highly in USA MarBTN are to the left. Dashed lines correspond to significance threshold (p = 0.05, Bonferroni-corrected) and 5-fold differences. DNA transposons are labeled in blue and Steamer is labeled in green.
Eight LTR retrotransposons and five DNA transposons are significantly amplified in the USA sub-lineage compared to the PEI sub-lineage, while no identified LTR retrotransposons and a single DNA transposon are significantly amplified in the PEI sub-lineage compared to the USA sub-lineage. **(c)** Left histogram showing the distance to nearest gene for Steamer insertions found in any cancer sample (n = 550). If an insertion was within an annotated gene, the distance to the next nearest insertion was used. 0 (vertical red line) corresponds to the first or last nucleotide of the annotated gene for when the insertion is upstream (negative) or downstream (positive) relative to the gene, respectively. Horizontal red segment highlights the 2 kB upstream of genes with elevated Steamer insertions. Right histogram shows a distribution of randomly generated insertion sites (n = 224,134) based on the observed read mapping in the genome, assuming insertions are random.
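The expectation in panel (a) can be reproduced arithmetically from the numbers in the caption: 5,430 of 38,609 genes gives the 14% baseline, and 14% of 23 insertions gives roughly 3 expected hits against 9 observed. The sketch below runs a two-bin chi-squared goodness-of-fit test as one plausible form of the test. The caption does not specify the exact test layout or the number of comparisons used for the Bonferroni correction, so the uncorrected p value computed here is not expected to equal the reported 0.004. The function name is hypothetical.

```python
import math


def chi2_gof_two_bins(observed_hits, total, expected_fraction):
    """Chi-squared goodness-of-fit for hit/miss counts (1 degree of freedom)."""
    expected_hits = total * expected_fraction
    expected_misses = total - expected_hits
    observed_misses = total - observed_hits
    chi2 = ((observed_hits - expected_hits) ** 2 / expected_hits
            + (observed_misses - expected_misses) ** 2 / expected_misses)
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Numbers taken from the caption.
expected_fraction = 5430 / 38609      # share of genes with a COSMIC hit
expected_hits = 23 * expected_fraction  # expected cancer-associated genes
chi2, p_uncorrected = chi2_gof_two_bins(9, 23, expected_fraction)

print(f"expected fraction: {expected_fraction:.3f}")
print(f"expected hits out of 23: {expected_hits:.1f}")
print(f"chi2 = {chi2:.2f}, uncorrected p = {p_uncorrected:.1e}")
```

Multiplying the uncorrected p value by the number of tests performed would give the Bonferroni-corrected value. That number is left out here because the caption does not state it.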

**Extended Data Fig. 9. Differential expression results.**
**(a)** Hierarchical clustering of all RNA sequenced samples by the expression of the top 100 most significant genes expressed in each specific healthy tissue relative to all other tissues, with heatmap of normalized relative gene expression for each gene. MarBTN (BTN) clusters most closely with hemocytes (heme), supporting principal-component analysis results. **(b)** Volcano plot of polymerase gene expression (n = 28) for MarBTN (n = 5) compared with non-hemocyte tissues (n = 15: 5 tissues for 3 clams). **(c)** Normalized expression, in reads per gene, of four genes with detectable positive dN/dS for MarBTN (n = 5), hemocytes (n = 5), and non-hemocyte tissues (n = 15: 5 tissues for 3 clams). Bars display mean, error bars display standard deviation, and differential expression comparison results are displayed as * = p<0.01, ** = p<1e-7, ns = not significant. Exact p-values are 9.6e-1, 3.7e-2, 1.8e-1, 8.1e-3, 7.4e-1, 1.5e-8, 6.0e-3 and 3.1e-1 respectively.

**Extended Data Fig. 10. SNV binning strategy for analysis.**
Flowchart of our strategy to separate SNVs into bins for de novo signature extraction, based on which sample(s) each SNV was called in. Many of these bins were also used in other analyses, as indicated in the manuscript. The starting point refers to a vcf file of every SNV that was called in at least one of the eleven samples (three healthy, eight cancer) sequenced in this study. Bins highlighted in yellow indicate non-overlapping SNV bins used for signature extraction.
