Nature. 2018 Feb 1;554(7690):56-61.
doi: 10.1038/nature25473. Epub 2018 Jan 24.

The genome of Schmidtea mediterranea and the evolution of core cellular mechanisms

Markus Alexander Grohme et al. Nature.

Abstract

The planarian Schmidtea mediterranea is an important model for stem cell research and regeneration, but adequate genome resources for this species have been lacking. Here we report a highly contiguous genome assembly of S. mediterranea, using long-read sequencing and a de novo assembler (MARVEL) enhanced for low-complexity reads. The S. mediterranea genome is highly polymorphic and repetitive, and harbours a novel class of giant retroelements. Furthermore, the genome assembly lacks a number of highly conserved genes, including critical components of the mitotic spindle assembly checkpoint, but planarians maintain checkpoint function. Our genome assembly provides a key model system resource that will be useful for studying regeneration and the evolutionary plasticity of core cell biological mechanisms.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Extended Data Figure 1. Smed sequencing and assembly quality control.
a) Smed genomic DNA preparations: The established protocol (top) yields a black solution due to co-purification of porphyrin pigments. Bottom: improved protocol, which removes contaminants including the pigment and therefore results in clear preparations. b) The improved protocol consistently yields HMW DNA, as shown by the pulsed-field gel electrophoresis of two independent preparations (lanes 3, 4) and DNA size markers in lanes 1 and 2. c) Overview of all PacBio sequencing runs for the Smed assembly. d) Sequencing statistics of a representative PacBio RS II SMRT cell (P6/C4 chemistry). Total output: 1,053.4 Mbp, Reads of insert: 976.4 Mbp, maximal read length: 52,441 bp. e) Connectivity matrix plot illustrating Chicago library read-pair distances after HiRise scaffolding. Colour coding identifies individual contigs contributing to the scaffold dd_Smed_g4_1. f) Mapping characteristics of Smed transcriptomes against the genome assembly with > 60% query coverage and > 60% sequence identity as cut-off criteria. Left: the dd_Smes_v1.PCFL transcriptome of the sequenced strain. Right: dd_Smed_v6.PCFL transcriptome of the asexual strain. The pie charts visualize the absolute number and relative proportions of transcripts mapping with the indicated characteristics. g) Further analysis of the 538 non-mapping Smes transcripts from f) (see Supplementary Information 7). Missing gene: Transcripts that map uniquely to the SmedSxl v4.0 assembly and have annotated orthologues in at least 5 other planarian species in PlanMine. Putative contaminant: Top RefSeq BLAST hit in a likely contaminant species. Unknown: All remaining transcripts. The fact that only 46 out of 31,966 Smes transcripts are classified as genuinely missing indicates that the Smed assembly is largely complete. In contrast, 1,229 transcripts that uniquely mapped to the Smed genome and had orthologues in at least 5 other planarian species failed to map to the previously published SmedSxl v4.0 assembly. Substantial gaps in the previous assembly also mean that the number of missing genes in the Smed assembly may be slightly higher, as some may have been classified as “unknown”.
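The cut-off logic described for the transcript mapping panel can be illustrated with a minimal sketch. It assumes each alignment is reduced to a (transcript, query coverage, identity) record and that transcripts are binned as unique, multi-mapping or non-mapping. The record layout, function name and example values are hypothetical and are not taken from the authors' pipeline.

```python
from collections import defaultdict

def classify_transcripts(transcript_ids, alignments, min_cov=60.0, min_ident=60.0):
    """Label each transcript as 'unique', 'multi' or 'non-mapping'.

    alignments: iterable of (transcript_id, query_coverage_pct, identity_pct).
    This record layout is an assumption made for illustration only.
    """
    passing = defaultdict(int)
    for tid, cov, ident in alignments:
        # Strict '>' mirrors the "> 60%" wording of the caption.
        if cov > min_cov and ident > min_ident:
            passing[tid] += 1
    labels = {}
    for tid in transcript_ids:
        n = passing[tid]
        if n == 0:
            labels[tid] = "non-mapping"
        elif n == 1:
            labels[tid] = "unique"
        else:
            labels[tid] = "multi"
    return labels

# Toy input: four hypothetical transcripts and their alignments.
example_ids = ["t1", "t2", "t3", "t4"]
example_alignments = [
    ("t1", 95.0, 99.0),  # one passing alignment
    ("t2", 80.0, 97.0),  # two passing alignments
    ("t2", 75.0, 96.0),
    ("t3", 40.0, 99.0),  # fails the coverage cut-off
    ("t4", 60.0, 99.0),  # exactly 60% fails the strict cut-off
]
labels = classify_transcripts(example_ids, example_alignments)
```

Raising both thresholds to 90 gives the stricter setting that Extended Data Figure 2 describes for the HC-cDNA back-mapping.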
Extended Data Figure 2. Assembly validation by high stringency transcript back-mapping.
a) Quality control of the Smed assembly by means of high stringency back mapping of 1,509 high confidence (HC) cDNAs. HC-cDNAs were defined as having BLAST hits with > 90% query and subject coverage in 7 other planarian transcriptomes in PlanMine. HC-cDNAs were mapped to the Smed assembly using > 90% query coverage and sequence identity as cut-off criteria. The pie chart visualizes the absolute number and relative proportions of HC-cDNAs mapping with the indicated characteristics. b) Further analysis of the 10 HC-cDNAs classified as non-mapping from a) by intersection with the mapping results of Extended Data Fig. 1g. These 2 were designated as “false positive”, since both mapped to the Smed genome with > 90% query coverage and sequence identity using BLAT. c) UCSC genome browser screenshot (75 kbp window) of the genomic mapping location of one of the two “unknown” HC-cDNAs as a single example of a mapping failure due to an actual assembly error. The example documents inversion of the 5’-end of the cDNA within a low confidence stretch at a contig end (lack of coverage in the Quiver track). The inversion is supported by i) inverted RNAseq read mapping and ii) inversion of the cDNA sequence shown in the respective tracks. Below: Color-coded Miropeats similarity plots of respective regions. d), e) Examples of genomic mapping loci of HC-cDNA transcripts out of the multi-mapping category in a), browser screenshots as described in c). d) Example of a likely legitimate (biological) gene duplication in a gap-free high confidence region. e) Micro tandem duplication surrounding a scaffolding gap in a repeat rich region. f) Multi-mapping HC-cDNAs map preferentially to contig ends. The histogram graphs the distance of the closest gap or contig end for the 67 multi-mappers and a corresponding number of unique mappers from a). g) Estimated size of the duplicated regions of multi-mapping HC-cDNAs. Jointly, this analysis identifies a small fraction of small-scale duplications at assembly gaps in the Smed assembly, which can be easily identified with the help of the various quality control tracks in the PlanMine genome browser.
Extended Data Figure 3. Repeats in the Smed assembly.
a) Abundance estimation of solo and full-length LTR elements in the Smed assembly. Elements SLF-8 and SLF14 show a large number of solo-LTRs compared to full-length copies, indicating a large number of excision events by homologous recombination. Of the Burro elements, Burro-1 was the most abundant with 124 full-length copies, followed by Burro-3 and Burro-2 with 25 and 23 full-length copies, respectively. b) Length comparison of indicated repeat consensi classes in H. sapiens, D. melanogaster and C. elegans. For Smed, we used a custom library generated in this study. Dark colours indicate predominant lengths of specific repeat classes. Red: repeat consensi of more than 15 kbp in length. c) Expression analysis of gypsy LTR elements in Smed RNAseq data using TETranscripts. The 3 most transcriptionally active elements were Burro-1, Burro-2 and SLF-8. d) LTR expression analysis by whole mount in situ hybridization and single cell expression data. Top: SLF-9 derived transcript. Bottom: Burro-1 derived transcript. Both are broadly transcribed in many Smed cell types (CIW4 strain, n=1 biological replicate, 10 animals). Scale bar: 250 µm. e) Kimura distance plot of Smed LTR elements. Substitution levels varied by element, but also within element groups. Burro-1/2/3 and SLF-8 all contain elements spread over a large range of substitution levels, possibly indicative of continued activity over large time scales. The remaining elements are characterised by more defined peaks in expansion, with the highest average divergences being seen in the smallest elements characterized (SLF-10/11/12), making these amongst the oldest within the genome. Interestingly, both SLF-8 and SLF-9 have representative elements with particularly low substitution rates, potentially indicating a recent or ongoing expansion.
Extended Data Figure 4. AT-rich microsatellites in the Smed genome.
a) Features of AT-rich microsatellites. Left: Inter-repeat spacing of repeats > 99 bp in length. Right: Repeat length. AT-rich microsatellites with an average length of 120 bp occur every ~3,500 bp. b) Genomic distribution of repeats > 99 bp in length. c) Increased probability of read alignment termination within microsatellite repeats. Individual size bins were analyzed separately for microsatellite repeats (red) or non-repetitive regions (cyan). Although accounting for only 4.2% of the assembly size, microsatellite repeats significantly limit assembly contiguity due to an increased probability of read alignment loss. d) Genome-wide coverage ratios of insertion/deletion sequences > 99 bp and excluding AT-repeats. e) Read length variation analysis across AT-rich repeat regions (AAT) in regular PacBio sequencing data compared to Circular Consensus Sequencing (CCS) coverage of the same region. CCS reads sample the same genomic region multiple times. The lack of a clear difference in the length variation of specific AT-repeats (AAT) between repetitive sequencing of the same DNA molecule (CCS data set) versus sequencing reads representing different DNA molecules (regular PacBio data) indicates that repeat length variations are mainly technical in nature. Rather than repeat length polymorphisms, the most likely cause of the detrimental effect of the repeats is the increased ambiguity in low complexity sequence alignments (Supplementary Information S11.4). Unique (UQ) regions were included as controls. (Green) CCS_UQ: CCS subread length variation versus the consensus length of all subreads in binned unique regions (n = 3300). (Red) CCS_AT: CCS subread length variation versus the consensus length of all subreads in binned AT-repeat regions (n = 4825). (Blue) P6_UQ: Length variation of individual reads in the regular PacBio sequencing data (P6/C4) versus the consensus length of the region in the Smed assembly in binned unique regions (n = 3310). (Black) P6_AT: Length variation of individual reads in the regular PacBio sequencing data (P6/C4) versus the consensus length of the region in the Smed assembly in binned AT-repeat regions (n = 5085). Dots: outliers, horizontal line in the middle of the box: 2nd quartile (median), box ranges: from 1st quartile to 3rd quartile, whiskers: interquartile range (IQR, midspread): 75th and 25th percentile.
Extended Data Figure 5. Comparative genomics.
a) Table listing contig and scaffold N50 statistics of the genomes used for the comparative genome alignments in Fig. 3b. The table reveals that the basal vertebrate lamprey genome assembly is more fragmented (similar or lower N50 values) than most of the platyhelminth genomes. Nevertheless, the human to lamprey genome alignment has equivalent or even higher alignment chain scores and spans, indicating that the true extent of sequence divergence and loss of conserved gene order in platyhelminths is likely an underestimate. b) Example of a top-scoring alignment chain. The UCSC genome browser screenshot of the Smed genome shows that alignments predominantly overlap exons of the two transcripts shown at the top. This example is one of the few cases of apparent gene order conservation between Smed and S. mansoni. Blocks in the alignment chains represent local alignments, connecting single lines represent deletions in the query genome and double lines represent regions with sequence in both Smed and the query genome that do not align. c) Comparative loss analysis of highly conserved genes across the 26 indicated species. Red: Conserved gene fraction, defined as the proportion of orthogroups containing at least 9 out of the 14 non-flatworm species and the query species. Blue: Lost fraction of highly conserved genes, defined as the proportion of orthogroups containing at least 9 out of the 14 non-flatworm species, but not the query species (see Supplementary Information S17). Absolute numbers of highly conserved genes are shown on top, with slight fluctuations caused by species-specific sequence duplications.
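The conserved and lost fractions defined for panel c) can be made concrete with a minimal sketch. It assumes that each orthogroup is available as a set of species names and that both fractions share one denominator, namely the orthogroups present in at least 9 of the 14 non-flatworm species. That denominator, the function name and the toy species labels are assumptions made for illustration; the authors' actual procedure is described in their Supplementary Information S17.

```python
def conservation_fractions(orthogroups, reference_species, query, min_refs=9):
    """Return (n_highly_conserved, conserved_fraction, lost_fraction).

    orthogroups: iterable of sets of species names (hypothetical layout).
    An orthogroup counts as highly conserved when it contains at least
    min_refs of the reference (non-flatworm) species.
    """
    refs = set(reference_species)
    conserved = lost = 0
    for members in orthogroups:
        if len(refs & set(members)) < min_refs:
            continue  # not highly conserved, ignored for both fractions
        if query in members:
            conserved += 1
        else:
            lost += 1
    total = conserved + lost
    if total == 0:
        return 0, 0.0, 0.0
    return total, conserved / total, lost / total

# Toy data: 14 hypothetical reference species r1..r14 and one query species.
refs = [f"r{i}" for i in range(1, 15)]
toy_orthogroups = [
    set(refs[:10]) | {"query"},  # 10 references + query -> conserved
    set(refs[:9]),               # 9 references, no query -> lost
    set(refs[:8]) | {"query"},   # only 8 references -> ignored
    set(refs) | {"query"},       # all references + query -> conserved
]
total, conserved_frac, lost_frac = conservation_fractions(toy_orthogroups, refs, "query")
```

In this toy case three orthogroups qualify as highly conserved, two of which contain the query species.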
Extended Data Figure 6. Planarian-specific genes.
a) Conservation of 1,165 flatworm-specific genes (Supplementary Information S16.1) amongst flatworm species. Only 61 sequences had sequence homologues in the indicated flatworm species (Other = T. solium, E. multilocularis, E. granulosus, H. microstoma), indicating that this gene set mostly represents planarian-specific genes. b) and c) characteristics of planarian-specific genes. b) Distribution of exon numbers compared to a control gene set (HC-cDNAs; Extended Data Fig. 2a), indicating an enrichment of single exon genes. c) Number of predicted domains (InterProScan), indicating that only a minority contains predicted domains. d) Identity of detected domains (Pfam and SUPERFAMILY). “unintegrated signatures” designates recurring sequence motifs that are not grouped into InterPro entries. These might represent so far un-curated or weakly supported motifs that do not pass InterPro's integration standards. e) Differential expression of 626 planarian-specific genes in published Smed RNAseq data sets of different regeneration phases (left), stem cells or progeny populations (middle) or specific developmental stages (right). Red lines indicate differential expression relative to the control of each series (white = no change). Genes were ordered using rank by sum. The high proportion of differential expression indicates the widespread contribution of lineage-specific genes to planarian biology. f) and g) Specific examples of non-conserved genes. Top: SMART domain representation. Bottom: Differential expression under the indicated conditions.
Extended Data Figure 7. Sequence conservation of Mad1 protein in non-planarian flatworms.
a) COBALT multiple protein sequence alignment of the Mad1 homologues of the indicated species (including all the non-planarian flatworm species of Fig. 3c). b) Heatmap of BLOSUM62 sequence similarity matrix generated from alignment in a), demonstrating significant sequence conservation of Mad1 homologues even in flatworms.
Extended Data Figure 8. Sequence conservation of Mad2 protein in non-planarian flatworms.
a) COBALT multiple protein sequence alignment of the Mad2 homologues of the indicated species (including all the non-planarian flatworm species of Fig. 3c). b) Heatmap of BLOSUM62 sequence similarity matrix generated from alignment in a), demonstrating significant sequence conservation of Mad2 homologues even in flatworms.
Extended Data Figure 9. Effect of cdc20(RNAi) and SAC components on the planarian stem cell compartment.
a) Fluorescent whole mount in situ hybridization of the planarian head region. Stem cells (neoblasts) were visualized by a smedwi-1 probe (red), early+late progeny by pooled prog-1 and agat-1 probes (green). Nuclear counterstaining by DAPI (blue). Top: RNAi control against egfp, Bottom: cdc20(RNAi), which results in a dramatically decreased number of smedwi-1 and prog-1/agat-1 positive cells after 3 rounds of RNAi feeding. This indicates the loss of neoblasts and a concomitant reduction in progenitor numbers (n=1 biological replicate, 10 animals). Scale bar: 200 µm. b) Effect of indicated RNAi treatments on planarian stem cell abundance. Representative images of cell macerates, stained with DAPI (nuclei, blue), anti-H3ser10P (mitotic cells, magenta) and smedwi-1 in situ hybridization (stem cells, yellow). Numbers indicate the mean fraction ± s.d. of smedwi-1 positive cells of total cells quantified by nuclear counting using DAPI (n=1, 10 pooled animals, 5 technical replicates with 5 images each). Scale bar: 50 µm.
Figure 1. Long-range contiguous genome assembly of S. mediterranea (Smed).
a) Individual of the sequenced sexual strain. Left: Egg cocoons. Right: Karyotype (2N = 8). Scale bars: 2 mm and 2.5 µm. b) Chicago quality control of the assembly. c) Treemap comparison between the MARVEL Smed assembly and the most contiguous existing Smed Sanger assembly. Squares encode the relative contribution of individual scaffolds/contigs to assembly size.
Figure 2. Smed assembly challenges.
a) Repeat content of the assembly. b) Long Terminal Repeat (LTR) family phylogeny. Known LTR families are shown in colour, Smed LTR families in black. Red arcs delimit clusters for consensus calculation. Scale bar: 0.2 substitutions/site. c) Domain annotation of the 11 Smed LTR families. SLF: Smed LTR Family. d) Enrichment analysis of indicated repeat elements within the terminal 1,000 bp of all scaffolds (n = 962). “Expected” represents the mean repeat frequency with 95% bootstrap CI (n = 1,000). e) Graphical representation of representative ~1.6 Mbp and ~1.7 Mbp segments of the Smed (left) and D. melanogaster (right) MARVEL PacBio assembly graphs. Thick lines: Consensus sequence; thin lines: individual read alignments; Colour-coding: alignment quality (blue: low, red: high); black marks: repeats. The contig tour of the final haploid genome assembly is shown offset to the right, alternative regions are shown in red. f) Dot plot comparison between a representative alternative region and the corresponding main contig. Fwd: Forward match. Rev: Reverse match. Break: insertions/deletions > 99 bp. Break annotations (right) list repeat categories that cover > 60% of the insertion/deletion sequence, “mixed” indicates contributions of multiple repeat classes.
Figure 3. Genome divergence of Smed and other flatworms.
a) Protein sequence divergence amongst 51 single copy genes (Supplementary Table 3). Branch length: substitutions per site, color coding: flatworms (red) and lophotrochozoan outgroups (blue). b) Whole genome alignments of Smed, M. lignano and H. sapiens against the indicated reference genomes. The distribution of the alignment score (top) and alignment span (bottom) of the top 10,000 chains of co-linear alignments is shown as box plots, with boxes indicating the 1st quartile, the median and the 3rd quartile with whiskers extending up to 1.5 times the interquartile distance. Outliers are defined as values more than 1.5 times the interquartile distance beyond the box and are shown as dots. c) Presence (green) or absence (red) of highly conserved genes in the indicated species. The yellow box highlights Smed. *: homologues secondarily identified by manual searches.
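The box plot convention described for panel b) can be sketched as follows, assuming the common Tukey-style rule in which whiskers reach the most extreme data points within 1.5 times the interquartile distance of the box. The function name and the example values are hypothetical, and quartile conventions differ slightly between plotting tools, so the exact numbers in the published figure may have been computed differently.

```python
import statistics

def boxplot_stats(values, whisker=1.5):
    """Quartiles, whisker ends and outliers for a Tukey-style box plot."""
    data = sorted(values)
    # 'inclusive' treats the data as the full population; other tools
    # may use a different quartile convention.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - whisker * iqr
    hi_fence = q3 + whisker * iqr
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lo_fence or v > hi_fence],
    }

# Toy scores with one value far above the rest.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Here the value 100 lies above the upper fence and is reported as an outlier, while the upper whisker stops at 9.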
Figure 4. Spindle assembly checkpoint (SAC) function in likely absence of Mad1:Mad2.
a) Cartoon illustration of SAC core components and function. Black/Red: Components conserved/missing in Smed. KMN network: KNL1, MIS12 complex, NDC80 complex. b) Fractional abundance of mitotic cells under RNAi of the indicated SAC components, with (red) and without (cyan) nocodazole pre-treatment. Values are shown as mean with 95% confidence intervals (n=4 biological replicates, 10 pooled animals, 5 technical replicates with 5-6 images each). cdc20(RNAi) is shown as single replicate due to rapid stem cell loss (Supplementary Information S18, Extended Data Fig. 9a, b). Significance assessment by two-way ANOVA, followed by Dunnett’s post-hoc test (****P < 0.0001; n.s. not significant), excluding cdc20(RNAi).
