Genome Res. 2019 Jun;29(6):1009-1022.

doi: 10.1101/gr.244830.118. Epub 2019 May 23.

Recompleting the Caenorhabditis elegans genome

Jun Yoshimura^#¹, Kazuki Ichikawa^#¹, Massa J Shoura^#², Karen L Artiles^#², Idan Gabdank³, Lamia Wahba², Cheryl L Smith²,³, Mark L Edgley⁴, Ann E Rougvie⁵, Andrew Z Fire²,³, Shinichi Morishita¹, Erich M Schwarz⁶

Affiliations

¹ Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba 277-8583, Japan.
² Department of Pathology, Stanford University, Stanford, California 94305, USA.
³ Department of Genetics, Stanford University, Stanford, California 94305, USA.
⁴ Department of Zoology and Michael Smith Laboratories, University of British Columbia, Vancouver V6T 1Z3, British Columbia, Canada.
⁵ Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, Minnesota 55454, USA.
⁶ Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA.

^# Contributed equally.

PMID: 31123080
PMCID: PMC6581061
DOI: 10.1101/gr.244830.118

Jun Yoshimura et al. Genome Res. 2019 Jun.

Abstract

Caenorhabditis elegans was the first multicellular eukaryotic genome sequenced to apparent completion. Although this assembly employed a standard C. elegans strain (N2), it used sequence data from several laboratories, with DNA propagated in bacteria and yeast. Thus, the N2 assembly has many differences from any C. elegans available today. To provide a more accurate C. elegans genome, we performed long-read assembly of VC2010, a modern strain derived from N2. Our VC2010 assembly has 99.98% identity to N2 but with an additional 1.8 Mb including tandem repeat expansions and genome duplications. For 116 structural discrepancies between N2 and VC2010, 97 structures matching VC2010 (84%) were also found in two outgroup strains, implying deficiencies in N2. Over 98% of N2 genes encoded unchanged products in VC2010; moreover, we predicted ≥53 new genes in VC2010. The recompleted genome of C. elegans should be a valuable resource for genetics, genomics, and systems biology.
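The 84% figure quoted above follows directly from the two counts given in the abstract. A minimal sketch of that check in Python; the variable names are illustrative and do not come from the paper.

```python
# Counts taken from the abstract.
total_discrepancies = 116  # structural discrepancies between N2 and VC2010
matching_vc2010 = 97       # VC2010-like structures also found in both outgroup strains

share = matching_vc2010 / total_discrepancies

print(f"{share:.1%}")            # 83.6%
print(f"{round(share * 100)}%")  # 84%, the value reported in the abstract
```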

Figures

**Figure 1.**
Steps for detecting and filling gaps. (A) Contigs are ordered along the N2 reference assembly. Parts shown as dangling (colored light orange) fail to align and are missing in the N2 reference. (B) At a gap, regions in two Canu contigs (orange) map to proximal loci on the N2 reference; however, the two contigs have dangling end subsequences missing in the reference. In such cases, we estimate gaps between the contigs according to steps illustrated in C–G. (C) A single contig in other assemblies (yellow) fills a gap. (D) A long contig in other assemblies combines multiple contigs separated by more than one gap. (E) More than one contig fills a gap. (F) A single error-corrected read (light blue) fills a gap. (G) A hybrid approach of using multiple contigs and error-corrected reads fills a gap.

**Figure 2.**
Large gaps closed by long Nanopore reads. (A) Contigs of seven genome assemblies are aligned with Chromosome I of the N2 reference (see layouts for all chromosomes in Supplemental Fig. S2). The respective red and blue thick lines show alignments of contigs in the plus and minus strands. The vertical red line shows a large gap that failed to be filled by seven genome assemblies. (B–D) Examples of provisional gap closure using Nanopore data for a region where a long gap was found. (B) A self-dot plot for an initial model in which we ligate the last 30 kb of sequence from a contig just before a gap on Chromosome I (colored red) to 30 kb of sequence from another contig just after that gap. Two black boxes represent long tandem repeat expansions around the gap. (C) A dot plot between a single 92,790-nt Nanopore read (green) that connects the gap and the simple ligation model in B. (D) A self-dot plot of the Nanopore read shows that the two tandem repeats in C were underestimated. In this example, the left tandem repeat (red asterisk) has 1130 copies of a 26-nt unit string (5′-CATTTTTCTAAAATCCGCCGCAATGC-3′). Supplemental Table S4 shows the units of all tandem repeats in five large assembly gaps.
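To give a sense of scale for the repeat in panel D, the sketch below multiplies the unit length by the copy number stated in the caption. It assumes, for illustration only, that all 1130 copies are exact and contiguous; the helper `count_tandem_copies` is hypothetical and is not the method used by the authors.

```python
def count_tandem_copies(seq: str, unit: str, start: int = 0) -> int:
    """Count consecutive exact copies of `unit` in `seq`, beginning at `start`."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    copies = 0
    pos = start
    while seq.startswith(unit, pos):
        copies += 1
        pos += len(unit)
    return copies


UNIT = "CATTTTTCTAAAATCCGCCGCAATGC"  # repeat unit quoted in the caption
COPIES = 1130                        # copy number quoted in the caption

unit_length = len(UNIT)
span_nt = unit_length * COPIES

print(unit_length)  # 26
print(span_nt)      # 29380

# Toy sequence (made up): two flanking bases, three copies, then unrelated bases.
toy = "GG" + UNIT * 3 + "ACGT"
print(count_tandem_copies(toy, UNIT, start=2))  # 3
```

Under that assumption the left repeat alone spans roughly 29.4 kb. Real long reads contain sequencing errors, so an exact-match counter like this one would undercount on raw data.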

**Figure 3.**
New genomic regions in VC2010 assembly. (A) Subdivision of sequence classes causing the 1.8-Mb increase in genome size from N2 assembly to VC2010. Large tandem repeat expansions (of size >1 kb) are predominant, accounting for 85% of the increased VC2010 DNA. Other sequence classes include insertions (>100 nt), duplications (>100 nt), and telomere repeats. Tandem repeats are divided into some with clear repeat units and others (“imperfect”) without them (Supplemental Fig. S7). (B) Phylogenetic tree of N2, VC2010 (PD1074), and outgroup strains CB4856 (PD2182) and MY2 (PD2183). (C) The yellow-colored duplicated region with two copies of a gene in VC2010 is compared with its best matching regions in N2, PD2182, and PD2183. The comparison implies that the duplication was a recent event occurring in the lineage from the original N2 strain to VC2010. Of note, two duplicated regions overlap slightly. (D) Because long reads were unavailable for N2, we compare the regions in VC2010 and PD2183 for which long reads were available, and we show a dot plot between the regions (a similar dot plot between VC2010 and PD2182 is shown in Supplemental Fig. S12). To confirm the correctness of both regions, we align raw PacBio reads collected from VC2010 and PD2183 to their respective genomic regions, and the alignments are shown as blue lines *below* the x-axis and to the *right* of the y-axis. Indeed, a number of alignments span and validate the focal duplicated region and its matching region. (E) A comparison of regions where VC2010, PD2182, and PD2183 coincide, but the green-colored region is missing in the N2 reference assembly, implying that the segment had been lost in culturing animals or clones used for the N2 assembly or in the original N2 assembly process. (F) As in D, aligning raw PacBio reads to both regions in VC2010 and PD2183 shows their validity (a similar dot plot between VC2010 and PD2182 is shown in Supplemental Fig. S10). (G) Frequencies of apparent insertions into VC2010 (missing in N2), deletions from VC2010 (surplus in N2), and genome duplications (in N2 or VC2010), sorted into three categories: 97 assembly errors in the N2 genome, 19 variants that arose in the lineage from N2 to VC2010, and 20 undetermined cases because of inconsistency among the four genomes. We categorized individual large variants by inspecting the dot plots in Supplemental Figures S10–S12 (Supplemental Tables S16–S18). Of the 97 assembly errors, 89 (92%) were regions missing in the N2 reference assembly.
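The counts in panel G can be tallied to see how they relate to one another. A minimal sketch in Python; the dictionary keys are illustrative labels. The observation that the two resolved categories sum to the 116 discrepancies mentioned in the abstract is an arithmetic reading of the numbers, not a statement taken from the paper.

```python
# Category counts as given in the caption for panel G.
categories = {
    "N2 assembly errors": 97,
    "variants arising from N2 to VC2010": 19,
    "undetermined": 20,
}

total_variants = sum(categories.values())
resolved = (
    categories["N2 assembly errors"]
    + categories["variants arising from N2 to VC2010"]
)

missing_in_n2 = 89  # assembly errors that were regions missing in the N2 reference
missing_share = missing_in_n2 / categories["N2 assembly errors"]

print(total_variants)                    # 136
print(resolved)                          # 116
print(f"{round(missing_share * 100)}%")  # 92%
```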

**Figure 4.**
New exons and genes in the VC2010 assembly. Segments of the VC2010 assembly are shown with N2-derived gene predictions, independent AUGUSTUS-derived gene predictions, and VC2010-specific DNA regions. For each gene, alternative transcript isoforms (if any) are shown. (A) Extra, VC2010-assembly-specific exons in the gene *cpsf-1*/*Y76B12C.7* (alias *chrIV_pilon.g9758*) (Supplemental Table S21). (B) *chrII_pilon.g6413*, a likely new gene encoded entirely by VC2010-specific DNA; BLASTP shows this to be a paralog of *T18D3.9*/*MPV17* in the N2 reference assembly but an ortholog of *Cnig_chr_II.g6634* in the PacBio-sequenced *C. nigoni*. Surrounding AUGUSTUS predictions in genomic DNA shared with N2 match N2 reference gene structures closely. (C) *chrX_pilon.g18545*, a paralog of *hasp-1*/*C01H6.9* encoded largely by VC2010-assembly-specific DNA. The latter two genes are listed in Supplemental Table S23.

Comment in

A new reference genome sequence for Caenorhabditis elegans?
Howe KL. Lab Anim (NY). 2019 Sep;48(9):267-268. doi: 10.1038/s41684-019-0371-1. PMID: 31406358. Free PMC article.

Cited by

Units containing telomeric repeats are prevalent in subtelomeric regions of a Mesorhabditis isolate collected from the Republic of Korea.
Kim S, Kim J. Genes Genomics. 2024 Dec;46(12):1461-1472. doi: 10.1007/s13258-024-01576-w. Epub 2024 Oct 4. PMID: 39367283.

Sensory neurons couple arousal and foraging decisions in Caenorhabditis elegans.
Scheer E, Bargmann CI. Elife. 2023 Dec 27;12:RP88657. doi: 10.7554/eLife.88657. PMID: 38149996. Free PMC article.

GALA: a computational framework for de novo chromosome-by-chromosome assembly with long reads.
Awad M, Gan X. Nat Commun. 2023 Jan 13;14(1):204. doi: 10.1038/s41467-022-35670-y. PMID: 36639368. Free PMC article.

A ubiquinone precursor analogue does not clearly increase the growth rate of Caenorhabditis inopinata.
Woodruff GC, Moser KA. MicroPubl Biol. 2024 Dec 5;2024:10.17912/micropub.biology.001235. doi: 10.17912/micropub.biology.001235. eCollection 2024. PMID: 39712935. Free PMC article.

Expansion of the split hygromycin toolkit for transgene insertion in Caenorhabditis elegans.
Moerdyk-Schauwecker MJ, Jahahn EK, Muñoz ZI, Robinson KJ, Phillips PC. MicroPubl Biol. 2024 Jan 29;2024:10.17912/micropub.biology.001091. doi: 10.17912/micropub.biology.001091. eCollection 2024. PMID: 38351905. Free PMC article.
