Genome Res. 2019 Jun;29(6):1009-1022.
doi: 10.1101/gr.244830.118. Epub 2019 May 23.

Recompleting the Caenorhabditis elegans genome

Jun Yoshimura et al. Genome Res. 2019 Jun.

Abstract

Caenorhabditis elegans was the first multicellular eukaryotic genome sequenced to apparent completion. Although this assembly employed a standard C. elegans strain (N2), it used sequence data from several laboratories, with DNA propagated in bacteria and yeast. Thus, the N2 assembly has many differences from any C. elegans available today. To provide a more accurate C. elegans genome, we performed long-read assembly of VC2010, a modern strain derived from N2. Our VC2010 assembly has 99.98% identity to N2 but with an additional 1.8 Mb including tandem repeat expansions and genome duplications. For 116 structural discrepancies between N2 and VC2010, 97 structures matching VC2010 (84%) were also found in two outgroup strains, implying deficiencies in N2. Over 98% of N2 genes encoded unchanged products in VC2010; moreover, we predicted ≥53 new genes in VC2010. The recompleted genome of C. elegans should be a valuable resource for genetics, genomics, and systems biology.

Figures

Figure 1.
Steps for detecting and filling gaps. (A) Contigs are ordered along the N2 reference assembly. Parts shown as dangling (colored light orange) fail to align and are missing in the N2 reference. (B) At a gap, regions in two Canu contigs (orange) map to proximal loci on the N2 reference; however, the two contigs have dangling end subsequences missing in the reference. In such cases, we estimate gaps between the contigs according to steps illustrated in C–G. (C) A single contig in other assemblies (yellow) fills a gap. (D) A long contig in other assemblies combines multiple contigs separated by more than one gap. (E) More than one contig fills a gap. (F) A single error-corrected read (light blue) fills a gap. (G) A hybrid approach of using multiple contigs and error-corrected reads fills a gap.
Figure 2.
Large gaps closed by long Nanopore reads. (A) Contigs of seven genome assemblies are aligned with Chromosome I of the N2 reference (see layouts for all chromosomes in Supplemental Fig. S2). The respective red and blue thick lines show alignments of contigs in the plus and minus strands. The vertical red line shows a large gap that failed to be filled by seven genome assemblies. (B–D) Examples of provisional gap closure using Nanopore data for a region where a long gap was found. (B) A self-dot plot for an initial model in which we ligate the last 30 kb of sequence from a contig just before a gap on Chromosome I (colored red) to 30 kb of sequence from another contig just after that gap. Two black boxes represent long tandem repeat expansions around the gap. (C) A dot plot between a single 92,790-nt Nanopore read (green) that connects the gap and the simple ligation model in B. (D) A self-dot plot of the Nanopore read shows that the two tandem repeats in C were underestimated. In this example, the left tandem repeat (red asterisk) has 1130 copies of a 26-nt unit string (5′-CATTTTTCTAAAATCCGCCGCAATGC-3′). Supplemental Table S4 shows the units of all tandem repeats in five large assembly gaps.
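The copy count quoted for panel D can be illustrated with a short script. This is a minimal sketch, not the authors' pipeline: it finds the longest run of exact back-to-back copies of a unit string, using the 26-nt unit from the caption. The function name `longest_tandem_run` and the toy sequence are hypothetical. Real Nanopore reads contain sequencing errors, so exact matching would likely undercount and an error-tolerant method would be needed in practice.

```python
def longest_tandem_run(seq: str, unit: str) -> int:
    """Return the largest number of consecutive exact copies of `unit` in `seq`."""
    k = len(unit)
    if k == 0:
        raise ValueError("unit must be non-empty")
    # runs[i] = number of back-to-back copies of `unit` starting at position i.
    runs = [0] * (len(seq) + 1)
    best = 0
    # Scan right to left so runs[i + k] is already known when we reach i.
    for i in range(len(seq) - k, -1, -1):
        if seq.startswith(unit, i):
            runs[i] = 1 + runs[i + k]
            best = max(best, runs[i])
    return best


# Unit string quoted in the Figure 2 caption (26 nt).
UNIT = "CATTTTTCTAAAATCCGCCGCAATGC"

# Toy sequence (an assumption for illustration, not real data):
# a run of 5 copies, a spacer, then a run of 2 copies.
toy_read = "GGG" + UNIT * 5 + "TTT" + UNIT * 2
copies = longest_tandem_run(toy_read, UNIT)

# Span implied by the caption's figures: 1130 copies of a 26-nt unit.
implied_span_nt = 1130 * len(UNIT)  # 29380 nt
```

At 1130 copies, the left repeat alone would span roughly 29 kb of the 92,790-nt read, which is consistent with the caption's point that the simple ligation model underestimated the repeats.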
Figure 3.
New genomic regions in VC2010 assembly. (A) Subdivision of sequence classes causing the 1.8-Mb increase in genome size from N2 assembly to VC2010. Large tandem repeat expansions (of size >1 kb) are predominant, accounting for 85% of the increased VC2010 DNA. Other sequence classes include insertions (>100 nt), duplications (>100 nt), and telomere repeats. Tandem repeats are divided into some with clear repeat units and others (“imperfect”) without them (Supplemental Fig. S7). (B) Phylogenetic tree of N2, VC2010 (PD1074), and outgroup strains CB4856 (PD2182) and MY2 (PD2183). (C) The yellow-colored duplicated region with two copies of a gene in VC2010 is compared with its best matching regions in N2, PD2182, and PD2183. The comparison implies that the duplication was a recent event occurring in the lineage from the original N2 strain to VC2010. Of note, two duplicated regions overlap slightly. (D) Because long reads were unavailable for N2, we compare the regions in VC2010 and PD2183 for which long reads were available, and we show a dot plot between the regions (a similar dot plot between VC2010 and PD2182 is shown in Supplemental Fig. S12). To confirm the correctness of both regions, we align raw PacBio reads collected from VC2010 and PD2183 to their respective genomic regions, and the alignments are shown as blue lines below the x-axis and to the right of the y-axis. Indeed, a number of alignments span and validate the focal duplicated region and its matching region. (E) A comparison of regions where VC2010, PD2182, and PD2183 coincide, but the green-colored region is missing in the N2 reference assembly, implying that the segment had been lost in culturing animals or clones used for the N2 assembly or in the original N2 assembly process. (F) As in D, aligning raw PacBio reads to both regions in VC2010 and PD2183 shows their validity (a similar dot plot between VC2010 and PD2182 is shown in Supplemental Fig. S10). 
(G) Frequencies of apparent insertions into VC2010 (missing in N2), deletions from VC2010 (surplus in N2), and genome duplications (in N2 or VC2010), sorted into three categories: 97 assembly errors in the N2 genome, 19 variants that arose in the lineage from N2 to VC2010, and 20 undetermined cases because of inconsistency among the four genomes. We categorized individual large variants by inspecting the dot plots in Supplemental Figures S10–S12 (Supplemental Tables S16–S18). Of the 97 assembly errors, 89 (92%) were regions missing in the N2 reference assembly.
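The counts in panel G can be reconciled with the percentages in the abstract by simple arithmetic. The sketch below assumes that the 116 structural discrepancies cited in the abstract are the 97 assembly errors plus the 19 lineage variants, with the 20 undetermined cases excluded; the source does not state this outright, but the numbers fit. The variable and key names are hypothetical.

```python
# Counts as stated in the Figure 3G caption.
categories = {
    "assembly_error_in_N2": 97,
    "variant_in_N2_to_VC2010_lineage": 19,
    "undetermined": 20,
}
missing_in_n2_among_errors = 89


def percent(part: int, whole: int) -> int:
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


total_inspected = sum(categories.values())
# Assumption: the abstract's 116 discrepancies exclude the undetermined cases.
resolved = total_inspected - categories["undetermined"]

share_matching_outgroups = percent(categories["assembly_error_in_N2"], resolved)
share_missing_in_n2 = percent(missing_in_n2_among_errors,
                              categories["assembly_error_in_N2"])

print(total_inspected, resolved)     # 136 116
print(share_matching_outgroups)      # 84
print(share_missing_in_n2)           # 92
```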
Figure 4.
New exons and genes in the VC2010 assembly. Segments of the VC2010 assembly are shown with N2-derived gene predictions, independent AUGUSTUS-derived gene predictions, and VC2010-specific DNA regions. For each gene, alternative transcript isoforms (if any) are shown. (A) Extra, VC2010-assembly-specific exons in the gene cpsf-1/Y76B12C.7 (alias chrIV_pilon.g9758) (Supplemental Table S21). (B) chrII_pilon.g6413, a likely new gene encoded entirely by VC2010-specific DNA; BLASTP shows this to be a paralog of T18D3.9/MPV17 in the N2 reference assembly but an ortholog of Cnig_chr_II.g6634 in the PacBio-sequenced C. nigoni. Surrounding AUGUSTUS predictions in genomic DNA shared with N2 match N2 reference gene structures closely. (C) chrX_pilon.g18545, a paralog of hasp-1/C01H6.9 encoded largely by VC2010-assembly-specific DNA. The latter two genes are listed in Supplemental Table S23.
