Genome Res. 2025 Aug 1;35(8):1902-1918.
doi: 10.1101/gr.280274.124.

CGC1, a new reference genome for Caenorhabditis elegans

Kazuki Ichikawa et al. Genome Res.

Abstract

The original 100.3 Mb reference genome for Caenorhabditis elegans, generated from the wild-type laboratory strain N2, has been crucial for analysis of C. elegans since 1998 and has been considered complete since 2005. Unexpectedly, this long-standing reference was shown to be incomplete in 2019 by a genome assembly from the N2-derived strain VC2010. Moreover, genetically divergent versions of N2 have arisen over decades of research and hindered reproducibility of C. elegans genetics and genomics. Here we provide a 106.4 Mb gap-free, telomere-to-telomere genome assembly of C. elegans, generated from CGC1, an isogenic derivative of the N2 strain. We use improved long-read sequencing and manual assembly of 43 recalcitrant genomic regions to overcome deficiencies of prior N2 and VC2010 assemblies and to assemble tandem repeat loci, including a 772 kb sequence for the 45S rRNA genes. Although many differences from earlier assemblies come from repeat regions, unique additions to the genome are also found. Of 19,972 protein-coding genes in the N2 assembly, 19,790 (99.1%) encode products that are unchanged in the CGC1 assembly. The CGC1 assembly also may encode 183 new protein-coding and 163 new ncRNA genes. CGC1 thus provides both a completely defined reference genome and corresponding isogenic wild-type strain for C. elegans, allowing unique opportunities for model and systems biology.

Figures

Figure 1.
The overlap-layout-consensus approach of assembling tandem repeat regions. (A) The overlap-and-layout step finds one or more Nanopore ultralong reads that span a focal gap and lays out multiple reads properly. (B) The consensus step corrects errors in the scaffold of Nanopore reads (red), aligns other Nanopore reads (green) to the scaffold, and calculates the consensus of the aligned reads. (C) To further eliminate errors in the consensus sequence (green), HiFi reads (blue) are aligned to the consensus and used to generate the consensus of mapped HiFi reads. (D) Comparison between the N2 and CGC1 assemblies. (#) The number of contigs, coding genes, and so on. The N2 genome assembly version is WBPS19 WBcel235 (GCA_000002985.3) generated in December 2012.
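The consensus step in panels B and C can be illustrated with a minimal Python sketch. It assumes the reads have already been aligned to the scaffold and padded to equal length with a gap character; the function name `column_consensus` and the toy reads are hypothetical, and a simple majority vote stands in for whatever consensus caller the authors actually used.

```python
from collections import Counter


def column_consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over reads already aligned to one scaffold.

    Every string must have the same length; `gap` marks positions that a
    read does not cover. Columns covered by no read are skipped.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must have equal length")

    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        if counts:
            # Take the most frequent base in this column.
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Toy example: the third read carries one error (C instead of G).
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "-CGTA-"]
print(column_consensus(reads))  # ACGTAC
```

Running the same vote a second time over HiFi reads mapped to the Nanopore consensus would correspond to the polishing pass described in panel C.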
Figure 2.
Polishing two complex tandem repeat regions in CHROMOSOME_I and CHROMOSOME_II. (A,B) Self-to-self dot plots of two Nanopore reads that span neighboring tandem repeats in CHROMOSOME_I (4,715,580–4,817,023; A) and CHROMOSOME_II (14,800,218–14,842,635; B). (C) The top row shows a spanning Nanopore read (colored red) and the coverage of other Nanopore reads mapped to the spanning read in CHROMOSOME_I. The bar for each base represents the distribution of aligned bases in the consensus. If one nucleotide accounts for >80%, the bar is colored gray; otherwise, the four nucleotides A, C, G, and T are green, blue, orange, and red, respectively. The second row shows coverage of HiFi reads mapped to the Nanopore consensus (light green). The third and fourth rows present coverages of HiFi reads and Nanopore reads to the HiFi consensus (light blue). Bars in the third row are gray, showing the consistency of HiFi reads with the HiFi consensus. The fourth row shows that there are many bars with C's (blue) as the majority and T's (red) as the minority, but the base in the consensus of all bars is C. (D) Read coverage of a spanning Nanopore read, Nanopore consensus, and HiFi consensus for complex tandem repeat regions in CHROMOSOME_II.
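The coloring rule in panel C (gray when one nucleotide exceeds 80%, otherwise one color per base) can be sketched as follows. This is an illustration only: the function name `column_display`, the return format, and the example columns are assumptions, not the plotting code behind the figure.

```python
from collections import Counter

# Color assignment as stated in the caption.
BASE_COLORS = {"A": "green", "C": "blue", "G": "orange", "T": "red"}


def column_display(bases, threshold=0.8):
    """Summarize one alignment column using the caption's coloring rule.

    Returns None for an empty column; otherwise a dict holding the majority
    base and either "gray" (majority fraction above the threshold) or a
    mapping from color to the fraction of reads showing that base.
    """
    counts = Counter(b for b in bases if b in BASE_COLORS)
    total = sum(counts.values())
    if total == 0:
        return None
    top_base, top_count = counts.most_common(1)[0]
    if top_count / total > threshold:
        return {"consensus": top_base, "colors": "gray"}
    fractions = {BASE_COLORS[b]: n / total for b, n in counts.items()}
    return {"consensus": top_base, "colors": fractions}


print(column_display("CCCCCCCCCT"))  # 90% C, so the bar is gray
print(column_display("CCCCCCTTTT"))  # 60% C and 40% T, consensus is still C
```

The second call mirrors the situation in the fourth row of the figure: C is the majority and T the minority, yet the consensus base remains C.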
Figure 3.
Similarity of prominent 27-mer tandem repeat regions. Self-to-self dot plot of 12 prominent tandem repeat genomic regions sharing a common 27-mer repeat unit. For example, the top row shows a 21,489 nt genomic region of CHROMOSOME_V (1,221,871–1,243,359) on the + (Watson) strand. Dots represent perfect matches of length 54 nt. For example, the dot plot for the fourth genomic region of CHROMOSOME_X (122,781–250,556) is very dense and appears black owing to the high similarity between the 54-mers. In contrast, white areas indicate discordance. Regions are grouped by sequence similarity rather than their genomic position. The green box contains four genomic regions that share the most frequent version of the 27-mer unit (5′-ACTCTCTGTGGCTTCCCACTATATTTT-3′), which we call type1. The blue box contains five genomic regions that share many copies of both type1 and another version of the 27-mer, type2 (5′-ACTCTCTGTGGCTTCCCACCATATTTT-3′), that has one T-to-C substitution (at position 20) with respect to type1.
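As a small illustration of the two ideas in this caption, the Python sketch below locates the single difference between the type1 and type2 units and computes exact k-mer matches of the kind a dot plot displays. The helper names `mismatches` and `dot_plot` are hypothetical, and the synthetic four-copy repeat is a stand-in for real genomic sequence; how the published plot was actually drawn is not stated in the caption.

```python
TYPE1 = "ACTCTCTGTGGCTTCCCACTATATTTT"
TYPE2 = "ACTCTCTGTGGCTTCCCACCATATTTT"


def mismatches(a, b):
    """List (1-based position, base in a, base in b) for each difference."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def dot_plot(seq_x, seq_y, k):
    """Return 0-based (i, j) pairs where seq_x[i:i+k] == seq_y[j:j+k]."""
    index = {}
    for j in range(len(seq_y) - k + 1):
        index.setdefault(seq_y[j:j + k], []).append(j)
    return [
        (i, j)
        for i in range(len(seq_x) - k + 1)
        for j in index.get(seq_x[i:i + k], [])
    ]


print(mismatches(TYPE1, TYPE2))  # [(20, 'T', 'C')]

# Synthetic perfect tandem repeat: four copies of the type1 unit.
repeat = TYPE1 * 4
dots = dot_plot(repeat, repeat, k=54)
# Every exact 54-mer match lies on a diagonal offset by a multiple of 27.
print(all((i - j) % 27 == 0 for i, j in dots))  # True
```

With k = 54 (two repeat units), a single substituted unit such as type2 breaks every window that overlaps it, which is consistent with the white areas marking discordance between regions.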
Figure 4.
Structure of the pSX1 array. (A) Self-to-self dot plot of a region with the 153 kb pSX1 array in CHROMOSOME_X (441,085–594,508). Dots represent perfect matches of substrings of length 100 nt. (B) On the left side of the table, the phylogenetic tree shows the proximity of variants in terms of sequence similarity. On the right side of the table, the colored bars indicate the locations of the occurrences of each variant. Characteristics of 11 frequent variants of the reference pSX1 (172 nt; orange): the number of occurrences of each variant (# Occ.), the ratio of number of occurrences to total (Occ. ratio), number of mismatches with the reference pSX1 (# Mis.), percentage match with the reference (match ratio), positions of base substitutions (e.g., 171A means the base at position 171 is substituted with A), positions with insertion bases (e.g., 51TCT shows TCT is inserted at position 51), and the statistical significance codes for P-values. (***) P ≤ 0.1%, (**) P ≤ 1%, and (*) P ≤ 5% such that each variant occurs adjacent to each other preferentially according to the Wald–Wolfowitz runs test.
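For readers unfamiliar with the Wald–Wolfowitz runs test, here is a minimal two-category sketch in Python using the standard normal approximation. It assumes each repeat unit is labeled as either the variant of interest or "other", and it reports a one-sided P-value for clustering (fewer runs than expected by chance). The function name `runs_test`, the labeling scheme, and the one-sided choice are assumptions for illustration; the caption does not give the exact procedure the authors used.

```python
import math


def runs_test(labels):
    """Two-category Wald-Wolfowitz runs test (normal approximation).

    Returns (runs, expected_runs, z, p_clustered), where p_clustered is the
    one-sided probability of seeing this few runs or fewer under random order.
    """
    categories = sorted(set(labels))
    if len(categories) != 2:
        raise ValueError("expected exactly two categories")
    n1 = sum(1 for x in labels if x == categories[0])
    n2 = len(labels) - n1
    n = n1 + n2
    # A new run starts wherever two neighboring labels differ.
    runs = 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    mean = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("sequence too short for the normal approximation")
    z = (runs - mean) / math.sqrt(variance)
    p_clustered = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return runs, mean, z, p_clustered


# "V" = unit matching a given variant, "o" = any other unit (toy data).
clustered = "VVVVVooooo"
alternating = "VoVoVoVoVo"
print(runs_test(clustered))
print(runs_test(alternating))
```

A strongly negative z (few runs) indicates that copies of a variant tend to sit next to one another, which is the pattern the significance codes in the table summarize.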
Figure 5.
45S rDNA array at the right end of CHROMOSOME_I. (A) Single-nucleotide variants (SNVs) detected in the 45S rDNA representative repeat unit of size 7197 nt; 3322 PacBio HiFi reads were aligned to the 45S rDNA representative repeat unit. Because each HiFi read was long enough for the 45S rDNA unit to occur approximately twice, HiFi read coverage in the 45S rDNA unit averaged 6691 at 7197 positions. The table shows SNVs whose minor nucleotides are detected 60 or more times in PacBio HiFi reads. (B) Nanopore reads are aligned so that distances of adjacent SNVs and insertions within Nanopore reads are consistent between aligned reads. (C) Consensus of aligned Nanopore reads, an assembly of the 45S rDNA array with 107 units. Each position in the consensus is covered by two or more Nanopore reads shown in B. (D) Histogram displays read coverage for each position within the consensus by Nanopore reads >100 kb in length.
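The figures quoted in this caption can be cross-checked with a few lines of Python arithmetic. The variable names are hypothetical, and the coverage calculation assumes that mean coverage is roughly the number of reads multiplied by the number of unit lengths each read spans.

```python
UNIT_LENGTH_NT = 7197   # representative 45S rDNA repeat unit
UNITS_IN_ARRAY = 107    # units in the assembled array (panel C)
HIFI_READS = 3322       # HiFi reads aligned to the unit (panel A)
MEAN_COVERAGE = 6691    # reported mean HiFi coverage per position

# Span of 107 copies of the representative unit.
array_span_nt = UNITS_IN_ARRAY * UNIT_LENGTH_NT
print(array_span_nt)  # 770079

# Unit lengths covered per read, if coverage = reads x units per read.
units_per_read = MEAN_COVERAGE / HIFI_READS
print(round(units_per_read, 2))  # 2.01
```

The product, about 770 kb, is close to the 772 kb reported in the abstract for the 45S rRNA gene sequence; the caption does not say what accounts for the remaining roughly 2 kb. The ratio of about 2.01 agrees with the statement that each HiFi read spans the unit approximately twice.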
Figure 6.
Transcription from a tandem repeat in various tissue types. This shows an alignment of long RNA-seq reads to a tandem repeat in CHROMOSOME_II (nt 11,590,316–11,597,644). The topmost rows show mRNA isoform splicing for genes in the region, with srap-1 containing the tandem repeat. (Below) Three tracks show alignments of tissue type data from Li et al. (2020) (NCBI Gene Expression Omnibus [GEO; https://www.ncbi.nlm.nih.gov/geo/], accession number GSE130044); six additional tracks show tissue type data from Roach et al. (2020) (European Nucleotide Archive [ENA; https://www.ebi.ac.uk/ena/browser/home], accession number PRJEB31791). The alignment on tandem repeats and introns in srap-1 was consistently observed in different tissue types, even though the RNA-seq data were collected in two independent studies, supporting the presence of transcription.

