[Preprint]. 2024 Dec 6:2024.12.04.626850.
doi: 10.1101/2024.12.04.626850.

CGC1, a new reference genome for Caenorhabditis elegans

Kazuki Ichikawa et al. bioRxiv.

Update in

  • CGC1, a new reference genome for Caenorhabditis elegans.
    Ichikawa K, Shoura MJ, Artiles KL, Jeong DE, Owa C, Kobayashi H, Suzuki Y, Kanamori M, Toyoshima Y, Iino Y, Rougvie AE, Wahba L, Fire AZ, Schwarz EM, Morishita S. Genome Res. 2025 Aug 1;35(8):1902-1918. doi: 10.1101/gr.280274.124. PMID: 40664475.

Abstract

The original 100.3 Mb reference genome for Caenorhabditis elegans, generated from the wild-type laboratory strain N2, has been crucial for analysis of C. elegans since 1998 and has been considered complete since 2005. Unexpectedly, this long-standing reference was shown to be incomplete in 2019 by a genome assembly from the N2-derived strain VC2010. Moreover, genetically divergent versions of N2 have arisen over decades of research and hindered reproducibility of C. elegans genetics and genomics. Here we provide a 106.4 Mb gap-free, telomere-to-telomere genome assembly of C. elegans, generated from CGC1, an isogenic derivative of the N2 strain. We used improved long-read sequencing and manual assembly of 43 recalcitrant genomic regions to overcome deficiencies of prior N2 and VC2010 assemblies, and to assemble tandem repeat loci including a 772-kb sequence for the 45S rRNA genes. While many differences from earlier assemblies came from repeat regions, unique additions to the genome were also found. Of 19,972 protein-coding genes in the N2 assembly, 19,790 (99.1%) encode products that are unchanged in the CGC1 assembly. The CGC1 assembly also may encode 183 new protein-coding and 163 new ncRNA genes. CGC1 thus provides both a completely defined reference genome and corresponding isogenic wild-type strain for C. elegans, allowing unique opportunities for model and systems biology.

Figures

Figure 1. Overlap-layout-consensus approach of assembling tandem repeat regions.
a, The overlap and layout step finds one or more Nanopore ultralong reads that span a focal gap, and lays out multiple reads properly. b, The consensus step corrects errors in the scaffold of Nanopore reads (red), aligns other Nanopore reads (green) to the scaffold, and calculates the consensus of the aligned reads. c, To further eliminate errors in the consensus sequence (green), HiFi reads (blue) are aligned to the consensus, and used to generate the consensus of mapped HiFi reads.
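The consensus step in panels b and c can be illustrated with a minimal sketch. It assumes the reads have already been aligned to the scaffold and padded to equal length, and it uses a simple per-column majority vote; the caption does not say which consensus caller the authors used, so the function name `column_consensus`, the gap symbol, and the toy reads are all hypothetical.

```python
from collections import Counter


def column_consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over reads already aligned to one scaffold.

    Every read is a string of the same length; `gap` marks positions that a
    read does not cover. Columns with no coverage are reported as 'N'.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


# Toy data (hypothetical): the scaffold read carries one error at index 4,
# and three other reads aligned to it outvote that error.
scaffold_read = "ACGTTGCA"
other_reads = ["ACGTAGCA", "ACGTAGC-", "-CGTAGCA"]
consensus = column_consensus([scaffold_read] + other_reads)
```

Running the same function a second time on HiFi reads aligned to the first consensus would mirror the two-stage polishing that the caption describes, under the same assumptions.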
Figure 2. Polishing two complex tandem repeat regions in chromosomes I and II.
a-b, Dot plots of two Nanopore reads that span neighboring tandem repeats in chromosome I (4,715,580-4,817,023) (a) and chromosome II (14,800,218-14,842,635) (b). c, The first row shows a spanning Nanopore read (colored red) and the coverage of other Nanopore reads mapped to the spanning read in chromosome I. The bar for each base represents the distribution of aligned bases in the consensus. If one nucleotide accounts for more than 80%, the bar is colored gray; otherwise, the four nucleotides A, C, G, and T are green, blue, orange, and red, respectively. The second row shows coverage of HiFi reads mapped to the Nanopore consensus (light green). The third and fourth rows show coverage of HiFi reads and Nanopore reads, respectively, mapped to the HiFi consensus (light blue). Bars in the third row are gray, showing the consistency of HiFi reads with the HiFi consensus. The fourth row contains many bars with C (blue) as the majority base and T (red) as the minority base, but the consensus base at all of these positions is C. d, Read coverage of a spanning Nanopore read, Nanopore consensus, and HiFi consensus for complex tandem repeat regions in chromosome II.
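The coloring rule in panel c (gray when one nucleotide exceeds 80%, colored otherwise) amounts to a per-column threshold on base fractions. The following is a rough sketch of that rule only, assuming pre-aligned, equal-length reads; `ambiguous_columns` and the toy reads are hypothetical and are not taken from the authors' plotting code.

```python
from collections import Counter


def ambiguous_columns(aligned_reads, threshold=0.8, gap="-"):
    """List columns where no single base accounts for more than `threshold`.

    Returns (position, {base: fraction}) pairs; these are the columns that
    would be drawn in color rather than gray under the caption's rule.
    """
    flagged = []
    for pos, column in enumerate(zip(*aligned_reads)):
        counts = Counter(base for base in column if base != gap)
        total = sum(counts.values())
        if total == 0:
            continue  # no read covers this column
        top_fraction = counts.most_common(1)[0][1] / total
        if top_fraction <= threshold:
            fractions = {base: n / total for base, n in counts.items()}
            flagged.append((pos, fractions))
    return flagged


# Toy data (hypothetical): column 3 is split 3 C to 2 T, so C is the
# majority (and would be the consensus base) but the column is not gray.
reads = ["ACGCA", "ACGCA", "ACGCA", "ACGTA", "ACGTA"]
flagged = ambiguous_columns(reads)
```

In the toy data, only the split column is flagged, which parallels the fourth-row situation in the caption where C is the majority and T the minority.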
Figure 3. Similarity of prominent 27-mer tandem repeat regions.
Dot plot of 12 prominent tandem repeat genomic regions sharing a common 27-mer repeat unit. For example, the top row shows a 21,489-nt genomic region of chromosome V (1,221,871-1,243,359) on the + (Watson) strand. Dots represent perfect matches of length 54 nt. For example, the dot plot for the fourth genomic region of the X chromosome (122,781-250,556) is very dense and appears black due to the high similarity between the 54-mers. In contrast, white areas indicate discordance. Regions are grouped by sequence similarity rather than by genomic position. The green box contains four genomic regions that share the most frequent version of the 27-mer unit (5’-ACT CTC TGT GGC TTC CCA CTA TAT TTT-3’), which we call type1. The blue box contains five genomic regions that share many copies of both type1 and another version of the 27-mer, type2 (5’-ACT CTC TGT GGC TTC CCA CCA TAT TTT-3’), which has one T-to-C substitution with respect to type1 (underlined in the original figure; position 20 of the unit).
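A dot plot of perfect 54-nt matches can be sketched with a plain k-mer index. The two 27-mer units below are copied from the caption; the function names and the toy arrays built from those units are hypothetical, and the sketch ignores strand and scale, which a real plot of regions tens of kilobases long would need to handle.

```python
from collections import defaultdict

TYPE1 = "ACTCTCTGTGGCTTCCCACTATATTTT"  # type1 unit from the caption
TYPE2 = "ACTCTCTGTGGCTTCCCACCATATTTT"  # type2 unit from the caption


def mismatches(a, b):
    """1-based positions where two equal-length sequences differ."""
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def exact_match_dots(seq_x, seq_y, k=54):
    """(i, j) pairs where seq_x[i:i+k] and seq_y[j:j+k] match perfectly."""
    index = defaultdict(list)
    for j in range(len(seq_y) - k + 1):
        index[seq_y[j:j + k]].append(j)
    return [
        (i, j)
        for i in range(len(seq_x) - k + 1)
        for j in index.get(seq_x[i:i + k], [])
    ]


# Toy arrays (hypothetical): four type1 copies against themselves give a
# dense plot; a type1/type2 mixture against pure type1 gives no dots at k=54.
unit_difference = mismatches(TYPE1, TYPE2)
self_dots = exact_match_dots(TYPE1 * 4, TYPE1 * 4)
mixed_dots = exact_match_dots(TYPE1 * 2, TYPE1 + TYPE2)
```

Because 54 nt is two full units, a single substitution in either copy removes the dot, which is one way to read the contrast between dense black areas and white areas in the figure.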
Figure 4. Structure of the pSX1 array.
a, Self-dot plot of a region with the 153-kb pSX1 array in chromosome X (441,085-594,508). Dots represent perfect matches of substrings of length 100 nt. b, Characteristics of eleven frequent variants of the reference pSX1 (172 nt, orange). On the left side of the table, the phylogenetic tree shows the proximity of variants in terms of sequence similarity. On the right side of the table, the colored bars indicate the locations of the occurrences of each variant. For each variant, the table gives: the number of occurrences; the ratio of the number of occurrences to the total; the number of mismatches with the reference pSX1; the percentage match with the reference; positions of base substitutions (e.g., 171A means the base at position 171 is substituted with A); positions of inserted bases (e.g., 51TCT means TCT is inserted at position 51); and significance codes for p-values (***<=0.1%, **<=1%, and *<=5%) from the Wald–Wolfowitz runs test of whether copies of the variant preferentially occur adjacent to one another.
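For readers unfamiliar with the Wald–Wolfowitz runs test, a minimal sketch follows. It treats the array as a sequence of variant labels, reduces it to "this variant" versus "any other", and uses the standard normal approximation with a one-sided p-value for having fewer runs than expected (that is, clustering). Whether the authors used an exact or approximate test, and one-sided or two-sided p-values, is not stated in the caption; those choices, the function name, and the toy label strings are assumptions.

```python
import math


def runs_test_clustering(labels, variant):
    """Wald-Wolfowitz runs test for clustering of one variant in an array.

    Returns (number of runs, z score, one-sided p-value). A small p-value
    means fewer runs than expected by chance, i.e. copies of `variant`
    tend to sit next to each other. Uses the normal approximation.
    """
    is_variant = [label == variant for label in labels]
    n1 = sum(is_variant)
    n2 = len(is_variant) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both classes must occur at least once")

    runs = 1 + sum(a != b for a, b in zip(is_variant, is_variant[1:]))
    n = n1 + n2
    mean = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("sequence too short for the normal approximation")

    z = (runs - mean) / math.sqrt(variance)
    p_lower = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    return runs, z, p_lower


# Toy label sequences (hypothetical): clustered copies versus alternating.
clustered = runs_test_clustering("AAAAAABBBBBB", "A")
alternating = runs_test_clustering("ABABABABABAB", "A")
```

The clustered sequence has only two runs and a small p-value, while the alternating sequence has the maximum number of runs and a p-value near one.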
Figure 5. 45S rDNA array at the right end of Chromosome I.
a, Single nucleotide variants (SNVs) detected in the 45S rDNA representative repeat unit of size 7,197 nt. A total of 3,322 PacBio HiFi reads were aligned to the 45S rDNA representative repeat unit. Because each HiFi read was long enough to contain the 45S rDNA unit approximately twice, HiFi read coverage averaged 6,691 across the 7,197 positions of the unit. The table shows SNVs whose minor nucleotides are detected 60 or more times in PacBio HiFi reads. b, Nanopore reads are aligned so that distances between adjacent SNVs and insertions within Nanopore reads are consistent between aligned reads. c, Consensus of aligned Nanopore reads, an assembly of the 45S rDNA array with 107 units. Each position in the consensus is covered by two or more of the Nanopore reads shown in panel b. d, Histogram of read coverage at each position within the consensus by Nanopore reads >100 kb in length.
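The figures quoted in this caption can be cross-checked with simple arithmetic. The sketch below uses only numbers from the caption, with "approximately twice" rounded to exactly 2 as a stated assumption; it does not try to explain the small differences from the reported average coverage (6,691) or from the 772-kb array length given in the abstract.

```python
UNIT_LENGTH_NT = 7197  # representative 45S rDNA repeat unit (caption)
HIFI_READS = 3322      # HiFi reads aligned to the unit (caption)
UNITS_PER_READ = 2     # "approximately twice"; exactly 2 is an assumption
ARRAY_UNITS = 107      # units in the assembled array (caption)

# Expected average coverage if every read contained exactly two units.
expected_coverage = HIFI_READS * UNITS_PER_READ

# Array length if all 107 units had the representative length.
array_length_nt = ARRAY_UNITS * UNIT_LENGTH_NT

print(f"expected coverage: about {expected_coverage}")
print(f"array length: about {array_length_nt / 1000:.0f} kb")
```

Both estimates land close to the reported values, which is all this back-of-the-envelope check is meant to show.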
Figure 6. Transcription from a tandem repeat in various tissue types.
Alignment of long RNA-seq reads to a tandem repeat in chromosome II (nt 11,590,316-11,597,644). The topmost rows show mRNA isoform splicing for genes in the region, with srap-1 containing the tandem repeat. Below, three tracks show alignments of tissue-type data from Li et al. (2020) (GSE130044, NCBI Gene Expression Omnibus); six further tracks show tissue-type data from Roach et al. (2020) (PRJEB31791, ENA). Reads aligning to the tandem repeats and introns of srap-1 were consistently observed across tissue types, even though the RNA-seq data were collected in two independent studies, supporting the presence of transcription.

References

    1. Alkan C., Sajjadian S. and Eichler E. E., 2011. Limitations of next-generation genome sequence assembly. Nat Methods 8: 61–65.
    2. Angiuoli S. V. and Salzberg S. L., 2011. Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics 27: 334–342.
    3. Awad M. and Gan X., 2023. GALA: a computational framework for de novo chromosome-by-chromosome assembly with long reads. Nat Commun 14: 204.
    4. Biswas S., Gurdziel K. and Meller V. H., 2024. siRNA that participates in Drosophila dosage compensation is produced by many 1.688X and 359 bp repeats. Genetics 227.
    5. Boeke J. D., Church G., Hessel A., Kelley N. J., Arkin A. et al., 2016. Genome engineering: the Genome Project-Write. Science 353: 126–127.
