Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2025 Jan 11;53(2):gkae1177.
doi: 10.1093/nar/gkae1177.

Comprehensive genome annotation of the model ciliate Tetrahymena thermophila by in-depth epigenetic and transcriptomic profiling

Affiliations

Comprehensive genome annotation of the model ciliate Tetrahymena thermophila by in-depth epigenetic and transcriptomic profiling

Fei Ye et al. Nucleic Acids Res. .

Abstract

The ciliate Tetrahymena thermophila is a well-established unicellular model eukaryote, contributing significantly to foundational biological discoveries. Despite its acknowledged importance, current studies on Tetrahymena biology face challenges due to gene annotation inaccuracy, particularly the notable absence of untranslated regions (UTRs). To comprehensively annotate the Tetrahymena macronuclear genome, we collected extensive transcriptomic data spanning various cell stages. To ascertain transcript orientation and transcription start/end sites, we incorporated data on epigenetic marks displaying enrichment towards the 5' end of gene bodies, including H3 lysine 4 tri-methylation (H3K4me3), histone variant H2A.Z, nucleosome positioning and N6-methyldeoxyadenine (6mA). Cap-seq data was subsequently applied to validate the accuracy of identified transcription start sites. Additionally, we integrated Nanopore direct RNA sequencing (DRS), strand-specific RNA sequencing (RNA-seq) and assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) data. Using a newly developed bioinformatic pipeline, coupled with manual curation and experimental validation, our work yielded substantial improvements to the current gene models, including the addition of 2,481 new genes, updates to 23,936 existing genes, and the incorporation of 8,339 alternatively spliced isoforms. Furthermore, novel UTR information was annotated for 26,687 high-confidence genes. Intriguingly, 20% of protein-coding genes were identified to have natural antisense transcripts characterized by high diversity in alternative splicing, thus offering insights into understanding transcriptional regulation. Our work will enhance the utility of Tetrahymena as a robust genetic toolkit for advancing biological research, and provides a promising framework for genome annotation in other eukaryotes.

PubMed Disclaimer

Figures

Graphical Abstract
Graphical Abstract
Figure 1.
Figure 1.
Schematic overview of gene model optimization by integrating transcriptomic and epigenetic data. (A) Transcripts at different stages of growth, starvation and conjugation were assembled into draft v1. By comparing newly assembled transcripts with those from TGD2021, well-annotated genes were retained, and error genes were optimized with the assistance of Nanopore DRS and ssRNA-seq data, resulting in draft v2. (B) Epigenetic data were integrated to predict TSSs using a RF model, and TSSs were further categorized into eTSSs and pTSSs with the addition of ATAC-seq data. Cap-seq data was subsequently applied to validate the accuracy of identified TSSs. Further optimization of the gene model was achieved using eTSSs, resulting in draft v3. (C) Transcripts in draft v3 were subjected to ORF prediction, and UTR information was provided based on information of CDS, TSSs and TESs, resulting in draft v4. Features of regulatory elements including promoters, poly-A sequences and PASs were analyzed. (D) The draft gene model v4 underwent two rounds of manual curation, followed by additional genome polish and protein function annotation, resulting in the generation of an improved gene model, draft v5. (E) Annotation of alternatively spliced (AS) isoforms was performed by integrating RNA-seq and Nanopore DRS data, resulting in TGD2024 (updated). NATs were annotated based on the updated gene model. TGD, Tetrahymena genome database; NFR, nucleosome-free region; PAS, polyadenylation signal.
Figure 2.
Figure 2.
Integrative Genomics Viewer (IGV) snapshots showing five categories of gene models optimized by transcriptomic data, including new gene (A), exon-altered gene (B), fused gene (C), partitioned gene (D) and orientation-reversed gene (E). Low-confidence genes (F) were not supported by RNA-seq data, thus retaining their annotations in draft v2 as in TGD2021. Tracks from top to bottom were RNA-seq (growth, starvation for 24 h, and conjugation at 4, 5, 6, 8 and 10 h), Nanopore DRS coverage and reads alignment, and the gene models of draft v2 and TGD2021. Reads and gene models in pink represented transcription on the sense strand, and those in purple on the antisense strand.
Figure 3.
Figure 3.
Gene model optimization using epigenetic information. (A) Prediction of eTSSs and pTSSs using epigenetic data and validation using Cap-seq data. Genes were classified into different groups according to the presence or absence of eTSSs and pTSSs. (B) Distribution profiles of H3K4me3, H2A.Z, 6mA and nucleosome on the gene body. Genes were scaled to unit length and were extended to each side by 1kb length. Note that all four marks were accumulated downstream of TSS, towards the 5′ end of the gene body. (C) The receiver operating characteristics-area under the curve (ROC-AUC) measuring the performance of our RF model. The ROC was a probability curve and AUC represented the degree or measure of separability. The higher the AUC, the better the model was at predicting ‘TSS-region’ classes as ‘TSS-region’ classes or ‘not-TSS-region’ classes. The AUC for both the training data and the testing data was close to 1, indicating excellent performance of our RF model in predicting TSS-region. (D–H). IGV snapshots of seven types of gene models optimized by epigenetic data with the complementation of transcriptomic data, including new gene (C), orientation-reversed gene (D), TSS-altered gene (E), fused gene (F) and partitioned gene (G). Partitioned gene was further subcategorized as co-directional (a), tail-to-tail (b) and head-to-head (c). The tracks from top to bottom were epigenetic information including NFR deduced from ATAC-seq, H3K4me3, H2A.Z, 6mA and nucleosome and RNA-seq transcripts of different cell stages. The most highly expressed transcripts in conjugation were selected. The arrows pointing to the left in DRS reads and the gene model represent transcription on the Watson strand, while those pointing to the right represent transcription on the Crick strand.
Figure 4.
Figure 4.
UTR annotation and regulatory elements analysis. (A) Schematics for UTR annotation. ORF prediction was conducted on top of draft v3, resulting in a total of 26,961 protein-coding genes. 689 genes lacking ORF were defined as potential non-coding RNA. A putative ORF was defined as amino acids sequence longer than 100 aa. UTR information was further supplemented based on predicted TSSs and TESs. A total of 274 low-confidence genes defined in Figure 2F lacked UTR annotations. Draft v3 after ORF prediction and UTR annotation generated draft v4. (B) UTR comparisons between draft v4 and TGD2014, the latter of which contained UTR information for 1,477 genes. Mann–Whitney test was performed. ****P< 0.0001, **P< 0.01. (C) Enriched core promoter motifs in promoter proximal sequences around TSS were identified by Homer (52). P-values represented the statistical significance of motif enrichment, indicating the likelihood that the observed frequency of each motif in the specified genomic region was greater than what would be expected by chance. (D) Venn diagram showing the composition of sequence motifs around polyadenylation signal (PAS). AATAAA was identified as the most predominant motif. (E) Summary of nucleotide frequencies and main regulatory elements around cleavage sites. Cleavage sites were significantly associated with the AT motif. The dashed black line represented positions of cleavage sites. (F) Length distribution of poly-A tails identified in Nanopore DRS (minimum reads > 5). The median length was 18 nt, illustrated by a dashed black line. (G) Distribution of the maximal poly-A tail length for each gene (number of gene with poly-A = 21,660). All genes were sorted by the length of their longest poly-A tails from shortest to longest and divided into three groups: (1) the first 25% of genes, defined as short-tailed genes, with poly-A tail length ranging from 5 to 19 nt; (2) the middle 25%–75% of genes, defined as medium-tailed genes, with poly-A tail length between 19 and 239 nt; (3) the remaining 25% of genes, defined as long-tailed genes, with poly-A tail length exceeding 239 nt. (H) GO analysis revealed that short-tailed and long-tailed genes were enriched in distinct functional groups. The colored bars represented the percentage of genes in each tail-length category. (I) Distribution of poly-A tail length in different functional groups. Mann–Whitney test was performed. ****P < 0.0001, **P < 0.01, and no significance (ns) P > 0.05. (J) The Spearman's correlation between poly-A tail length and gene expression level (rho = 0.72, P< 2.2e-16). The longest poly-A tail was selected as the representative for each gene. Gene expression levels were quantified using the number of Nanopore DRS reads, with the removal of interference from antisense RNA. Both axes were plotted on a logarithmic scale.
Figure 5.
Figure 5.
Polished and comprehensive annotation of the MAC genome through manual curation. (A) Illustration of manual curation and genome polish on draft v4, resulting in draft v5. Two rounds of manual curation were conducted for all 180 non-rDNA chromosomes, focusing on genes with more than one eTSS, as well as those with neither eTSS nor pTSS. Genome polish was conducted by correcting error sites identified through manual curation using Illumina sequencing data. (B) An IGV snapshot showing the manual curation of a multi-eTSS gene, based on epigenetic and transcriptomic data. The tracks from top to bottom were open chromatin, H3K4me3, H2A.Z, 6mA, nucleosome, RNA-seq transcripts from different cell stages and Nanopore DRS transcripts. The arrows and dashed lines indicated positions of eTSSs. (C) An IGV snapshot showing the manual curation of a tandem duplicate gene, by incorporating RNA-seq transcripts of different cell stages with its corresponding reads alignment and Nanopore DRS reads. The arrows indicated the chimeric alignment of RNA-seq transcripts. (D) An IGV snapshot showing the manual curation of a universal AS gene, by incorporating RNA-seq transcripts of different cell stages with its corresponding reads alignment and Nanopore DRS reads. In the magnified box on the right, arrows indicated the universal AS site. These universal AS genes were annotated with their most dominant isoforms. (E) An IGV snapshot showing that multi-short-exon genes were always error-assembled when using Nanopore DRS data. This manual curation was performed with the aid of RNA-seq data from multiple stages. In the magnified box on the bottom, arrows indicated error-assembled sites. (F) An IGV snapshot showing an insertion site located in the exon resulted in erroneous CDS predictions. This manual curation was supported by both Illumina and transcriptomic data. The arrow and box indicated the insertion site.
Figure 6.
Figure 6.
Annotation of AS isoforms in Tetrahymena. (A) The representative display of gene models and IGV snapshots of Nanopore DRS reads for six different AS types. (B) Comparative summary of gene and isoform numbers in each of the six different AS types in TGD02021 and TGD2024. (C) A heatmap depicting the expression profiles of AS transcripts across different stages: growth, starvation for 24 h and conjugation at 4, 5, 6, 8 and 10 h. (D) A representative gene exhibiting a stage-specific tendency for IR, supported by transcriptomic data (left), as well as the ratio of retained intron and the ratio of the intron-containing reads (right). The ratio of retained intron was defined as the retained intron number divided by the total sequenced intron number of the gene. The ratio of the intron-containing reads was defined as the reads aligned to the intron divided by the total reads aligned to the gene.
Figure 7.
Figure 7.
Identification and characterization of five types of NATs in Tetrahymena. (A) Schematics for NATs annotation. NATs were identified on the TGD2024 using both transcriptomic and epigenetic data. Identified NATs were further categorized based on their relative positions to corresponding sense transcripts. (B–E) IGV snapshots showing five types of NATs. They included promoter NATs (B), originating from shared bidirectional promoters of the sense transcripts; type 1 exonic NATs (C), located within 1 kb downstream of the TSSs of the sense transcripts and sharing epigenetic marks with their sense transcripts; type 2 exonic NATs (D), located > 1 kb downstream of the TSSs of the sense transcripts; and intronic NATs (E), transcribed from the intronic regions of sense transcripts. (F) Distribution profiles of H3K4me3, H2A.Z, 6mA and well-positioned nucleosomes on the transcript body of NATs. Transcripts were scaled to unit length and were extended to each side by 1 kb length. (G) Box plot representation of normalized read counts for sense transcripts and NATs during growth, starvation and conjugation. (H) An IGV snapshot showing the anti-correlation of temporal expression patterns between a NAT and its corresponding sense transcript (left). The line chart (right) depicted the proportion of expression level for sense and antisense transcripts at different time points. The error bar represented the standard deviation. (I) The box plot showing that the ASD of NATs exceeded that of their sense transcripts (the median of NATs and sense transcripts were 0.96 and 0.28, respectively). Mann–Whitney was performed. ****P< 0.0001. ASD was defined as the number of different types of splice sites divided by the total reads aligned to the NATs or sense transcripts.

References

    1. Mochizuki K., Fine N.A., Fujisawa T., Gorovsky M.A.. Analysis of a piwi-related gene implicates small RNAs in genome rearrangement in Tetrahymena. Cell. 2002; 110:689–699. - PubMed
    1. Mochizuki K., Gorovsky M.A.. Conjugation-specific small RNAs in Tetrahymena have predicted properties of scan (scn) RNAs involved in genome rearrangement. Genes Dev. 2004; 18:2068–2073. - PMC - PubMed
    1. Collins K., Gorovsky M.A.. Tetrahymena thermophila. Curr. Biol. 2005; 15:R317–R318. - PubMed
    1. Gao S., Xiong J., Zhang C., Berquist B.R., Yang R., Zhao M., Molascon A.J., Kwiatkowsk S.Y., Yuan D., Qin Z.. Impaired replication elongation in Tetrahymena mutants deficient in histone H3 Lys 27 monomethylation. Genes Dev. 2013; 27:1662–1679. - PMC - PubMed
    1. Xiong J., Gao S., Dui W., Yang W., Chen X., Taverna S.D., Pearlman R.E., Ashlock W., Miao W., Liu Y.. Dissecting relative contributions of cis- and trans-determinants to nucleosome distribution by comparing Tetrahymena macronuclear and micronuclear chromatin. Nucleic Acids Res. 2016; 44:10091–10105. - PMC - PubMed

LinkOut - more resources