Front Genet. 2022 Oct 3;13:997460. doi: 10.3389/fgene.2022.997460. eCollection 2022.

Prediction of transcript isoforms in 19 chicken tissues by Oxford Nanopore long-read sequencing

Dailu Guan et al. Front Genet. 2022.

Abstract

To identify and annotate transcript isoforms in the chicken genome, we generated Nanopore long-read sequencing data from 68 samples that encompassed 19 diverse tissues collected from experimental adult male and female White Leghorn chickens. More than 23.8 million reads with a mean read length of 790 bases and an average quality of 18.2 were generated. The annotation and subsequent filtering resulted in the identification of 55,382 transcripts at 40,547 loci with a mean length of 1,700 bases. We predicted 30,967 coding transcripts at 19,461 loci, and 16,495 lncRNA transcripts at 15,512 loci. Compared to existing reference annotations, we found ∼52% of annotated transcripts could be partially or fully matched while ∼47% were novel. Seventy percent of novel transcripts were potentially transcribed from lncRNA loci. Based on our annotation, we quantified transcript expression across tissues and found two brain tissues (i.e., cerebellum and cortex) expressed the highest number of transcripts and loci. Furthermore, ∼22% of the transcripts displayed tissue specificity with the reproductive tissues (i.e., testis and ovary) exhibiting the most tissue-specific transcripts. Despite our wide sampling, ∼20% of Ensembl reference loci were not detected. This suggests that deeper sequencing and additional samples that include different breeds, cell types, developmental stages, and physiological conditions are needed to fully annotate the chicken genome. The application of Nanopore sequencing in this study demonstrates the usefulness of long-read data in discovering additional novel loci (e.g., lncRNA loci) and resolving complex transcripts (e.g., the longest transcript for the TTN locus).

Keywords: annotation; chicken; long-read sequencing; nanopore; transcript isoform; transcriptome.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Data summary of 68 chicken Nanopore long-read transcriptome datasets. (A) Bivariate plot (De Coster et al., 2018) depicting read length (x-axis) and quality (y-axis) of Nanopore long-read transcriptome reads. (B) Hierarchical clustering of the 68 chicken Nanopore long-read transcriptome samples used in this study. The dendrogram is built from gene expression quantified as transcripts per million (TPM ≥0.1). The distance between individuals is given by 1 − r, where r is the Pearson correlation coefficient. The red arrow indicates sample Cecum_CA, which did not cluster with the other cecal samples. (C) Correlation between the number of sequencing reads (x-axis) and the number of expressed genes (y-axis, TPM >0.1). The Pearson correlation is 0.71 (p = 1.30 × 10⁻¹¹).
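The 1 − r distance described for panel (B) can be illustrated with a small sketch. This is not the authors' code: the sample names and TPM values below are hypothetical, and any filtering or transformation applied before clustering in the study is not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Distance 1 - r: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson_r(x, y)


# Hypothetical per-gene TPM profiles for three samples (illustrative only).
expression = {
    "sample_A": [10.0, 20.0, 30.0, 40.0],
    "sample_B": [11.0, 19.0, 31.0, 42.0],
    "sample_C": [40.0, 30.0, 20.0, 10.0],
}

# Pairwise distance matrix that a hierarchical clustering routine could consume.
names = sorted(expression)
distances = {
    (a, b): correlation_distance(expression[a], expression[b])
    for a in names
    for b in names
}
```

Samples with similar expression profiles get a distance near 0, so they join early in the dendrogram; a sample whose profile diverges from its tissue group (as reported for Cecum_CA) would sit far from that group.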
FIGURE 2
Transcript assembly using Nanopore long-read transcriptome data. (A) Comparisons of predicted transcripts against the Ensembl (V102, vsEMBL) and NCBI (V105, vsNCBI) annotations. The transcripts were classified with the GffCompare software (Pertea and Pertea, 2020). Panels (B,C) depict the distributions of predicted transcript length and exon number, respectively. (D) A screenshot showing the longest predicted transcript, which is located on chromosome 7 (15,343,033–15,384,347). BLAST analysis indicated that the transcript matched the TTN gene locus, which encodes the titin protein.
FIGURE 3
Characterization of assembled transcripts. (A) Number of loci in the NCBI (V105), Ensembl (V102), and our annotations. (B) Pie chart depicting GffCompare types relative to the Ensembl annotation (V102). (C) Number of transcripts as a function of protein-coding, lncRNA, and other non-coding loci. (D) Transcript expression measured as transcripts per million (TPM) as a function of the transcript types classified by the GffCompare tool. Exact match: GffCompare code ‘=’, which means the intron chain of our annotated transcript exactly matches a reference annotation; novel isoform: GffCompare codes ‘c,’ ‘k,’ ‘j,’ ‘m,’ ‘n,’ or ‘o’, which means the predicted transcript does not match a reference transcript but matches a reference gene; novel loci: GffCompare codes ‘i,’ ‘u,’ ‘y,’ or ‘x’, which means the predicted transcript matches neither a reference transcript nor a reference locus. The type ‘y’ has only 134 transcripts, a proportion too small to be visible in the pie chart. Student’s t-tests were carried out between pairs of transcript groups, and p values were adjusted using the false discovery rate (FDR) method (Benjamini and Hochberg, 1995).
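The grouping of GffCompare class codes described in this caption can be written as a small lookup. The sketch below only encodes the three groups named in the caption; the function name, the "other" fallback, and the example codes are assumptions for illustration, not part of the study's pipeline.

```python
from collections import Counter

# Class-code groups as listed in the Figure 3 caption.
EXACT_MATCH = {"="}
NOVEL_ISOFORM = {"c", "k", "j", "m", "n", "o"}
NOVEL_LOCUS = {"i", "u", "y", "x"}


def categorize(class_code):
    """Map a GffCompare class code to the caption's three categories.

    Codes outside the three listed groups are returned as "other"
    (an assumption; the caption does not say how they were handled).
    """
    if class_code in EXACT_MATCH:
        return "exact match"
    if class_code in NOVEL_ISOFORM:
        return "novel isoform"
    if class_code in NOVEL_LOCUS:
        return "novel locus"
    return "other"


# Hypothetical class codes for a handful of predicted transcripts.
example_codes = ["=", "j", "j", "u", "x", "c", "="]
summary = Counter(categorize(code) for code in example_codes)
```

Here `summary` tallies how many of the example transcripts fall into each category, which is the kind of count a pie chart like panel (B) summarizes.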
FIGURE 4
Analysis of tissue specificity across tissues. (A) Tissue specificity index (TSI) as a function of the transcript types classified by GffCompare. Code ‘=’ means the intron chain of our annotated transcript exactly matches a reference annotation (exact match); codes ‘c,’ ‘k,’ ‘j,’ ‘m,’ ‘n,’ or ‘o’ mean the predicted transcript does not match a reference transcript but matches a reference gene (novel isoform); codes ‘i,’ ‘u,’ ‘y,’ or ‘x’ mean the predicted transcript matches neither a reference transcript nor a reference locus (novel loci). (B) Transcript expression measured as transcripts per million (TPM) as a function of TSI. We grouped transcripts according to their expression. (C) Number of tissue-specific transcripts in each tissue. (D) A screenshot showing a novel transcript predicted only by our data, which is located on chromosome 4 (52,482,563–52,492,561). (E) TPM expression of the predicted lncRNA transcript shown in panel (D). The transcript is highly expressed in testis samples but not in any other tissue. FEELnc predicted it to be a sense intergenic lncRNA.
FIGURE 5
Functional enrichment of tissue-specific transcripts and differential alternative splicing (DAS) analysis. (A) Heatmap depicting the negative log10 FDR (false discovery rate) values for the top 10 Gene Ontology (GO) Biological Process terms. On the right side, we show several examples of GO terms, as well as their FDR values. (B) Number of unique transcripts detected as a function of tissues added. Transcripts are categorized into three types (see Methods). (C) Sashimi plots of the CYB561A3 gene, which showed DAS between heart (red) and testis (blue).
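Panel (A) reports enrichment as negative log10 FDR values. As a minimal sketch of how such values can be derived from raw p values, the code below assumes Benjamini-Hochberg adjustment (the method cited in the Figure 3 caption); whether the enrichment analysis used exactly this procedure is not stated in the caption, and the p values shown are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def neg_log10(value):
    """Negative log10 transform used for heatmap display."""
    return -math.log10(value)


# Hypothetical raw p values for four GO terms (illustrative only).
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
heatmap_values = [neg_log10(q) for q in fdr]
```

Larger values of `heatmap_values` correspond to smaller adjusted p values, so the most strongly enriched terms appear as the most intense cells in a heatmap of this kind.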

References

    1. Amarasinghe S. L., Su S., Dong X., Zappia L., Ritchie M. E., Gouil Q. (2020). Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 21, 30. 10.1186/s13059-020-1935-5
    2. Anders S., Pyl P. T., Huber W. (2015). HTSeq—A Python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169. 10.1093/bioinformatics/btu638
    3. Andersson L., Archibald A. L., Bottema C. D., Brauning R., Burgess S. C., Burt D. W., et al. (2015). Coordinated international action to accelerate genome-to-phenome with FAANG, the Functional Annotation of Animal Genomes project. Genome Biol. 16, 57. 10.1186/s13059-015-0622-4
    4. Baralle F. E., Giudice J. (2017). Alternative splicing as a regulator of development and tissue identity. Nat. Rev. Mol. Cell Biol. 18, 437–451. 10.1038/nrm.2017.27
    5. Beiki H., Liu H., Huang J., Manchanda N., Nonneman D., Smith T. P. L., et al. (2019). Improved annotation of the domestic pig genome through integration of Iso-Seq and RNA-seq data. BMC Genomics 20, 344. 10.1186/s12864-019-5709-y