BMC Genomics. 2012 Nov 5;13:595.
doi: 10.1186/1471-2164-13-595.

Efficient assembly and annotation of the transcriptome of catfish by RNA-Seq analysis of a doubled haploid homozygote


Shikai Liu et al. BMC Genomics.

Abstract

Background: Once a whole genome has been sequenced, thorough annotation that associates genome sequences with biological meaning is essential. Genome annotation depends on the availability of transcript information as well as orthology information. In teleost fish, genome annotation is seriously hindered by genome duplication: because of gene duplications, orthologies cannot be established by homology comparisons alone. Rather, intensive phylogenetic or structural analysis is required to establish orthologies and identify genes, and such analyses require full-length transcripts. Generating large numbers of full-length transcripts by traditional transcript sequencing is very difficult and extremely costly.

Results: In this work, we took advantage of a doubled haploid catfish, which has two identical sets of chromosomes and therefore, in theory, no allelic variation. As such, transcript sequences generated by next-generation sequencing can be readily assembled into full-length transcripts. Deep sequencing of the doubled haploid channel catfish transcriptome was performed on the Illumina HiSeq 2000 platform, yielding over 300 million high-quality trimmed reads totaling 27 Gbp. Assembly of these reads generated 370,798 non-redundant transcript-derived contigs. Functional annotation of the assembly identified 25,144 unique protein-encoding genes. A total of 2,659 unique genes were identified as putative duplicated genes in the catfish genome because the assemblies of the corresponding transcripts harbored paralogous sequence variants (PSVs) or multisite variants (MSVs), which appear as pseudo-SNPs in the assembly. Of the 25,144 contigs with unique protein hits, around 20,000 matched 50% of the length of their reference proteins, and over 14,000 transcripts were identified as full-length with complete open reading frames. Characterization of the consensus sequences surrounding the start and stop codons confirmed the correct assembly of the full-length transcripts.
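As an illustration of what "full-length with complete open reading frames" can mean operationally, the following is a minimal sketch that scans the forward strand of a transcript for the longest ORF running from an ATG to an in-frame stop codon and reports the resulting 5’-UTR, ORF, and 3’-UTR lengths. The function names `longest_complete_orf` and `split_transcript` and the `min_codons` cutoff are assumptions made for this example; the abstract does not describe the authors' actual procedure, which may differ (for instance, by using protein homology to choose the reading frame).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_complete_orf(seq, min_codons=30):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    `end` is exclusive and includes the stop codon. Returns None when no ORF
    of at least `min_codons` codons (start and stop codons included) exists.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # first ATG seen since the last stop codon in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                long_enough = (end - start) // 3 >= min_codons
                if long_enough and (best is None or end - start > best[1] - best[0]):
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


def split_transcript(seq, min_codons=30):
    """Split a transcript into 5'-UTR, ORF, and 3'-UTR lengths, or return None."""
    orf = longest_complete_orf(seq, min_codons)
    if orf is None:
        return None
    start, end = orf
    return {"utr5_len": start, "orf_len": end - start, "utr3_len": len(seq) - end}


# Toy sequence (not real catfish data): 2 nt 5'-UTR, 6-codon ORF, 4 nt 3'-UTR.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GGGG"
print(split_transcript(toy, min_codons=2))
```

A transcript for which `split_transcript` returns `None` has no ATG followed by an in-frame stop codon of sufficient length on the forward strand, so under this sketch it would not count as full-length.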

Conclusions: The large set of transcripts assembled in this study is the most comprehensive set of genome resources ever developed from catfish. It will provide much-needed resources for functional genome research in catfish, serving as a reference transcriptome for genome annotation, analysis of gene duplication and gene family structures, and digital gene expression analysis. The putative set of duplicated genes provides a starting point for genome-scale analysis of gene duplication in the catfish genome, and should be a valuable resource for comparative genome analysis, genome evolution, and genome function studies.

Figures

Figure 1
Distribution of taxonomic groups of BLAST top hit species. (A) The top hit species of BLAST searches are categorized into vertebrates, invertebrates, plants, bacteria, fungi, and viruses, and their percentages are presented. (B) The top hit species of BLAST searches within vertebrates are sub-categorized into mammals, birds, amphibians, reptiles, and fish, and their percentages are presented. (C) The top hit species of BLAST searches within fish are sub-categorized into various fish species as indicated, and their percentages are presented. (D) The top hit species of BLAST searches within bacteria are sub-categorized into various bacterial species as indicated, and their percentages are presented.
Figure 2
Comparison of contig lengths between contigs with and without protein hits. The X-axis represents the contig length, and the Y-axis represents the percentage of contigs. Note that most contigs without significant hits in public protein databases are short (83% are shorter than 600 bp), whereas a high proportion of contigs with protein hits are long.
Figure 3
Homologous distribution of identified catfish genes on zebrafish chromosomes. The X-axis represents the 25 zebrafish chromosomes. The left Y-axis represents the number of genes, and the right Y-axis represents the percentage of zebrafish genes on each chromosome that were identified in catfish.
Figure 4
Detection of putative catfish gene duplicates. The X-axis represents the number of PSVs or MSVs detected, and the Y-axis represents the number of putative duplicated genes in catfish that contained the PSVs or MSVs.
Figure 5
Comparison of the lengths of deduced catfish proteins with homologous reference proteins from databases. (A) Fractional distribution of catfish protein lengths relative to reference protein lengths; e.g., 0 indicates that the catfish and reference proteins have identical lengths, and -0.2 and 0.2 indicate that the catfish protein is 20% shorter or longer than the reference protein, respectively, and so on. Overall, 16,538 transcripts (66%) of the 25,144 identified unique catfish genes fall within the 80% bracket relative to the lengths of their reference protein counterparts. (B) Histogram of the ratio of predicted catfish protein length to reference protein length (left Y-axis), with the curved line denoting the cumulative percentage (right Y-axis). The X-axis is the ratio of predicted catfish protein length to the corresponding reference orthologous protein length, i.e., catfish protein length/reference protein length. Note that the unit of the X-axis is ten times that in (A), i.e., a ratio of 1.0 represents transcripts whose lengths fall within the 95% bracket relative to the lengths of their reference protein counterparts, etc. The left Y-axis represents the number of occurrences of catfish protein lengths, in thousands, and the right Y-axis represents the cumulative percentage.
Figure 6
Length distributions of putative catfish full-length transcripts (A), ORF (B), 5’-UTR (C), and 3’-UTR (D).
Figure 7
Length comparisons of deduced catfish ORFs of the assembled full-length transcripts with homologous reference proteins. X-axis: catfish predicted protein length (amino acids), and Y-axis: reference protein length (amino acids).
Figure 8
Analysis of Kozak consensus sequences surrounding the start codon AUG in the catfish full-length transcripts. Kozak consensus sequences were illustrated by WebLogo using stacks of symbols, one stack for each position in the sequence. The size of symbols within the stack indicates the relative frequency of each base at that position.
Figure 9
Analysis of consensus sequences surrounding the stop codon in catfish full-length transcripts. The consensus sequences were illustrated by WebLogo using stacks of symbols, one stack for each position in the sequence. The size of the symbols within the stack indicates the relative frequency of each base at that position.
Figure 10
Evaluation of sequencing depth for the assembly of the catfish transcriptome. (A) Assemblies were evaluated based on the number of assembled contigs with length ≥ 200 bp and ≥ 1 kb. The X-axis represents assemblies at various sequencing depths generated by CLC Genomics Workbench, the left Y-axis represents the number of contigs with length ≥ 200 bp, in thousands, and the right Y-axis represents the number of contigs with length ≥ 1 kb, in thousands. (B) Assemblies were evaluated based on the number of zebrafish proteins that were identified in the assembled catfish contigs. The X-axis represents assemblies at various sequencing depths generated by CLC Genomics Workbench, the left Y-axis represents the percentage of zebrafish proteins that can be detected in catfish, and the right Y-axis represents the percentage of zebrafish proteins that can be detected in catfish with match length ≥ 90%.
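The two metrics described for Figure 10 can be sketched in a few lines. This is a minimal illustration, assuming the contig lengths of one assembly are available as a list and that, for each zebrafish protein with a hit, the matched fraction of its length is already known. The function names, the input layout, and the toy numbers below are all hypothetical; the caption does not say how the authors computed these values.

```python
def count_contigs_by_length(contig_lengths, thresholds=(200, 1000)):
    """Count contigs whose length in bp meets each threshold (panel A style)."""
    return {t: sum(1 for n in contig_lengths if n >= t) for t in thresholds}


def fraction_detected(match_fractions, total_proteins, min_fraction=0.0):
    """Fraction of reference proteins detected in the assembly (panel B style).

    match_fractions maps each reference protein that has a hit to
    (matched length / reference protein length); proteins without a hit are
    simply absent. total_proteins is the size of the reference protein set.
    """
    hits = sum(1 for f in match_fractions.values() if f >= min_fraction)
    return hits / total_proteins


# Toy inputs for one assembly (invented values, not data from the study).
lengths = [150, 200, 999, 1000, 2500]
matches = {"p1": 0.95, "p2": 0.5, "p3": 0.9}

print(count_contigs_by_length(lengths))
print(fraction_detected(matches, total_proteins=4))
print(fraction_detected(matches, total_proteins=4, min_fraction=0.9))
```

Running both functions on assemblies built from increasing numbers of reads would show whether the curves flatten out, which is the kind of saturation check the figure describes.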
