Genome Res. 2003 Dec;13(12):2725-35.
doi: 10.1101/gr.1532103. Epub 2003 Nov 12.

Analysis and functional annotation of an expressed sequence tag collection for tropical crop sugarcane

André L Vettore et al. Genome Res. 2003 Dec.

Abstract

To contribute to our understanding of the genome complexity of sugarcane, we undertook a large-scale expressed sequence tag (EST) program. More than 260,000 cDNA clones were partially sequenced from 26 standard cDNA libraries generated from different sugarcane tissues. After the processing of the sequences, 237,954 high-quality ESTs were identified. These ESTs were assembled into 43,141 putative transcripts. Of the assembled sequences, 35.6% presented no matches with existing sequences in public databases. A global analysis of the whole SUCEST data set indicated that 14,409 assembled sequences (33% of the total) contained at least one cDNA clone with a full-length insert. Annotation of the 43,141 assembled sequences associated almost 50% of the putative identified sugarcane genes with protein metabolism, cellular communication/signal transduction, bioenergetics, and stress responses. Inspection of the translated assembled sequences for conserved protein domains revealed 40,821 amino acid sequences with 1415 Pfam domains. Reassembling the consensus sequences of the 43,141 transcripts revealed a 22% redundancy in the first assembly. This indicated that possibly 33,620 unique genes had been identified and that >90% of the sugarcane expressed genes were tagged.
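
The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are illustrative and not taken from the paper. The redundancy is back-calculated from the reported 33,620 unique genes, so it shows only that the rounded 22% figure is consistent with that estimate, not how the authors derived it.

```python
# Counts quoted in the abstract.
high_quality_ests = 237_954
assembled_sequences = 43_141      # putative transcripts (SASs)
full_length_sequences = 14_409    # SASs with at least one full-length clone
estimated_unique_genes = 33_620   # reported after reassembly

# Share of assembled sequences with a full-length insert (abstract: 33%).
full_length_pct = 100 * full_length_sequences / assembled_sequences

# Redundancy implied by the reassembly estimate (abstract: 22%).
implied_redundancy_pct = 100 * (1 - estimated_unique_genes / assembled_sequences)

# Average number of ESTs per assembled sequence (not stated in the abstract).
ests_per_sequence = high_quality_ests / assembled_sequences

print(f"full-length share:  {full_length_pct:.1f}%")
print(f"implied redundancy: {implied_redundancy_pct:.1f}%")
print(f"ESTs per sequence:  {ests_per_sequence:.1f}")
```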

Figures

Figure 1
Sugarcane gene prediction classification. We classified 26,525 SASs with similarities to known protein sequences in the nonredundant protein (nr) database into 18 functional categories. The categories were generated either by automatic BLASTX of SASs against the categorized databases (cog.aa, http://www.ncbi.nlm.nih.gov/cgi-bin/COG/palog?fun=all; egad.aa, http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_report.spl; kegg.aa, http://www.genome.ad.jp/kegg/kegg2.html; mips-at.aa, http://www.mips.biochem.mpg.de/cgi-bin/proj/thal/filter_funcat.pl?all) or by manual inspection of BLASTX matches in the nr database by members of the SUCEST Consortium. In both cases, the BLASTX E-value cutoff was ≤10⁻⁵. The percentage of SASs found in each category is indicated next to the corresponding pie chart sector. The average percentages of SASs per category do not add up to 100% because some contigs or singletons appear in more than one category.
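
A minimal sketch of why the category percentages can exceed 100% in total, assuming a simplified hit format of (SAS identifier, category, E-value) tuples. The function name, the toy hits, and the choice of the classified SASs as the denominator are assumptions made for illustration; only the E-value cutoff comes from the caption.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-5  # cutoff stated in the caption


def category_percentages(hits, cutoff=E_VALUE_CUTOFF):
    """Return the percentage of classified SASs falling in each category.

    hits: iterable of (sas_id, category, e_value) tuples (hypothetical format).
    An SAS may be assigned to several categories, so the values can sum
    to more than 100.
    """
    categories_by_sas = {}
    for sas_id, category, e_value in hits:
        if e_value <= cutoff:
            categories_by_sas.setdefault(sas_id, set()).add(category)

    counts = Counter(c for cats in categories_by_sas.values() for c in cats)
    classified = len(categories_by_sas)
    if classified == 0:
        return {}
    return {c: 100 * n / classified for c, n in counts.items()}


# Toy data, not taken from the SUCEST data set.
toy_hits = [
    ("SAS1", "protein metabolism", 1e-30),
    ("SAS1", "stress response", 1e-8),   # SAS1 lands in two categories
    ("SAS2", "bioenergetics", 1e-12),
    ("SAS3", "stress response", 0.5),    # fails the cutoff, so SAS3 is unclassified
]
percentages = category_percentages(toy_hits)
print(percentages, sum(percentages.values()))
```

With the toy data, two SASs pass the cutoff and one of them is counted in two categories, so the percentages total 150 rather than 100.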
Figure 2
Protein domain analysis. The 43,141 SASs were translated using the ESTScan algorithm, and the resulting 40,821 amino acid sequences were entered as queries in the Pfam database using the default settings of Pfam 7.0 [“global and local alignments merged” and “Pfam gathering threshold (GA)”]. A total of 12,921 SAS putative proteins produced significant matches with 1415 protein domain families of the Pfam database. (A) Number of distinct domains found for each SAS protein. The number of SAS proteins that contained one, two, three, four, or five distinct domains is shown. (B) Maximum number of repetitions for the top 14 repeated domains: nucleoporin FG (A), LRR (B), HEAT (C), M (D), PPR (E), TPR (F), XYPPX (G), WD40 (H), PC rep (I), ank (J), MORN (K), armadillo seq (L), PUF (M), and AT hook (N). The domains most often repeated in the same protein are shown along with the maximum number of repeats found for each domain. (C) Range of repetitions found for the LRR, PPR, TPR, WD40, rrm, and EF-hand domains. The domains with the most varied number of occurrences per SAS protein are indicated, along with the number of SAS proteins for each number of repeats.
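
The three panel summaries described above (distinct domains per protein, maximum repeats per domain, and the spread of repeat counts) can be derived from a flat list of domain hits. The following is a sketch under the assumption that each Pfam match has been reduced to a (protein identifier, domain name) pair; the function name and toy hits are hypothetical and do not reproduce the authors' pipeline.

```python
from collections import Counter, defaultdict


def summarize_domains(hits):
    """Summarize domain hits given as (protein_id, domain_name) pairs.

    Returns:
      distinct_hist: number of proteins having k distinct domains (panel A style)
      max_repeats:   largest repeat count seen for each domain (panel B style)
      repeat_hist:   per domain, number of proteins with n repeats (panel C style)
    """
    per_protein = defaultdict(Counter)
    for protein_id, domain in hits:
        per_protein[protein_id][domain] += 1

    distinct_hist = Counter(len(domains) for domains in per_protein.values())
    max_repeats = {}
    repeat_hist = defaultdict(Counter)
    for domains in per_protein.values():
        for domain, n in domains.items():
            max_repeats[domain] = max(n, max_repeats.get(domain, 0))
            repeat_hist[domain][n] += 1
    return distinct_hist, max_repeats, repeat_hist


# Toy hits, not taken from the SUCEST data set.
toy_hits = (
    [("P1", "LRR")] * 4 + [("P1", "pkinase")]
    + [("P2", "LRR")] * 2
    + [("P3", "WD40")] * 7
)
distinct_hist, max_repeats, repeat_hist = summarize_domains(toy_hits)
print(dict(distinct_hist), max_repeats, dict(repeat_hist["LRR"]))
```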
Figure 3
The number of occurrences for the 25 most common Pfam domains in SAS proteins. The 43,141 SASs were translated using the ESTScan algorithm, and the resulting 40,821 amino acid sequences were then entered as queries in the Pfam database. For protein domain analysis, the default settings of Pfam 7.0 [“global and local alignments merged” and “Pfam gathering threshold (GA)”] were used.
Figure 4
The most common domains in tissue-enriched SASs. The SASs were inspected for tissue specificity using the library of origin of their EST components. An SAS was considered tissue-enriched when it contained at least three ESTs found exclusively in a given library. Of the 43,141 SASs, 1234 were tissue-enriched and were inspected for the presence of conserved protein domains. The SASs were translated using the ESTScan algorithm. Domain occurrence was revealed by querying the Pfam database with the corresponding amino acid sequences using the default settings of Pfam 7.0 [“global and local alignments merged” and “Pfam gathering threshold (GA)”]. The most represented domains among the tissue-enriched SASs are shown, along with the number of SASs for each of them.
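
The tissue-enrichment rule can be sketched as a simple filter. The caption's wording leaves some room for interpretation; this sketch assumes the rule means that every EST in the SAS comes from a single library and that there are at least three of them. The function name, the input format, and the library labels are hypothetical.

```python
def tissue_enriched(sas_to_libraries, min_ests=3):
    """Return {sas_id: library} for SASs that meet the assumed rule.

    sas_to_libraries: dict mapping each SAS id to a list holding the
    library of origin of every EST in that SAS (hypothetical format).
    Assumed rule: at least `min_ests` ESTs, all from the same library.
    """
    enriched = {}
    for sas_id, libraries in sas_to_libraries.items():
        if len(libraries) >= min_ests and len(set(libraries)) == 1:
            enriched[sas_id] = libraries[0]
    return enriched


# Toy data with made-up library labels.
toy_sas = {
    "SAS1": ["root", "root", "root"],          # enriched: three ESTs, one library
    "SAS2": ["root", "root"],                  # too few ESTs
    "SAS3": ["leaf", "leaf", "leaf", "stem"],  # ESTs from more than one library
    "SAS4": ["flower"] * 5,                    # enriched
}
enriched = tissue_enriched(toy_sas)
print(enriched)
```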
Figure 5
The most common transcription factor domains in tissue-enriched SASs. The SASs were inspected for tissue specificity using the library of origin of their EST components. An SAS was considered tissue-enriched when it contained at least three ESTs found exclusively in a given library. The SASs were translated using the ESTScan algorithm. Domain occurrence was revealed by querying the Pfam database with the corresponding amino acid sequences using the default settings of Pfam 7.0 [“global and local alignments merged” and “Pfam gathering threshold (GA)”]. The 10 most frequent transcription factor domains were determined (see Fig. 6) and inspected for tissue specificity. (A) The most represented transcription factor domains among the tissue-enriched SASs along with the number of SASs in each of them. (B) The same data as in A showing the tissue origin and number of occurrences.
Figure 6
The 10 most common transcription factor Pfam domains in SAS proteins. The 43,141 SASs were translated using the ESTScan algorithm, and the resulting 40,821 amino acid sequences were then entered as queries in the Pfam database using the default settings of Pfam 7.0 [“global and local alignments merged” and “Pfam gathering threshold (GA)”]. The 10 most represented domains typical of transcription factors are indicated along with the number of occurrences for each one. The numbers of predicted transcription factors found in Arabidopsis thaliana (Riechmann et al. 2000) and Oryza sativa ssp. japonica (Goff et al. 2002) for the corresponding domains are also indicated.

References

    1. Adams, M.D., Kerlavage, A.R., Fleischmann, R.D., Fuldner, R.A., Bult, C.J., Lee, N.H., Kirkness, E.F., Weinstock, K.G., Gocayne, J.D., White, O., et al. 1995. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature 377: 3-174.
    1. al Janabi, S.M., Honeycutt, R.J., McClelland, M., and Sobral, B.W. 1993. A genetic linkage map of Saccharum spontaneum L. 'SES 208.' Genetics 134: 1249-1260.
    1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
    1. The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815.
    1. Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L., and Sonnhammer, E.L. 2000. The Pfam protein families database. Nucleic Acids Res. 28: 263-266.

WEB SITE REFERENCES

    1. http://apps.fao.org; FAOSTAT Home Page.
    1. http://mips.gsf.de/proj/thal/db/tables/tables_func_frame.html; MATDB tables.
    1. http://sucest.lad.ic.unicamp.br/public; SUCEST Home Page.
    1. http://www.genome.ad.jp/kegg/kegg2.html; KEGG Encyclopedia.
    1. http://www.mips.biochem.mpg.de/cgi-bin/proj/thal/filter_funcat.pl?all; A. thaliana. Browse all contigs by functional catalog.