Plant Physiol. 2003 Jun;132(2):469-84.
doi: 10.1104/pp.102.018101.

Refined annotation of the Arabidopsis genome by complete expressed sequence tag mapping

Wei Zhu et al. Plant Physiol. 2003 Jun.

Abstract

Expressed sequence tags (ESTs) currently encompass more entries in the public databases than any other form of sequence data. Thus, EST data sets provide a vast resource for gene identification and expression profiling. We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-canonical pre-mRNA splicing. All results of this study were parsed into a database and are accessible via a flexible Web interface at http://www.plantgdb.org/AtGDB/.

Figures

Figure 1.
Classification of Arabidopsis ESTs based on spliced alignment quality. Of a total of 176,915 ESTs, 2,059 ESTs have no significant hits in the Arabidopsis genome, 4,968 ESTs have only low-quality spliced alignments (lqEST), and the remaining 169,888 ESTs have high-quality spliced alignments (hqSPAs). The latter category consists of 146,527 ESTs that match a unique (i.e. their cognate) locus in the genome (unique high-quality ESTs [uhqEST]) and 23,361 ESTs that have multiple hqSPAs (mhqESTs), representing different loci of duplicated genes or multigene families.
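The counts in the Figure 1 caption can be checked against the "about 96%" figure quoted in the abstract. The following is a minimal sketch of that arithmetic; the dictionary keys and the helper name `fraction` are illustrative choices, not identifiers from the paper or from AtGDB.

```python
# Counts copied from the Figure 1 caption.
TOTAL_ESTS = 176_915
counts = {
    "no_significant_hit": 2_059,   # no significant hit in the genome
    "lqEST": 4_968,                # only low-quality spliced alignments
    "uhqEST": 146_527,             # high-quality, unique locus
    "mhqEST": 23_361,              # high-quality, multiple loci
}

# ESTs with at least one high-quality spliced alignment.
hq_total = counts["uhqEST"] + counts["mhqEST"]


def fraction(n: int, total: int = TOTAL_ESTS) -> float:
    """Return n as a fraction of all mapped ESTs."""
    return n / total


# The four categories should account for every EST exactly once.
assert sum(counts.values()) == TOTAL_ESTS

for label, n in counts.items():
    print(f"{label:>20}: {n:>7,} ({fraction(n):.1%})")
print(f"{'hq total':>20}: {hq_total:>7,} ({fraction(hq_total):.1%})")
```

The four categories sum to the full set of 176,915 ESTs, and the high-quality share comes to roughly 96.0%, which agrees with the abstract.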
Figure 2.
Distribution of the 170 hqSPAs for EST gi:9787698 on the Arabidopsis genome. Each chromosome is represented by two dark-green bars, with the centromere marked by a space between the horizontal bars. Locations of the spliced alignments are shown by red bricks. Almost all hits are around the centromeres. Alignment scores suggest that the EST originates from the 12,075,567- to 12,075,806-bp region on chromosome 3 (marked by the green arrow). This EST shows high similarity with Arabidopsis gene At1g38360 (gi:18426880), a putative retroelement polyprotein gene. This display is shown as an example of the visualization tools at the Arabidopsis Genome Database (AtGDB), which dynamically generate similar graphics for any set of GenBank gi accessions or genes matched by common descriptions.
Figure 3.
Distribution of the score differences between maximal and submaximal scoring hqSPAs for mhqESTs. Each hqSPA is scored by the product of its similarity and coverage values (see “Materials and Methods”). Most of the score differences fall in the range 0.08 to 0.20. Based on the displayed distribution, a critical value of 0.015 was set: each hqSPA whose score differs from that of the maximal scoring hqSPA for a given EST by less than 0.015 is designated a pcSPA, representing the likely origin of this specific EST in the genome.
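The selection rule in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class `Hqspa`, the function `select_pcspas`, the locus labels, and the example values are all hypothetical, and the precise definitions of similarity and coverage are given in the paper's "Materials and Methods", which is not part of this excerpt. Only the product score and the 0.015 critical value come from the caption.

```python
from dataclasses import dataclass

# Critical score difference stated in the Figure 3 caption.
CRITICAL_DIFF = 0.015


@dataclass(frozen=True)
class Hqspa:
    """One high-quality spliced alignment of an EST (hypothetical record)."""
    locus: str          # illustrative label for the genomic locus
    similarity: float   # assumed to lie in [0, 1]
    coverage: float     # assumed to lie in [0, 1]

    @property
    def score(self) -> float:
        # Caption: each hqSPA is scored by similarity times coverage.
        return self.similarity * self.coverage


def select_pcspas(alignments, critical_diff=CRITICAL_DIFF):
    """Keep every hqSPA scoring within critical_diff of the best one."""
    if not alignments:
        return []
    best = max(a.score for a in alignments)
    # The top-scoring alignment has a difference of 0 and is always kept.
    return [a for a in alignments if best - a.score < critical_diff]


# Made-up alignments for a single EST with hits at three loci.
example = [
    Hqspa("locus_A", similarity=0.98, coverage=1.00),
    Hqspa("locus_B", similarity=0.97, coverage=1.00),
    Hqspa("locus_C", similarity=0.90, coverage=0.95),
]
pc_loci = [a.locus for a in select_pcspas(example)]
print(pc_loci)
```

In this made-up case, locus_A and locus_B differ in score by about 0.01 and are both kept, while locus_C falls about 0.125 below the maximum and is excluded. An EST can therefore have more than one pcSPA when near-identical duplicated loci cannot be told apart by score.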
Figure 4.
Histogram showing the distribution of pcSPA similarity, coverage, and combined scores.
Figure 5.
Visual assessment of EST clustering and gene characteristics for a region of the Arabidopsis genome. In the display, which is available for all genomic regions at http://www.plantgdb.org/AtGDB/, pcSPAs originating from EST spliced alignments are shown in red and non-pcSPAs in pink. For multi-exon alignments, the arrow indicates the direction of transcription, inferred from the implied splice site patterns (Usuka et al., 2000). Multi-exon 5′ ESTs are marked in green at their 5′ terminus, and multi-exon 3′ ESTs are marked in blue at their 3′ terminus. Single-exon ESTs have corresponding 5′/3′ labels at the center of their representations. Pairs of 5′ and 3′ ESTs from the same clone are grouped by green boxes. PcSPAs originating from cDNA spliced alignments are shown in light blue, and non-pcSPAs are shown in gray. Dark-blue gene structures represent the current GenBank gene annotations for this region. The 5′ and 3′ boundaries of the corresponding coding regions are indicated by green and red triangles, respectively. Note that the current annotation misses the gene represented by two clone pairs of ESTs: gi:19867004 with gi:19822861, and gi:19878951 with gi:19799838. The purple structures represent the spliced alignments of TIGR Arabidopsis Gene Index tentative contigs. The figure also shows an alternatively spliced internal mini-exon. This exon of 16 nucleotides occurs in the 5′-UTR of At4g38510, an H+-transporting ATPase (EC 3.6.1.35). The transcript isoform including this exon is supported by ESTs gi:9785303 and gi:8722457. In the same region, EST gi:9787070 supports a different internal exon of 73 nucleotides, and EST gi:19867985 (equal to RAFL-15010615) indicates an alternative transcription start. Note that all sequence records at AtGDB are identified by their unique GenBank gi identifiers.
The Riken Arabidopsis full-length (RAFL) cDNAs (Seki et al., 2002) thus indicated as RAFL-15451093, RAFL-18377451, RAFL-20268790, RAFL-21689814, RAFL-15010783, RAFL-14517367, RAFL-16323357, RAFL-15010615, and RAFL-19699257 correspond to clones RAFL05-11-M12, U16016, RAFL06-81-F18, U11966, RAFL03-01-G10, RAFL04-09-A19, U12748, RAFL07-17-H08, and U12937, respectively.
Figure 6.
Spliced alignment of Arabidopsis EST gi:5839990 with: A, the Arabidopsis At3g53520 gene encoding a dTDP-Glc 4-6-dehydratase-like protein; and B, a rice (Oryza sativa) genomic sequence (accession no. AP003271). The two alignments reveal conserved gene structure between Arabidopsis and rice, including a conserved AT-AC intron. C, Pair-wise alignment of the orthologous AT-AC intron sequences. The conserved donor site (ATATCCTY) and branch site (TCCTTRAY) motifs are highlighted in red.
Figure 7.
Visualization of an annotated, normally expressed internal mini-exon. The exon of six nucleotides found in the 3′-coding region of At5g14030 (encoding an unknown protein) is supported by 12 different EST spliced alignments. Strikingly, this miniature exon is also conserved in what appears to be a rice homolog of this gene (see Fig. 8). Symbols are as in Figure 5. The three cDNAs identified by GenBank gi as CT-21404330, RAFL-14517445, and RAFL-22136543 correspond to Ceres/TIGR full-length cDNA 16313 and RAFL clones RAFL02-05-J08 and U12778, respectively.
Figure 8.
Evolutionary conservation of a mini-exon. A, Spliced alignment of the translated ORF (bottom lines) originating from the EST cluster shown in Figure 7 with a rice genomic clone (GenBank accession no. AP003727); the alignment was made with the GeneSeqer program (Usuka and Brendel, 2000). B, Alignment of the Arabidopsis mini-exon and its flanking introns with a homologous region of the rice genome. The mini-exon is highlighted in red characters, the intron donor sites in green, and the intron acceptor sites in blue.
