BMC Genomics. 2010 Mar 30;11:210.
doi: 10.1186/1471-2164-11-210.

Large-scale analysis of full-length cDNAs from the tomato (Solanum lycopersicum) cultivar Micro-Tom, a reference system for the Solanaceae genomics

Koh Aoki et al. BMC Genomics. 2010.

Abstract

Background: The Solanaceae family includes several economically important vegetable crops. The tomato (Solanum lycopersicum) is regarded as a model plant of the Solanaceae family. Recently, a number of tomato resources have been developed in parallel with the ongoing tomato genome sequencing project. In particular, a miniature cultivar, Micro-Tom, is regarded as a model system in tomato genomics, and a number of genomics resources in the Micro-Tom background, such as ESTs and mutagenized lines, have been established by an international alliance.

Results: To accelerate progress in tomato genomics, we developed a collection of 13,227 fully sequenced Micro-Tom full-length cDNAs. By checking for redundant sequences, coding sequences, and chimeric sequences, a set of 11,502 non-redundant full-length cDNAs (nrFLcDNAs) was generated. Analysis of untranslated regions demonstrated that tomato has longer 5'- and 3'-untranslated regions than most other plants except rice. Classification of the functions of proteins predicted from the coding sequences demonstrated that the nrFLcDNAs covered a broad range of functions. A comparison of the nrFLcDNAs with genes of sixteen plants facilitated the identification of tomato genes that are not found in other plants, most of which did not have known protein domains. Mapping of the nrFLcDNAs onto currently available tomato genome sequences facilitated prediction of exon-intron structure. Introns of tomato genes were longer than those of Arabidopsis and rice. According to a comparison of exon sequences between the nrFLcDNAs and the tomato genome sequences, the frequency of nucleotide mismatch in exons between Micro-Tom and the genome-sequencing cultivar (Heinz 1706) was estimated to be 0.061%.

Conclusion: The collection of Micro-Tom nrFLcDNAs generated in this study will serve as a valuable genomic tool for plant biologists to bridge the gap between basic and applied studies. The nrFLcDNA sequences will help in annotating the tomato whole-genome sequence and aid in tomato functional genomics and molecular breeding. Full-length cDNA sequences and their annotations are provided in the database KaFTom (http://www.pgb.kazusa.or.jp/kaftom/) via the website of the National Bioresource Project Tomato (http://tomato.nbrp.jp).

Figures

Figure 1
Data processing scheme of tomato full-length cDNA sequences. (A) Scheme for data processing of tomato full-length cDNA sequences. Four separate full-length-enriched libraries, LEFL1, FC, LEFL2, and LEFL3, were constructed. From randomly chosen clones, we obtained 30,679, 8,046, 18,697, and 27,216 high-quality 5'-end sequences from the LEFL1, FC, LEFL2, and LEFL3 libraries, respectively. These high-quality 5'-end sequences were registered in the EST division of the DDBJ. They were combined with 238,157 public tomato ESTs and then clustered into 76,276 groups. Clusters containing at least one FC or LEFL sequence were selected. The FC or LEFL sequence with the longest 5'-extension was chosen as the representative of each cluster and sent for full-length sequencing. Full-length sequencing was completed for 13,227 cDNAs, which were registered in the high throughput cDNA (HTC) division of the DDBJ. From the 13,227 HTCs, 12,106 non-redundant full-length cDNAs were chosen. This set of 12,106 full-length cDNAs was screened for non-coding RNA-derived cDNAs, pathogen transcript-derived cDNAs, chimeric clones, and cDNAs containing retained introns. After excluding these sequences, a set of 11,597 non-redundant HTCs was checked for CDS. Finally, a set of 11,502 non-redundant full-length cDNAs (nrFLcDNAs) was generated for subsequent sequence analyses. (B) Distribution of the number of 5'-end sequences derived from FC and LEFL cDNA libraries in each cluster.
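The representative-selection step in panel (A) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `EndRead` record, the `five_prime_extension` field (how far a read extends upstream within its cluster), and the "public" library label are all hypothetical names introduced here, and the caption does not say how the extension was measured.

```python
from dataclasses import dataclass
from typing import Dict, List

# Library names taken from the caption; everything else is illustrative.
FL_LIBRARIES = {"LEFL1", "FC", "LEFL2", "LEFL3"}


@dataclass(frozen=True)
class EndRead:
    read_id: str
    library: str               # "LEFL1", "FC", ... or "public" for public ESTs (assumed label)
    five_prime_extension: int  # hypothetical measure of upstream reach within the cluster


def pick_representatives(clusters: Dict[str, List[EndRead]]) -> Dict[str, EndRead]:
    """Keep clusters with at least one FC/LEFL read and return, for each,
    the FC/LEFL read with the longest 5'-extension."""
    representatives = {}
    for cluster_id, members in clusters.items():
        candidates = [m for m in members if m.library in FL_LIBRARIES]
        if not candidates:
            continue  # cluster made only of public ESTs: not selected
        representatives[cluster_id] = max(
            candidates, key=lambda m: m.five_prime_extension
        )
    return representatives
```

Public ESTs still take part in clustering in this sketch but are never chosen as representatives, which matches the caption's description that only FC or LEFL clones were sent for full-length sequencing.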
Figure 2
Similarity of nrFLcDNAs with public tomato sequences. (A) Similarity of the 11,502 nrFLcDNAs to DFCI tomato tentative consensus (TC), SGN tomato unigene, and the prerelease of tomato genome shotgun sequence (Tomato GSS). Black bars: nrFLcDNAs that showed very high similarity (E-value < 1e-180); white bars: nrFLcDNAs that showed very low similarity (E-value > 1e-10). (B) Similarity of amino acid sequences predicted from the 11,502 nrFLcDNAs to proteins in nr (white rectangle) and the Tomato SBM database (black rectangle).
Figure 3
Distribution of characteristic lengths of nrFLcDNAs. (A) cDNA insert length, (B) CDS length, (C) 5'-UTR length, and (D) 3'-UTR length. 5'- and 3'-UTR lengths were longer than those of Arabidopsis, soybean, poplar, and maize, and slightly shorter than those of rice.
Figure 4
Scheme for detection of alternatively spliced transcripts. Each nrFLcDNA was searched against the nrFLcDNA dataset, DFCI tomato TCs, and SGN tomato unigenes using BLASTN. Query nrFLcDNAs were designated as "target nrFLcDNAs." Sequences that matched the target nrFLcDNA at a threshold E-value of 1e-50 were grouped into a multi-sequence group. Intron detection was performed using the est2genome program by setting the target nrFLcDNA as "est" and the member nrFLcDNA, unigene, or TC as "genome", and vice versa. Consequently, splicing variants were detected for 1,206 target nrFLcDNAs. Of these 1,206 nrFLcDNAs, 388 were aligned to SGN tomato BAC sequences (bacs.v374.seq.20081128091837). Classification of alternative splicing events was carried out manually using the 388 multi-sequence groups.
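The grouping step described above can be sketched as follows. The sketch assumes the BLASTN results are available in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, E-value in column 11), which the caption does not specify; it also assumes that a hit qualifies when its E-value is at or below the threshold, and that self-hits are ignored. The function name `group_hits` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def group_hits(tabular_lines: Iterable[str],
               max_evalue: float = 1e-50) -> Dict[str, Set[str]]:
    """Build one multi-sequence group per target nrFLcDNA from BLASTN
    tabular output, keeping subjects whose E-value is <= max_evalue."""
    groups = defaultdict(set)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if query == subject:
            continue  # a sequence trivially matches itself
        if evalue <= max_evalue:
            groups[query].add(subject)
    return dict(groups)
```

Each resulting group pairs a target nrFLcDNA with candidate sequences that could then be passed to an aligner such as est2genome; that alignment step is not reproduced here.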
Figure 5
Profile of GO annotations for nrFLcDNAs. To obtain GO annotations for nrFLcDNAs, similarity between the amino acid sequences predicted from nrFLcDNAs and Arabidopsis proteins was assessed using BLASTP. Arabidopsis genes corresponding to the top-hit Arabidopsis protein for each nrFLcDNA were chosen at a threshold E-value of 1e-50 (gray bars). GO annotations for nrFLcDNAs were then retrieved by subjecting the list of Arabidopsis genes to a TAIR GO annotation search (http://www.arabidopsis.org/tools/bulk/go/index.jsp). GO annotations for all Arabidopsis genes (black bars) were retrieved from the TAIR GO annotation search. No statistically significant difference in frequency was observed in any category.
Figure 6
Profile of metabolic pathway annotations for the nrFLcDNAs according to LycoCyc. To obtain LycoCyc annotations for the nrFLcDNAs, similarity between nrFLcDNAs and SGN tomato unigenes was assessed using BLASTN. The top-hit SGN tomato unigene for each nrFLcDNA was extracted at a threshold E-value of 1e-180 (gray bars). LycoCyc pathway annotations for the top-hit unigenes were then regarded as annotations for nrFLcDNAs. Pathway annotations for all SGN tomato unigenes (black bars) were retrieved from documents of LycoCyc ver 1.0.1.1. Asterisks indicate that the difference in frequency is statistically significant at the 5% level.
Figure 7
Comparison of nrFLcDNAs with genes of other plants. (A) Percentage of nrFLcDNAs with very high similarity (E-value < 1e-180, black) and very low similarity (E-value > 1e-10, gray) to genes of other plants. (B) Distribution of the number of nrFLcDNAs that matched plant unigenes with intermediate similarity (E-value between 1e-180 and 1e-10).
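The three similarity classes used in this caption can be expressed as a small classifier. This is a sketch under stated assumptions: the function names are hypothetical, each nrFLcDNA is represented by the E-value of its best hit, and a sequence with no hit at all (`None`) is counted as very low similarity, a choice the caption does not address.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

VERY_HIGH_BELOW = 1e-180  # E-value < 1e-180: very high similarity
VERY_LOW_ABOVE = 1e-10    # E-value > 1e-10: very low similarity


def similarity_class(best_evalue: Optional[float]) -> str:
    """Assign one nrFLcDNA to a similarity class from its best-hit E-value."""
    if best_evalue is None:
        return "very_low"  # assumption: no hit is treated as very low similarity
    if best_evalue < VERY_HIGH_BELOW:
        return "very_high"
    if best_evalue > VERY_LOW_ABOVE:
        return "very_low"
    return "intermediate"


def class_percentages(best_evalues: Iterable[Optional[float]]) -> Dict[str, float]:
    """Percentage of sequences falling in each similarity class."""
    values = list(best_evalues)
    names = ("very_high", "intermediate", "very_low")
    if not values:
        return {name: 0.0 for name in names}
    counts = Counter(similarity_class(e) for e in values)
    return {name: 100.0 * counts[name] / len(values) for name in names}
```

BLAST reports extremely small E-values as 0.0, which this classifier places in the very high similarity class.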
Figure 8
Distribution of the number and lengths of exons and introns. (A) Number of exons per gene. (B) Distribution of exon length (black curve) and intron length (gray curve). (C) Distribution of lengths of the initial exons (black rectangle), the internal exons (white rectangle), the terminal exons (white triangle), and the single exons (dotted line). (D) Distribution of gene length.
