BMC Genomics. 2005 Oct 19;6:144.

doi: 10.1186/1471-2164-6-144.

Generation, annotation, analysis and database integration of 16,500 white spruce EST clusters

PMID: 16236172
PMCID: PMC1277824
DOI: 10.1186/1471-2164-6-144

Nathalie Pavy et al. BMC Genomics. 2005.

Affiliation

¹ Pavillon Charles-Eugène-Marchand, Université Laval, Ste.Foy, Québec G1K 7P4, Canada. nathalie.pavy@rsvs.ulaval.ca

Abstract

Background: The sequencing and analysis of ESTs is for now the only practical approach for large-scale gene discovery and annotation in conifers because their very large genomes are unlikely to be sequenced in the near future. Our objective was to produce extensive collections of ESTs and cDNA clones to support manufacture of cDNA microarrays and gene discovery in white spruce (Picea glauca [Moench] Voss).

Results: We produced 16 cDNA libraries from different tissues and a variety of treatments, and partially sequenced 50,000 cDNA clones. High quality 3' and 5' reads were assembled into 16,578 consensus sequences, 45% of which represented full length inserts. Consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define 14,471 transcripts. A large proportion (84%) of the spruce sequences matched a pine sequence, but only 68% of the spruce transcripts had homologs in Arabidopsis or rice. Nearly all the sequences that matched the Populus trichocarpa genome (the only sequenced tree genome) also matched rice or Arabidopsis genomes. We used several sequence similarity search approaches for assignment of putative functions, including blast searches against general and specialized databases (transcription factors, cell wall related proteins), Gene Ontology term assignment and Hidden Markov Model searches against PFAM protein families and domains. In total, 70% of the spruce transcripts displayed matches to proteins of known or unknown function in the Uniref100 database (blastx e-value < 1e-10). We identified multigenic families that appeared larger in spruce than in the Arabidopsis or rice genomes. Detailed analysis of translationally controlled tumour proteins and S-adenosylmethionine synthetase families confirmed a twofold size difference. Sequences and annotations were organized in a dedicated database, SpruceDB. Several search tools were developed to mine the data based either on their occurrence in the cDNA libraries or on functional annotations.

Conclusion: This report illustrates specific approaches for large-scale gene discovery and annotation in an organism that is very distantly related to any of the fully sequenced genomes. The ArboreaSet sequences and cDNA clones represent a valuable resource for investigations ranging from plant comparative genomics to applied conifer genetics.
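The e-value filtering mentioned in the Results (blastx e-value < 1e-10) can be illustrated with a minimal Python sketch. It assumes that hits are stored as 12-column tab-separated text with the query ID in column 1 and the e-value in column 11; that layout, the function name `transcripts_with_hits`, and the toy rows are illustrative assumptions and are not taken from the paper or from SpruceDB.

```python
import csv
import io

EVALUE_CUTOFF = 1e-10  # threshold quoted in the abstract


def transcripts_with_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return the set of query IDs with at least one hit below the cutoff.

    Assumes 12-column tab-separated hits (query ID in column 1,
    e-value in column 11). This layout is an assumption, not the
    paper's documented file format.
    """
    matched = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) < cutoff:
            matched.add(row[0])
    return matched


# Hypothetical toy hits: t1 and t3 pass the cutoff, t2 does not,
# and t4 (listed below) has no hit at all.
TOY_HITS = "\n".join([
    "t1\tUniRef100_A\t85.0\t120\t18\t0\t1\t360\t1\t120\t1e-40\t160",
    "t1\tUniRef100_B\t40.0\t60\t36\t0\t10\t190\t5\t64\t3e-05\t45",
    "t2\tUniRef100_C\t35.0\t50\t32\t1\t1\t150\t1\t50\t2e-04\t38",
    "t3\tUniRef100_D\t70.0\t90\t27\t0\t1\t270\t1\t90\t5e-11\t70",
])

all_transcripts = ["t1", "t2", "t3", "t4"]
matched = transcripts_with_hits(TOY_HITS)
fraction = len(matched) / len(all_transcripts)
print(sorted(matched), f"{fraction:.0%}")
```

Counting each transcript once, however many hits it has, is what turns a hit table into a percentage of transcripts with a match, as reported in the abstract.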

Figures

**Figure 1**
Composition of white spruce consensus sequences (contigs and singletons) according to the orientation of the reads (3' or 5') and according to their redundancy in the database (number of clones).

**Figure 2**
**Sequence sizes**. Size distribution of the consensus sequences derived from the pine (PGI5.0) and white spruce (ArboreaSet) assemblies.

**Figure 3**
**Sequence similarities**. Number of white spruce transcript sequences similar to Uniref100 proteins and to *Arabidopsis*, pine, and *Cycas* sequences, according to the blast e-value cutoff.

**Figure 4**
**Hierarchical presentation of the number of spruce transcripts with or without similarities with pine, *Arabidopsis*, rice and poplar**. The numbers were derived by filtering the *tblastx* search results with an e-value < 1e-10.

**Figure 5**
**Protein families**. Occurrence of the 30 most abundant protein families in the white spruce dataset identified by HMM searches with an e-value < 1e-10 against the PFAM database.

**Figure 6**
**Number of spruce consensus sequences (identified by HMM searches against PFAM) relative to the size of the gene families in *Arabidopsis* (a) and rice (b)**. Each point represents a protein family detected by the HMM searches with p-score < 1e-10. Point coordinates are the number of genes found in the analysed angiosperm genome (x axis) and the number of contigs found in the spruce database (y axis), after a log transformation. The red, blue and green lines represent the ratios 1:1, 1:2, and 1:4, respectively. Red points represent sequences found 4 times more in white spruce than in *Arabidopsis*: 1. AWPM-19-like family [PF05512], 2. Chalcone and stilbene synthases, C-terminal domain [PF02797], 3. Phosphoenolpyruvate carboxykinase [PF01293]. Blue points represent sequences found 4 times more in spruce than in rice: 4. Ribosomal protein S28e [PF01200], 5. Cyclin-dependent kinase regulatory subunit [PF01111], 6. TIR domain [PF01582], 7. Splicing factor 3B subunit 10 [PF07189], 8. Ribosomal Proteins L2, C-terminal domain [PF03947]. Green points represent sequences found 4 times more in spruce than in both *Arabidopsis* and rice: 9. Translationally controlled tumour protein [PF00838], 10. S-adenosyl-L-homocysteine hydrolase [PF05221], 11. S-adenosylmethionine synthetase, C-terminal domain [PF02773].
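The colour coding described in the Figure 6 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the caption does not give the log base (log10 is assumed here), "4 times more" is read as a ratio of at least 4, and the function names and counts are hypothetical rather than taken from the paper.

```python
import math


def log_point(reference_count, spruce_count):
    """Return (x, y) plot coordinates after a log10 transformation.

    The log base is an assumption; both counts must be positive.
    """
    return (math.log10(reference_count), math.log10(spruce_count))


def classify_family(spruce, arabidopsis, rice, fold=4.0):
    """Colour a protein family by its spruce-to-angiosperm size ratio.

    'green' : at least `fold` times larger than in both genomes
    'red'   : at least `fold` times larger than in Arabidopsis only
    'blue'  : at least `fold` times larger than in rice only
    Reading "4 times more" as ">= 4" is an assumption.
    """
    over_arabidopsis = spruce >= fold * arabidopsis
    over_rice = spruce >= fold * rice
    if over_arabidopsis and over_rice:
        return "green"
    if over_arabidopsis:
        return "red"
    if over_rice:
        return "blue"
    return "unflagged"


# Hypothetical counts: (spruce contigs, Arabidopsis genes, rice genes)
examples = {
    "family_A": (8, 1, 1),
    "family_B": (8, 2, 4),
    "family_C": (8, 4, 2),
    "family_D": (3, 2, 2),
}
labels = {name: classify_family(*counts) for name, counts in examples.items()}
print(labels)
```

On the log-transformed axes a fixed ratio becomes a straight line parallel to the 1:1 diagonal, which is why the caption can mark the 1:2 and 1:4 ratios with lines.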

**Figure 7**
**SpruceDB core tables and data sources**. Data from flat files on ESTs, Assemblies and *blast* hits is loaded into the core tables Read, Contig, Contig_Element and Blast_Hsp. Additional information on taxonomy identifiers and Uniref100 peptides is obtained from shared databases.

**Figure 8**
**Examples of the interface of the SpruceDB database**. A) Use of Query 1 to search for contigs matching "cinnamoyl alcohol dehydrogenase" among the blastx results loaded in the database. B) Display of the results indicating alignment parameters (alignment length, similarity and identity level). C) BioDATA page linked to by clicking on MNC5693153 in Query 1 results. The upper figure illustrates the alignment of the members of the contigs in a color-coded manner. Read names written in blue and white refer to 5' and 3' reads, respectively. D) Query 8, which retrieves sequence aliases and library names for specified MN_Ids. E) Query 8 results showing libraries GQ004 and GQ006.

Cited by

A spruce gene map infers ancient plant genome reshuffling and subsequent slow evolution in the gymnosperm lineage leading to extant conifers.
Pavy N, Pelgas B, Laroche J, Rigault P, Isabel N, Bousquet J. BMC Biol. 2012 Oct 26;10:84. doi: 10.1186/1741-7007-10-84. PMID: 23102090. Free PMC article.
Comparative genome mapping among Picea glauca, P. mariana x P. rubens and P. abies, and correspondence with other Pinaceae.
Pelgas B, Beauseigle S, Acheré V, Jeandroz S, Bousquet J, Isabel N. Theor Appl Genet. 2006 Nov;113(8):1371-93. doi: 10.1007/s00122-006-0354-7. Epub 2006 Oct 24. PMID: 17061103.
Modern Approaches for Transcriptome Analyses in Plants.
Riaño-Pachón DM, Espitia-Navarro HF, Riascos JJ, Margarido GRA. Adv Exp Med Biol. 2021;1346:11-50. doi: 10.1007/978-3-030-80352-0_2. PMID: 35113394.
Identification and functional characterization of monofunctional ent-copalyl diphosphate and ent-kaurene synthases in white spruce reveal different patterns for diterpene synthase evolution for primary and secondary metabolism in gymnosperms.
Keeling CI, Dullat HK, Yuen M, Ralph SG, Jancsik S, Bohlmann J. Plant Physiol. 2010 Mar;152(3):1197-208. doi: 10.1104/pp.109.151456. Epub 2009 Dec 31. PMID: 20044448. Free PMC article.
Comparative analysis of the small RNA transcriptomes of Pinus contorta and Oryza sativa.
Morin RD, Aksay G, Dolgosheina E, Ebhardt HA, Magrini V, Mardis ER, Sahinalp SC, Unrau PJ. Genome Res. 2008 Apr;18(4):571-84. doi: 10.1101/gr.6897308. Epub 2008 Mar 6. PMID: 18323537. Free PMC article.
