Transcriptome annotation using tandem SAGE tags

Eric Rivals¹, Anthony Boureux, Mireille Lejeune, Florence Ottones, Oscar Pecharromàn Pérez, Jorma Tarhio, Fabien Pierrat, Florence Ruffle, Thérèse Commes, Jacques Marti

PMID: 17709346
PMCID: PMC2034470
DOI: 10.1093/nar/gkm495

Eric Rivals et al. Nucleic Acids Res. 2007;35(17):e108. doi: 10.1093/nar/gkm495. Epub 2007 Aug 20.

Affiliation

¹ Laboratoire d'Informatique, de Robotique et de Microélectronique, UMR 5506 CNRS-Université de Montpellier II, 161 rue Ada, 34392 Montpellier 05, France.

Abstract

Analysis of several million expressed gene signatures (tags) has revealed an increasing number of distinct sequences, far exceeding the number of annotated genes in mammalian genomes. Serial analysis of gene expression (SAGE) can reveal new poly(A) RNAs transcribed from previously unrecognized chromosomal regions. However, conventional SAGE tags are too short to unambiguously identify unique sites in large genomes. Here, we design a novel strategy with tags anchored on two different restriction sites of cDNAs. New transcripts are then tentatively defined by the two SAGE tags in tandem and by the spanning sequence read on the genome between these tagged sites. Having developed a new algorithm to locate these tag-delimited genomic sequences (TDGS), we first validated its capacity to recognize known genes and its ability to reveal new transcripts with two SAGE libraries built in parallel from a single RNA sample. Our algorithm proves fast enough to apply this strategy at a large scale. We then collected and processed the complete sets of human SAGE tags to predict as yet unknown transcripts. A cross-validation with tiling array data shows that 47% of these TDGS overlap transcriptionally active regions. Our method provides a new and complementary approach for complex transcriptome annotation.
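The tandem-tag idea can be illustrated with a minimal sketch. It is not the authors' algorithm, only a simplified reading of the pairing procedure summarized in Figure 1: for each site that carries an experimental C-tag, take the closest upstream site (since the previous C-tag) that carries an experimental G-tag, and report the genomic sequence between them. The anchor motifs (`GATC` for G-tags, `CATG` for C-tags), the 10-nt tag length, the `max_span` cutoff, the single-strand scan and all function names are assumptions made for illustration.

```python
import re
from typing import NamedTuple


class TDGS(NamedTuple):
    g_tag: str
    c_tag: str
    start: int      # 0-based start of the G anchor site
    end: int        # end (exclusive) of the C-tag
    sequence: str   # genomic sequence spanning both tagged sites


def tag_sites(genome, anchor, tag_len, observed):
    """List (position, tag) for anchor sites whose downstream tag was observed."""
    sites = []
    # A lookahead keeps matches zero-width, so overlapping sites are not missed.
    for m in re.finditer(f"(?={re.escape(anchor)})", genome):
        tag_start = m.start() + len(anchor)
        tag = genome[tag_start:tag_start + tag_len]
        if len(tag) == tag_len and tag in observed:
            sites.append((m.start(), tag))
    return sites


def find_tdgs(genome, g_tags, c_tags, g_anchor="GATC", c_anchor="CATG",
              tag_len=10, max_span=5000):
    """Pair each experimental C-tag site with the nearest upstream G-tag site.

    Anchors, tag_len and max_span are hypothetical defaults, not values
    taken from the paper.
    """
    g_sites = tag_sites(genome, g_anchor, tag_len, g_tags)
    c_sites = tag_sites(genome, c_anchor, tag_len, c_tags)
    pairs = []
    gi = 0
    for c_pos, c_tag in c_sites:
        nearest_g = None
        # Only G sites located after the previous C site are considered;
        # the last one seen before c_pos gives the shortest G-C pair.
        while gi < len(g_sites) and g_sites[gi][0] < c_pos:
            nearest_g = g_sites[gi]
            gi += 1
        if nearest_g is None or c_pos - nearest_g[0] > max_span:
            continue
        g_pos, g_tag = nearest_g
        end = c_pos + len(c_anchor) + tag_len
        pairs.append(TDGS(g_tag, c_tag, g_pos, end, genome[g_pos:end]))
    return pairs
```

Potential tag sequences that are absent from the experimental tag sets are dropped in `tag_sites`, which mirrors the skipping of unmatched tags described in the Figure 1 caption. A realistic implementation would also scan the reverse strand and use an index rather than a regular-expression scan over the whole genome.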

Figures

**Figure 1.**
(A) Schematic of the search for tag-delimited genomic sequences (TDGS, double arrows). Upper part: procedure for assembling 5′G-3′C pairs. Starting from the previously identified C-tag (n − 1), the program searches for the next site on which an experimental C-tag (n) can be positioned (star 1). The genome sequence is then scanned for G-tags (x − 1, x), and the scan stops (star 2) when the shortest G–C pair is found. In the search for the next pair (stars 3 and 4), the potential G-tag sequence (z) is skipped because it does not match any G-tag in the experimental dataset. Lower part (B, C and D): illustration of the various causes of success and failure in assembling TDGS. Numerical values in B and C are taken from the study of 489 well-annotated sequences identified in the macrophage SAGE libraries. Cases schematized in D are detailed (d) in Figures 3 and 4.

**Figure 2.**
Number of distinct C-tags (left y-axis, black squares) in five consecutive classes of occurrence, i.e. abundance (x-axis), from 1 to >2048 counts summed over UniSAGE, and percentage of tags matching RefSeq annotated sequences (right y-axis, gray diamonds).

**Figure 3.**
Characterization and annotation of validated TDGS. Alignments of TDGS #20 and #212 to the UCSC human genome browser. For RT–PCR validation, macrophage poly(A)+ RNA was extracted from MDM and the cDNA was synthesized using mRNA and an oligo-dT primer. (A) TDGS #20 is an example of a class 2 transcript localized near the coding region of CDH23. For PCR, one primer was designed in the 3′-end of CDH23 and the other in TDGS #20. The existence of this new variant transcript was confirmed in macrophages by sequencing. (B) TDGS #212 is an example of a class 3 transcript. Experiments without reverse transcription (A, C) and with DNase treatment (C, D) were performed to detect DNA contamination. For transcript validation, a first PCR was performed with primer pairs designed on TDGS #212 and a second one with primers located in the 3′-end of EST EB10260 and in TDGS #212, respectively. The 3′-end of the transcript was validated by 3′RACE (3′RACER kit, Invitrogen, France). The sequenced PCR products validated the existence of a transcript in inverted orientation relative to the Cathepsin B gene (CTSB, NM_0019082).

**Figure 4.**
Characterization and annotation of validated TDGS. Alignments of TDGS #5, #6 and #54 to the UCSC human genome browser. For RT–PCR validation, macrophage poly(A)⁺ RNA was extracted from MDM and the cDNA was synthesized using mRNA and an oligo-dT primer. (A and B) TDGS #5 (376 bp) maps to chromosome 1 and corresponds to the sequence of a new full-length cDNA registered in GenBank as CR601947. TDGS #6 shares the same C-tag as TDGS #5 (dotted boxes), but its sequence is longer (821 bp) because its G-tag is located 3′ of that of TDGS #5. (A and C) In addition, TDGS #54 reveals another site, in a region of chromosome 14. The same conditions as described in Figure 4 were used for PCR validation. RT–PCR analysis followed by sequence checking confirmed the existence of this new transcript. The sequence on chromosome 14 matched a sequence registered in the Affymetrix dataset hosted at UCSC (Affy Txn Phase2).

**Figure 5.**
Length of the TDGS assembled from the whole UniSAGE data. Frequency of TDGS length (in base pairs) for well-annotated TDGS (R1 TDGS) (A) and for unknown TDGS (unmatched TDGS) (B). Each point represents a TDGS (in gray) (∼99 and 92.5% of TDGS are shown, respectively), and the curve (in black) is a power regression curve.
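For readers unfamiliar with power regression as mentioned in the Figure 5 caption, the sketch below shows one common way to fit a curve of the form y = a·x^b: ordinary least squares on log-transformed values. The caption does not say how the published curve was fitted, so this method, the function name `fit_power_law` and the input data in the usage note are illustrative assumptions, not values from the paper.

```python
import math


def fit_power_law(lengths, freqs):
    """Fit freqs ~ a * lengths**b by least squares in log-log space.

    All inputs must be strictly positive. Returns (a, b).
    """
    xs = [math.log(x) for x in lengths]
    ys = [math.log(y) for y in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                         # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)     # intercept back-transformed = scale
    return a, b
```

Called on synthetic points generated from a known curve, for example `fit_power_law([50, 100, 200], [500 * x ** -1.5 for x in (50, 100, 200)])`, the function recovers the scale and exponent up to floating-point error. On real length histograms the log transform gives more weight to low-frequency bins, which is one reason a direct non-linear fit is sometimes preferred.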

Cited by

Combining DGE and RNA-sequencing data to identify new polyA+ non-coding transcripts in the human genome.
Philippe N, Bou Samra E, Boureux A, Mancheron A, Rufflé F, Bai Q, De Vos J, Rivals E, Commes T. Nucleic Acids Res. 2014 Mar;42(5):2820-32. doi: 10.1093/nar/gkt1300. Epub 2013 Dec 18. PMID: 24357408.
Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity.
Philippe N, Boureux A, Bréhélin L, Tarhio J, Commes T, Rivals E. Nucleic Acids Res. 2009 Aug;37(15):e104. doi: 10.1093/nar/gkp492. Epub 2009 Jun 16. PMID: 19531739.
