Comparative Study

Genome Res. 2003 Jun;13(6B):1478-87. doi: 10.1101/gr.1060303.

CDS annotation in full-length cDNA sequence

Masaaki Furuno¹, Takeya Kasukawa, Rintaro Saito, Jun Adachi, Harukazu Suzuki, Richard Baldarelli, Yoshihide Hayashizaki, Yasushi Okazaki

Affiliation

¹ Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan.

PMID: 12819146
PMCID: PMC403693
DOI: 10.1101/gr.1060303

Masaaki Furuno et al. Genome Res. 2003 Jun.

Abstract

The identification of coding sequences (CDS) is an important step in the functional annotation of genes. CDS prediction for mammalian genes from genomic sequence is complicated by the vast abundance of intergenic sequence in the genome, and provides little information about how different parts of potential CDS regions are expressed. In contrast, mammalian gene CDS prediction from cDNA sequence offers obvious advantages, yet encounters a different set of complexities when performed on high-throughput cDNA (HTC) sequences, such as the set of 60,770 cDNAs isolated from full-length enriched libraries of the FANTOM2 project. We developed a CDS annotation strategy that uses a variety of different CDS prediction programs to annotate the CDS regions of FANTOM2 cDNAs. These include rsCDS, which uses sequence similarity to known proteins; ProCrest; Longest-ORF and Truncated-ORF, which are ab initio based predictors; and finally, DECODER and NCBI CDS predictor, which use a combination of both principles. Aided by graphical displays of these CDS prediction results in the context of other sequence similarity results for each cDNA, FANTOM2 CDS inspection by curators and follow-up quality control procedures resulted in high quality CDS predictions for a total of 14,345 FANTOM2 clones.
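
The ab initio idea behind a longest-ORF predictor can be illustrated with a short sketch. This is not the Longest-ORF program used in FANTOM2, whose details are not given here; it is a minimal illustration that assumes the cDNA is already in sense orientation, scans only the three forward frames, and requires an ATG start and an in-frame stop. The function name `longest_orf` and the 1-based inclusive coordinates are choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end, frame) of the longest ATG-to-stop ORF.

    Coordinates are 1-based and inclusive, and the stop codon is included.
    Only the forward strand is scanned (assumption: sense-oriented cDNA).
    Returns None when no complete ORF is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # 0-based index of the first ATG in the current open stretch
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if best is None or length > best[1] - best[0] + 1:
                    best = (start + 1, i + 3, frame)
                start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    # Toy sequence: ATG AAA TTT TAG embedded in frame 2.
    print(longest_orf("CCATGAAATTTTAGGG"))
```

A sketch like this would miss 5′- or 3′-truncated coding regions, which is presumably the gap a Truncated-ORF style predictor addresses, and it ignores sequencing errors such as frameshifts.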

Figures

**Figure 1**
Flowchart for gene name and CDS annotation steps during MATRICS. The 60,770 FANTOM2 clones were clustered into 33,409 transcriptional units (The FANTOM Consortium and the RIKEN Genome Exploration Research Group Phase I & II Team 2002). The gene name and CDS of representative clones from each TU were annotated by MATRICS curators. Of 17,200 protein-coding clones, 12,454 and 2128 were annotated as having a complete and partial CDS, respectively. The remaining 2618 clones have problems such as “immature,” “UTR,” or “unknown” (see Methods). Another 187 complete CDS and 50 partial CDS clones could not be translated because of low sequence quality.
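
The counts in this caption can be reconciled with the 14,345 clones reported in the abstract. The short check below assumes that the 187 complete and 50 partial untranslatable clones are subsets of the 12,454 and 2128 clones, respectively; the caption does not say so explicitly, but the arithmetic is consistent with that reading. Variable names are illustrative.

```python
# Counts taken from the Figure 1 caption.
protein_coding = 17_200
complete_cds = 12_454
partial_cds = 2_128
problem_clones = 2_618          # "immature", "UTR", or "unknown"
untranslatable_complete = 187   # low sequence quality
untranslatable_partial = 50     # low sequence quality

# The three annotation classes account for every protein-coding clone.
annotated = complete_cds + partial_cds
assert annotated + problem_clones == protein_coding

# Assumption: the untranslatable clones are subtracted from the annotated set.
translatable = annotated - untranslatable_complete - untranslatable_partial
print(translatable)
```

Under that assumption, `translatable` equals the total of high quality CDS predictions given in the abstract.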

**Figure 2**
Correlation between CDS prediction programs. The frequencies with which different CDS program combinations predicted identically to (colored boxes) or differently from (open boxes) human curated CDS regions are shown. The number of instances for the top three most frequently observed patterns is indicated in bold. The instances of patterns characterized by only a single CDS program predicting identically to human curation are shown in boxed numbers.

**Figure 3**
Annotation interface for clone 4930415J21 as an example of CDS correction by the NMD rule for termination codon placement. The image shows sequence similarity results for this FANTOM2 clone with protein, predicted CDS, ESTs, genome mapping, and sequence quality. In this case, CDS candidates by rsCDS, ProCrest, DECODER, NCBI CDS predictor, and Longest-ORF are shown from *top* to *bottom.* The CDS of this sequence runs from 93 (colored diamond) to 1731 (colored triangle). The TAG stop codon at 1311 (open triangle) is assumed to be false because of low sequence quality. The 3′-most splice junction is located at 1570 (open diamond). The annotated results for gene name and CDS status are shown at the *bottom.*

**Figure 4**
The distances between stop codons and last splice junctions. A histogram of the positions of stop codons from last splice junctions for 10,789 clones that were spliced and annotated as having a complete or 5′-truncated CDS is shown. The average number of bases between the positions of the last splice junction and the stop codon for these sequences is 208 bases. For these sequences, it is most common to find the stop codon within 400 bases downstream from the last splice junction.

**Figure 5**
Annotation interface for clone B930030L03, a potential target for NMD-mediated instability. The CDS region of this clone was annotated from positions 401 (colored diamond) to 1078 (colored triangle), based on similarity with cytokine inducible SH-2-containing protein 3. The CDS for this clone violates the NMD rule for termination codons; the stop codon position (1078, colored triangle) is located 191 bases upstream from the last splice junction (1269, open diamond); thus, such transcripts in the cell may be targets for degradation by the NMD system.
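
The positional check described in this caption can be written as a small function. This is a hedged sketch, not the curation rule as implemented in MATRICS: the 50-base threshold is an assumption based on the commonly cited rule that a stop codon more than roughly 50 to 55 nucleotides upstream of the last exon-exon junction can trigger NMD, and the record itself does not state the cutoff used. The function name and parameter names are hypothetical.

```python
def violates_nmd_rule(stop_pos, last_junction_pos, threshold=50):
    """Return True if the stop codon lies more than `threshold` bases
    upstream of the last splice junction.

    Positions are cDNA coordinates as given in the figure captions.
    The default threshold of 50 is an assumption, not a value from the source.
    """
    return last_junction_pos - stop_pos > threshold


if __name__ == "__main__":
    # Clone B930030L03 (Figure 5): stop at 1078, last junction at 1269.
    print(1269 - 1078, violates_nmd_rule(1078, 1269))
    # Clone 4930415J21 (Figure 3): stop at 1731 is downstream of junction 1570.
    print(violates_nmd_rule(1731, 1570))
```

With the caption's coordinates, the Figure 5 clone is flagged (191 bases upstream), whereas the Figure 3 clone, whose annotated stop lies downstream of the last junction, is not.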

**Figure 6**
Annotation interface for clone 1110012E09, Selenoprotein R cDNA. The SECIS motif located between base positions 446 and 509 (dotted line) promotes decoding of the OPAL codon at position 311 (open triangle) to selenocysteine. The CDS region was annotated from positions 27 (colored diamond) to 377 (colored triangle). The last splice junction (378, open diamond) colocates with the stop codon.

**Figure 7**
Annotation interface for clone C630041L24, an example of a small protein containing a signal peptide. The CDS region from positions 119 (colored diamond) to 406 (colored triangle) encodes a 96-amino-acid protein. The predicted signal peptide at the N terminus of the CDS indicates that this is a soluble/secreted protein (Grimmond et al. 2003).

References

1. Berry, M.J., Banu, L., Chen, Y.Y., Mandel, S.J., Kieffer, J.D., Harney, J.W., and Larsen, P.R. 1991. Recognition of UGA as a selenocysteine codon in type I deiodinase requires sequences in the 3′ untranslated region. Nature 353: 273-276.
2. The FANTOM Consortium and the RIKEN Genome Exploration Research Group Phase I & II Team. 2002. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420: 563-573.
3. Fukunishi, Y. and Hayashizaki, Y. 2001. Amino acid translation program for full-length cDNA sequences with frameshift errors. Physiol. Genom. 5: 81-87.
4. Grimmond, S.M., Miranda, K.C., Yuan, Z., Davis, M.J., Hume, D.A., Yagi, K., Tominaga, N., Bono, H., Hayashizaki, Y., Okazaki, Y., et al. 2003. The mouse secretome. Functional classification of the proteins secreted into the extracellular environment. Genome Res. (this issue).
5. Hentze, M.W. and Kulozik, A.E. 1999. A perfect message: RNA surveillance and nonsense-mediated decay. Cell 96: 307-310.

WEB SITE REFERENCES

1. http://fantom2.gsc.riken.go.jp/; FANTOM2.
2. http://fantom2.gsc.riken.go.jp/db/; FANTOM2 cDNA annotation database.
