Comparative Study
Genome Res. 2004 Mar;14(3):463-71.
doi: 10.1101/gr.1481104. Epub 2004 Feb 12.

Numerous novel annotations of the human genome sequence supported by a 5'-end-enriched cDNA collection

Betina M Porcel et al. Genome Res. 2004 Mar.

Abstract

A collection of 90,000 human cDNA clones, generated to increase the fraction of "full-length" cDNAs available, was analyzed by sequence alignment against the human genome assembly. Five hundred fifty-two gene models not found in LocusLink, with coding regions of at least 300 bp, were defined by using this collection. The exon composition proposed for the novel genes showed an average of 4.7 exons per gene. In 20% of the cases, at least half of the exons predicted for new genes coincided with evolutionarily conserved regions defined by sequence comparisons with the pufferfish Tetraodon nigroviridis. Among this subset, CpG islands were observed at the 5' end of 75%. In-frame stop codons upstream of the initiator ATG were present in 49% of the new genes, and 16% contained a coding region comprising at least 50% of the cDNA sequence. This cDNA resource also provided candidate small protein-coding genes, which are usually not included in genome annotations. In addition, analysis of a sample from this cDNA collection indicates that approximately 380 gene models described in LocusLink could be extended at their 5' end by at least one new exon. Finally, this cDNA resource provided experimental support for annotations based exclusively on predictions, and thus represents a resource that substantially improves the human genome annotation.
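The coding-region criteria named in the abstract (a CDS of at least 300 bp, an in-frame stop codon upstream of the initiator ATG, and a CDS covering at least 50% of the cDNA) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the forward-strand-only ORF scan, the choice of the longest ATG-to-stop ORF, and the decision to count the stop codon in the CDS length are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Hypothetical helper: scans the three forward frames only and returns
    None when no complete ORF is found. `end` is exclusive and includes
    the stop codon (an assumption, not taken from the paper).
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG since the previous stop in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best


def has_upstream_inframe_stop(seq, start):
    """True if a stop codon lies upstream of `start` in the same frame."""
    seq = seq.upper()
    for i in range(start - 3, -1, -3):
        if seq[i:i + 3] in STOP_CODONS:
            return True
    return False


def annotate_cdna(seq, min_cds_bp=300, min_fraction=0.5):
    """Summarize the abstract's three coding-region criteria for one cDNA."""
    orf = find_longest_orf(seq)
    if orf is None:
        return None
    start, end = orf
    cds_len = end - start
    return {
        "cds_start": start,
        "cds_end": end,
        "cds_length": cds_len,
        "meets_min_length": cds_len >= min_cds_bp,
        "upstream_inframe_stop": has_upstream_inframe_stop(seq, start),
        "cds_fraction_ok": cds_len / len(seq) >= min_fraction,
    }


# Toy cDNA (made-up sequence): stop codon, ATG, 100 alanine codons, stop, 2 bp tail.
toy = "TAA" + "ATG" + "GCT" * 100 + "TGA" + "CC"
result = annotate_cdna(toy)
```

On the toy sequence all three criteria hold. A real analysis would also have to handle both strands, ambiguous bases, and truncated 5' ends, none of which this sketch attempts.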

Figures

Figure 1
Schematic outline of the strategy used to confirm the candidate genes. Numbers between brackets correspond to gene models and/or proposed gene models in each category. Description of the strategy followed for manual curation is shown in Results.
Figure 2
(A) Example of a new gene supported by the CNSLT cDNA resource. Human curation of the identified gene models was performed by using a graphical interface. The transcript models of cDNA clones are represented by blue bars, whereas the model proposed for the gene is represented at the bottom of the figure by red bars (for a detailed explanation, see Results). An empty arrow indicates the CNSLT cDNA clone used for the construction of the proposed gene model. Filled arrows indicate the cDNA clones assembled on the genome. CpG islands are represented by green boxes; human–Tetraodon ecores, by orange boxes. Coding regions with an in-frame stop codon upstream of an initiator ATG are represented by magenta bars. When such stop codons could not be identified, the coding regions are represented by pale turquoise bars. Annotations found for this proposed gene model are listed in the boxed part of the figure. PROT_100AA indicates CDS of at least 300 bp; CDS_SHORT, coding region spanning <50% of the model sequence; and ALT_SP, alternative splicing.
(B) Example of a putative gene. The transcript models of cDNA clones are represented by blue bars, whereas the model proposed for the gene is represented at the bottom of the figure by red bars (for a detailed explanation, see Results). A human–Tetraodon ecore, which supports the first exon of the proposed gene model, is represented by an orange box. A coding region of 64 amino acids with an in-frame stop codon upstream of an initiator ATG is represented by magenta bars. Matches to the mouse genome are indicated by black bars. Annotations found for this proposed gene model are listed in the boxed part of the figure. PROT_LESS_100AA indicates CDS <300 bp.
(C) Example of an extension of an already annotated gene, using the CNSLT cDNA resource. The transcript models of cDNA clones or RefSeq and GenBank transcripts are represented by filled blue bars. CpG islands are represented by green boxes, and human–Tetraodon ecores by yellow boxes. The filled arrow points to the CNSLT cDNA clone extending the annotated gene. (Inset, right) The exons predicted by the alignment of the virtual cDNA sequence and the human genome assembly, using the sim4 algorithm (for detailed explanation, see text). Color code for coding regions: stop-ATG-stop, magenta bars; ATG-stop, pale turquoise bars; and stop-ATG, white bars. (Inset, left) The extension of the CDS from BC027478 (red box) using the CNSLT cDNA resource.
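The annotation labels defined in the Figure 2 caption follow simple thresholds, which the short Python sketch below makes explicit. Only the label names and their thresholds come from the caption; the function name, its parameters, and the 1,000-bp model length in the usage line are hypothetical, and how the authors actually assigned these labels is not described here.

```python
def label_gene_model(cds_bp, model_bp, alt_spliced=False):
    """Assign the caption's annotation labels to one gene model.

    cds_bp      -- length of the coding region in bp
    model_bp    -- length of the full gene model sequence in bp
    alt_spliced -- whether alternative splicing was observed (assumed input)
    """
    labels = []
    # PROT_100AA: CDS of at least 300 bp; PROT_LESS_100AA: CDS < 300 bp.
    labels.append("PROT_100AA" if cds_bp >= 300 else "PROT_LESS_100AA")
    # CDS_SHORT: coding region spanning < 50% of the model sequence.
    if cds_bp / model_bp < 0.5:
        labels.append("CDS_SHORT")
    # ALT_SP: alternative splicing.
    if alt_spliced:
        labels.append("ALT_SP")
    return labels


# A 64-amino-acid coding region (as in panel B) is 192 bp without the stop
# codon; the 1,000-bp model length is a made-up value for illustration.
panel_b_labels = label_gene_model(cds_bp=64 * 3, model_bp=1000)
```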
