Genome Biol. 2006;7 Suppl 1(Suppl 1):S4.1-9.
doi: 10.1186/gb-2006-7-s1-s4. Epub 2006 Aug 7.

GENCODE: producing a reference annotation for ENCODE

Jennifer Harrow et al. Genome Biol. 2006.

Abstract

Background: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results.

Results: The GENCODE gene features are divided into eight different categories of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions.

Conclusion: In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of the GENCODE annotation with RefSeq and ENSEMBL shows that only 40% of GENCODE exons are contained within the two sets, which reflects the high number of annotated alternative splice forms with unique exons. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP, and the GENCODE collaboration is continuing to refine its annotation of 1% of the human genome with the aid of experimental validation.
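The proportions behind these statements can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the abstract; the variable names are illustrative, and treating each sequenced RACE product as corresponding to one tested locus is an assumption made here, not something the abstract states.

```python
# Counts quoted in the abstract (variable names are illustrative only).
total_loci = 487
coding_loci = 434
coding_loci_tested = 420
race_products_sequenced = 229
validated_novel = 31
validated_putative = 15
exon_pairs_tested = 1215
exon_pairs_validated = 14


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Share of annotated loci that are coding.
coding_share = percent(coding_loci, total_loci)

# ASSUMPTION: one sequenced RACE product per tested locus.
race_share = percent(race_products_sequenced, coding_loci_tested)

# Fraction of tested exon pairs outside the annotation that were validated.
novel_exon_rate = percent(exon_pairs_validated, exon_pairs_tested)

# Validated loci without coding evidence: novel plus putative transcripts.
validated_noncoding = validated_novel + validated_putative

print(f"coding share of loci:       {coding_share:.1f}%")
print(f"RACE products / loci tested: {race_share:.1f}%")
print(f"validated novel exon pairs:  {novel_exon_rate:.1f}%")
print(f"validated non-coding loci:   {validated_noncoding}")
```

Under that assumption the RACE figure comes out a little above one half, which is consistent with the "over 50%" statement, and the low validation rate for exon pairs outside the annotation is what the abstract uses to argue that the annotation is comprehensive.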

Figures

Figure 1
The GENCODE pipeline. This schematic diagram shows the flow of data between the three groups involved in the GENCODE consortium (HAVANA, IMIM and Geneva) to produce an experimentally verified annotation of the ENCODE region.
Figure 2
Experimental validation of HAVANA annotation. 'Known' and 'Novel_CDS' were submitted to 5' RACE, and 'Novel transcript' and 'Putative' loci were submitted to RT-PCR on all their exon junctions, followed by bi-directional RACE. Several steps of reannotation were performed during the process of experimental verification: the figure shows the update of the annotation between the first release in April 2005 and the release from October 2005.
Figure 3
Comparison of GENCODE transcript annotation with RefSeq and ENSEMBL. The exact agreement between GENCODE and RefSeq and GENCODE and ENSEMBL exons, introns, and nucleotides (NT) for the full transcripts or only the coding parts of the transcripts (CDS) is represented: in blue is the fraction found only in GENCODE, in green the fraction common between GENCODE and the other set (RefSeq or ENSEMBL) and in red the fraction found only in the other set (RefSeq or ENSEMBL) but not in GENCODE. The RefSeq set only contained the curated transcripts tagged with the NM prefix.
Figure 4
Comparison of GENCODE annotation with automated gene prediction methods. Viewed in Fmap of Acedb. Panel A shows the MAPK1 gene in ENr221. The GENCODE annotated gene structure is represented in green and red, the circled region highlights the different first exon identified by Pairagon (dark pink/blue) and the expanded region shows tiny introns (indicated by arrows) predicted by Ensembl (orange/red). Panel B shows the TRIM22 locus in ENm009. The structure predicted by Pairagon differs from the GENCODE structure and incorporates an unprocessed pseudogene as the final exon (circled). Panel C shows the human ANKRD43 locus in ENr221 for which AceView (light pink/blue), Pairagon and Ensembl all predict a shorter CDS than GENCODE. C ii shows the mouse ANKRD43 locus in which the upstream ATG is conserved. Panel D shows the GENCODE unprocessed pseudogene locus AC087380.14 at which Ensembl predicts a coding gene. The arrow indicates a tiny intron introduced into the prediction to splice around an in-frame premature stop codon. Panel E shows the IFNAR2 locus in ENm005 with GENCODE coding (red/green) and non-coding (all red) variants and AceView predictions. The AceView CDSs differ from GENCODE in several respects: arrow 'a' indicates several transcripts that have their CDS extended to the start of the prediction upstream of the GENCODE CDS start; arrow 'b' indicates a CDS starting in exon 5 despite the presence of an upstream ATG, which would seem to preclude (re-)initiation from this site; and arrow 'c' indicates a predicted stop codon in the fourth from last exon, which would be likely to make this transcript a target for nonsense-mediated decay (NMD). GENCODE annotation incorporates all these variants but keeps them as transcripts because CDSs cannot be assigned with certainty. Panel F shows part of the olfactory receptor (OR) cluster in ENm009. Here Pairagon predicts a coding gene at the pseudogene locus OR52Z1P and a multi-exon gene that links separate OR loci (pseudogene locus OR51A1P, coding loci OR52A1 and OR52A5), indicated by arrows.
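
The three-way breakdown described in the Figure 3 caption (found only in GENCODE, common to both sets, found only in the other set) can be illustrated with plain set operations. This is a minimal sketch, not the authors' method: the function name, the representation of an exon as a (chromosome, start, end, strand) tuple, the toy coordinates, and the choice of the union of both sets as the denominator are all assumptions made for illustration.

```python
def agreement_fractions(gencode, other):
    """Split the union of two feature sets into three exact-match fractions.

    Features are compared for exact equality, so two exons count as common
    only when every coordinate matches.
    ASSUMPTION: the denominator is the union of both sets.
    """
    gencode, other = set(gencode), set(other)
    union = gencode | other
    if not union:
        raise ValueError("both feature sets are empty")
    n = len(union)
    return {
        "gencode_only": len(gencode - other) / n,
        "common": len(gencode & other) / n,
        "other_only": len(other - gencode) / n,
    }


# Hypothetical exons as (chromosome, start, end, strand); not real data.
gencode_exons = {
    ("chr22", 100, 200, "+"),
    ("chr22", 300, 400, "+"),
    ("chr22", 500, 650, "+"),
    ("chr22", 700, 800, "+"),
}
other_exons = {
    ("chr22", 100, 200, "+"),   # identical boundaries: counted as common
    ("chr22", 500, 640, "+"),   # differs by 10 bases: not an exact match
}

fractions = agreement_fractions(gencode_exons, other_exons)
for label, value in fractions.items():
    print(f"{label}: {value:.2f}")
```

Because matching is exact, an exon that differs at a single boundary lands in both the "only" categories rather than in the common one, which is one reason alternative splice forms with unique exons lower the apparent overlap between annotation sets.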

Cited by

  • Integrative annotation of chromatin elements from ENCODE data.
    Hoffman MM, Ernst J, Wilder SP, Kundaje A, Harris RS, Libbrecht M, Giardine B, Ellenbogen PM, Bilmes JA, Birney E, Hardison RC, Dunham I, Kellis M, Noble WS. Nucleic Acids Res. 2013 Jan;41(2):827-41. doi: 10.1093/nar/gks1284. Epub 2012 Dec 5. PMID: 23221638 Free PMC article.
  • mRNA profiling reveals determinants of trastuzumab efficiency in HER2-positive breast cancer.
    von der Heyde S, Wagner S, Czerny A, Nietert M, Ludewig F, Salinas-Riester G, Arlt D, Beißbarth T. PLoS One. 2015 Feb 24;10(2):e0117818. doi: 10.1371/journal.pone.0117818. eCollection 2015. PMID: 25710561 Free PMC article.
  • The GENCODE pseudogene resource.
    Pei B, Sisu C, Frankish A, Howald C, Habegger L, Mu XJ, Harte R, Balasubramanian S, Tanzer A, Diekhans M, Reymond A, Hubbard TJ, Harrow J, Gerstein MB. Genome Biol. 2012 Sep 26;13(9):R51. doi: 10.1186/gb-2012-13-9-r51. PMID: 22951037 Free PMC article.
  • Noncoding RNAs in apoptosis: identification and function.
    Tüncel Ö, Kara M, Yaylak B, Erdoğan İ, Akgül B. Turk J Biol. 2021 Nov 14;46(1):1-40. doi: 10.3906/biy-2109-35. eCollection 2022. PMID: 37533667 Free PMC article. Review.
  • HCV-Induced Epigenetic Changes Associated With Liver Cancer Risk Persist After Sustained Virologic Response.
    Hamdane N, Jühling F, Crouchet E, El Saghire H, Thumann C, Oudot MA, Bandiera S, Saviano A, Ponsolles C, Roca Suarez AA, Li S, Fujiwara N, Ono A, Davidson I, Bardeesy N, Schmidl C, Bock C, Schuster C, Lupberger J, Habersetzer F, Doffoël M, Piardi T, Sommacale D, Imamura M, Uchida T, Ohdan H, Aikata H, Chayama K, Boldanova T, Pessaux P, Fuchs BC, Hoshida Y, Zeisel MB, Duong FHT, Baumert TF. Gastroenterology. 2019 Jun;156(8):2313-2329.e7. doi: 10.1053/j.gastro.2019.02.038. Epub 2019 Mar 2. PMID: 30836093 Free PMC article.
