Database (Oxford). 2016 Jun 23;2016:baw093.
doi: 10.1093/database/baw093. Print 2016.

The Ensembl gene annotation system

Bronwen L Aken et al. Database (Oxford).

Abstract

The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail.

Database URL: http://www.ensembl.org/index.html

Figures

Figure 1.
The Ensembl Genebuild workflow for annotating genes. The first phase of the annotation process is the Genome Preparation stage, which prepares the genome for gene annotation. The second phase is the Protein-coding Model Building stage, consisting of the Similarity, Targeted and RNA-seq pipelines. This generates a large set of potential protein-coding transcript models by aligning biological sequences to the genome and then inferring transcript models (exon–intron structures) using the alignments. Noncoding genes are annotated separately. Usually, the final phase is the Model Filtering stage. This involves sorting through the potential coding transcript models and filtering out those that are not well supported. Pseudogenes are then annotated and the noncoding RNA genes are incorporated to create the Ensembl gene set, which is then cross-referenced with external data sources. For some species (human, mouse, rat, zebrafish and pig) the HAVANA group produces manually curated gene sets. These annotations are merged with our Ensembl gene set to produce the final merged gene set. In the case of mouse and human, the merged sets comprise the GENCODE sets of genes.
Figure 2.
The genome assembly. Vertebrate genome assemblies usually comprise a number of possible layers of information. In most cases, sequenced reads will be assembled into contigs. Contigs are assembled into scaffolds based on linkage data (e.g. paired reads, or markers), and these scaffolds may be assembled to produce chromosomes.
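
The layering in Figure 2 can be pictured as nested coordinate systems. The sketch below is only an illustration of that idea, not Ensembl's implementation: the `Placement` class, the `PLACEMENTS` table, the sequence names and the offsets are all hypothetical, and it assumes every piece lies on the forward strand of its parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    parent: str  # name of the sequence this piece is placed on
    offset: int  # 0-based start of the piece within its parent


# Hypothetical layout: contigs sit on a scaffold, the scaffold on a chromosome.
PLACEMENTS = {
    "contig_A": Placement("scaffold_1", 0),
    "contig_B": Placement("scaffold_1", 5_000),
    "scaffold_1": Placement("chr1", 1_000_000),
}


def to_top_level(name, position):
    """Walk up the assembly layers until a sequence with no parent is reached.

    Simplifying assumption: all pieces are in forward orientation, so each
    step only adds the piece's offset within its parent.
    """
    while name in PLACEMENTS:
        placement = PLACEMENTS[name]
        position += placement.offset
        name = placement.parent
    return name, position
```

Under these assumed offsets, `to_top_level("contig_B", 120)` walks contig to scaffold to chromosome and reports the position in chromosome coordinates. A sequence that has no entry in `PLACEMENTS` is treated as already top-level, which covers assemblies that stop at the scaffold layer.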
Figure 3.
Projection of human FGF10 to alpaca. The FGF10 gene in alpaca was annotated by aligning the human and alpaca assemblies using BLASTz, and then projecting (copying) the human gene onto the alpaca genome. A novel structure, GeneScaffold_2975, was generated in the alpaca assembly by bringing together the shorter scaffolds that aligned to the human region containing the FGF10 gene.
Figure 4.
Sample transcript models with supporting evidence for untranslated regions (UTRs). This figure shows sample transcript models from HAVANA (yellow) and Ensembl (red) aligned with supporting evidence from cDNAs (green), ESTs (purple) and proteins (orange). Darker colors in the alignments correspond with exons. Unfilled boxes at the ends of the transcripts represent UTRs. Support for the UTRs comes from the aligned cDNAs and ESTs but not from the proteins.
Figure 5.
LayerAnnotation method. Candidate transcript models produced by each of the model-building pipelines are assigned varying levels of priority. In this example, models produced by the Targeted pipeline (which uses same-species protein data) are placed in Layer 1 and are therefore given preference over models with overlapping exons from the other model-building pipelines. Models produced using RNA-seq data are placed in Layer 2 and are given priority over those produced by the Similarity pipeline (which uses protein data from other species) in Layer 3. Final models indicate those selected for the final Ensembl gene set. (A) Candidate transcript models were produced by three model-building pipelines. The final protein-coding models were selected from Layer 1. Untranslated regions (unfilled boxes) were added from an RNA-seq model in Layer 2. The two transcript models will later be collapsed into a single gene model. (B) Layer 1 contains no model that overlaps with the model in Layer 2, and so the model in Layer 2 is the final model. (C) Layer 1 and Layer 2 contain no models that overlap with that in Layer 3, so the model in Layer 3 is selected as the final one.
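
The priority rule in Figure 5 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Ensembl code: the `Model` class, `exons_overlap` and `select_by_layer` are hypothetical names, exon coordinates are inclusive, a model is dropped only when it overlaps an already selected model from a higher-priority layer, and the UTR extension shown in panel A is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    layer: int    # 1 = highest priority (e.g. Targeted), 3 = lowest
    exons: tuple  # tuple of (start, end) pairs, inclusive coordinates


def exons_overlap(a, b):
    """True if any exon of model a overlaps any exon of model b."""
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def select_by_layer(models):
    """Keep a model unless a selected model from a higher-priority layer
    overlaps it. Models in the same layer never exclude each other."""
    selected = []
    for model in sorted(models, key=lambda m: m.layer):
        blocked = any(kept.layer < model.layer and exons_overlap(kept, model)
                      for kept in selected)
        if not blocked:
            selected.append(model)
    return selected


# Hypothetical candidates mirroring panels A, B and C.
candidates = [
    Model("similarity_1", 3, ((100, 200), (300, 400))),
    Model("targeted_1", 1, ((100, 200), (300, 400))),
    Model("rnaseq_1", 2, ((150, 200), (300, 450))),
    Model("rnaseq_2", 2, ((1000, 1100),)),
    Model("similarity_2", 3, ((5000, 5100),)),
]
final_names = [m.name for m in select_by_layer(candidates)]
```

With these made-up coordinates, the Layer 1 model wins at the first locus (panel A), the Layer 2 model stands alone at the second (panel B), and the Layer 3 model is kept at the third because nothing of higher priority overlaps it (panel C).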
Figure 6.
Merging gene and transcript models. For both Ensembl and HAVANA models, transcripts with overlapping exons are grouped together into genes. (A) If the intron–exon boundaries, excluding UTRs, of a transcript from HAVANA completely match those of one from Ensembl the result is a merged transcript model, which is always based on the HAVANA annotation. If the intron–exon boundaries do not completely match then the two models are treated as separate transcripts belonging to the same gene. (B) Exons for a HAVANA gene overlap with those for an Ensembl gene. All transcripts are grouped together in the same merged gene. The intron–exon boundaries for one HAVANA and one Ensembl transcript match perfectly so they are merged to create the merged transcript shown in yellow. (C) Exons for Ensembl and HAVANA transcripts overlap but there are no transcripts with complete matching intron–exon boundaries. We still group the transcripts together into a merged gene but no transcripts are merged.
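
The matching rule in Figure 6 can be sketched as a comparison of splice junctions. The code below is a hedged illustration, not the actual merge pipeline: `intron_boundaries` and `merge_transcripts` are hypothetical helpers, the inputs are assumed to be coding exons only (UTRs already removed), and the transcripts are assumed to have been grouped into one gene by exon overlap beforehand.

```python
def intron_boundaries(coding_exons):
    """Return the (exon end, next exon start) pairs flanking each intron."""
    ordered = sorted(coding_exons)
    return tuple((left[1], right[0])
                 for left, right in zip(ordered, ordered[1:]))


def merge_transcripts(havana, ensembl):
    """Pair HAVANA and Ensembl transcripts of one gene.

    havana, ensembl: dicts mapping a transcript name to a list of coding
    exon (start, end) tuples. A HAVANA transcript is merged with the first
    unused Ensembl transcript whose intron boundaries are identical; the
    merged record lists the HAVANA name first because the merged model is
    based on the HAVANA annotation. Simplification: single-exon transcripts
    have no introns, so they compare equal here.
    """
    result, used = [], set()
    for h_name, h_exons in havana.items():
        match = next(
            (e_name for e_name, e_exons in ensembl.items()
             if e_name not in used
             and intron_boundaries(e_exons) == intron_boundaries(h_exons)),
            None,
        )
        if match is None:
            result.append(("havana", h_name))
        else:
            used.add(match)
            result.append(("merged", h_name, match))
    # Ensembl transcripts without a match stay as separate transcripts.
    result.extend(("ensembl", e) for e in ensembl if e not in used)
    return result


# Hypothetical gene resembling panel B: one perfect match, two leftovers.
havana_gene = {"H1": [(100, 200), (300, 400)], "H2": [(100, 200), (350, 400)]}
ensembl_gene = {"E1": [(100, 200), (300, 400)], "E2": [(100, 250), (300, 400)]}
merged_gene = merge_transcripts(havana_gene, ensembl_gene)
```

In this example H1 and E1 share every intron boundary and are merged, while H2 and E2 remain separate transcripts within the same merged gene, as in panel B. If no pair matched, every transcript would be kept unmerged, which corresponds to panel C.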
Figure 7.
Annotation of patches. (A) Currently, we have two different types of patches: fix patches and novel patches. Both types are anchored to the assembly by shared sequence. Fix patches become part of the next major version of the assembly while novel patches remain as alternative sequence. (B) When annotating a novel patch, we first project gene models from the reference assembly. In this example, the HAVANA (red) and merged (yellow) genes are copied to the patch sequence. The Ensembl gene (blue) is not copied because the underlying genomic DNA is too different between the chromosome and the patch to enable the projection process. After projection, a patch will be annotated fully using the Ensembl annotation pipeline. In this case, two new gene models (green) have been annotated on the novel patch.
