The Ensembl gene annotation system

Bronwen L Aken¹, Sarah Ayling², Daniel Barrell³, Laura Clarke⁴, Valery Curwen⁵, Susan Fairley⁴, Julio Fernandez Banet⁶, Konstantinos Billis⁷, Carlos García Girón⁷, Thibaut Hourlier⁷, Kevin Howe⁴, Andreas Kähäri⁸, Felix Kokocinski⁵, Fergal J Martin⁷, Daniel N Murphy⁷, Rishi Nag⁷, Magali Ruffier⁴, Michael Schuster⁹, Y Amy Tang⁴, Jan-Hinnerk Vogel¹⁰, Simon White¹¹, Amonida Zadissa⁴, Paul Flicek⁷, Stephen M J Searle¹²

Affiliations

¹ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK. Correspondence: bronwen.aken@ebi.ac.uk, smjsearle@yahoo.co.uk.
² Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; Present address: The Genome Analysis Centre, Norwich Research Park, Norwich NR4 7UH, UK.
³ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; Eagle Genomics Ltd, Babraham Research Campus, Cambridge CB22 3AT, UK.
⁴ Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
⁵ Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
⁶ Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; Pfizer Inc, 10646 Science Center Dr, San Diego, CA 92121, USA.
⁷ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
⁸ Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; Institutionen för cell- och molekylärbiologi, Uppsala University, Husargatan 3, Uppsala 752 37, Sweden.
⁹ European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK; CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna A-1090, Austria.
¹⁰ Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; Genentech Inc, 1 DNA Way, South San Francisco, CA 94080, USA.
¹¹ Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; The Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA.
¹² Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK. Correspondence: bronwen.aken@ebi.ac.uk, smjsearle@yahoo.co.uk.

PMID: 27337980
PMCID: PMC4919035
DOI: 10.1093/database/baw093

Bronwen L Aken et al. Database (Oxford). 2016 Jun 23;2016:baw093. doi: 10.1093/database/baw093. Print 2016.

Abstract

The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail.

Database URL: http://www.ensembl.org/index.html

Figures

**Figure 1.**
The Ensembl Genebuild workflow for annotating genes. The first phase of the annotation process is the Genome Preparation stage, which prepares the genome for gene annotation. The second phase is the Protein-coding Model Building stage, consisting of the Similarity, Targeted and RNA-seq pipelines. This generates a large set of potential protein-coding transcript models by aligning biological sequences to the genome and then inferring transcript models (exon–intron structures) using the alignments. Noncoding genes are annotated separately. Usually, the final phase is the Model Filtering stage. This involves sorting through the potential coding transcript models and filtering out those that are not well supported. Pseudogenes are then annotated and the noncoding RNA genes are incorporated to create the Ensembl gene set, which is then cross-referenced with external data sources. For some species (human, mouse, rat, zebrafish and pig) the HAVANA group produces manually curated gene sets. These annotations are merged with our Ensembl gene set to produce the final merged gene set. In the case of mouse and human, the merged sets comprise the GENCODE sets of genes.

**Figure 2.**
The genome assembly. Vertebrate genome assemblies usually comprise a number of possible layers of information. In most cases, sequenced reads will be assembled into contigs. Contigs are assembled into scaffolds based on linkage data (e.g. paired reads, or markers), and these scaffolds may be assembled to produce chromosomes.
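The layering described in this caption can be illustrated with a small coordinate-lifting sketch. It is only an illustration, not the Ensembl implementation: the `Placement` class, the `lift` function, the sequence names and the offsets are all hypothetical, and every component is assumed to sit on the forward strand of its parent.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Placement:
    """Where a component (contig or scaffold) sits on its parent sequence."""
    parent: str   # name of the enclosing scaffold or chromosome (hypothetical)
    offset: int   # 1-based start of the component on the parent


def lift(name: str, pos: int, placements: Dict[str, Placement]) -> Tuple[str, int]:
    """Map a 1-based position on a component up to the top-level sequence.

    Assumes forward-strand placement at every layer (a simplification).
    """
    while name in placements:
        placement = placements[name]
        pos = placement.offset + pos - 1
        name = placement.parent
    return name, pos


# Hypothetical assembly: contig_2 lies on scaffold_1, which lies on chr1.
placements = {
    "contig_2": Placement(parent="scaffold_1", offset=1001),
    "scaffold_1": Placement(parent="chr1", offset=50001),
}

top_level = lift("contig_2", 10, placements)
print(top_level)
```

An assembly without chromosomes simply has no placement for its scaffolds, so the same function stops at the scaffold layer.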

**Figure 3.**
Projection of human *FGF10* to alpaca. The *FGF10* gene in alpaca was annotated by aligning the human and alpaca assemblies using BLASTz, and then projecting (copying) the human gene onto the alpaca genome. A novel structure, GeneScaffold_2975, was generated in the alpaca assembly by bringing together the shorter scaffolds that aligned to the human region containing the *FGF10* gene.

**Figure 4.**
Sample transcript models with supporting evidence for untranslated regions (UTRs). This figure shows sample transcript models from HAVANA (yellow) and Ensembl (red) aligned with supporting evidence from cDNAs (green), ESTs (purple) and proteins (orange). Darker colors in the alignments correspond with exons. Unfilled boxes at the ends of the transcripts represent UTRs. Support for the UTRs comes from the aligned cDNAs and ESTs but not from the proteins.

**Figure 5.**
LayerAnnotation method. Candidate transcript models produced by each of the model-building pipelines are assigned varying levels of priority. In this example, models produced by the Targeted pipeline (which uses same-species protein data) are placed in Layer 1 and are therefore given preference over models with overlapping exons from the other model-building pipelines. Models produced using RNA-seq data are placed in Layer 2 and are given priority over those produced by the Similarity pipeline (which uses protein data from other species) in Layer 3. Final models indicate those selected for the final Ensembl gene set. (A) Candidate transcript models were produced by three model-building pipelines. The final protein-coding models were selected from Layer 1. Untranslated regions (unfilled boxes) were added from an RNA-seq model in Layer 2. The two transcript models will later be collapsed into a single gene model. (B) Layer 1 contains no model that overlaps with the model in Layer 2, and so the model in Layer 2 is the final model. (C) Layer 1 and Layer 2 contain no models that overlap with that in Layer 3, so the model in Layer 3 is selected as the final one.
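The layer priority described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the LayerAnnotation code itself: the `Model` class, the `layer_filter` function, the model names and the coordinates are hypothetical, exon overlap is tested on coordinates alone (strand is ignored), and the addition of UTRs from lower layers shown in panel A is not modelled.

```python
from dataclasses import dataclass
from typing import List, Tuple

Exon = Tuple[int, int]  # inclusive (start, end) coordinates


@dataclass(frozen=True)
class Model:
    name: str
    layer: int               # 1 = highest priority
    exons: Tuple[Exon, ...]


def exons_overlap(a: Model, b: Model) -> bool:
    """True if any exon of a overlaps any exon of b."""
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def layer_filter(models: List[Model]) -> List[Model]:
    """Keep a model only if no model from a higher-priority layer overlaps it.

    Models within the same layer are all kept, even if they overlap each other.
    """
    accepted: List[Model] = []
    for layer in sorted({m.layer for m in models}):
        current = [m for m in models if m.layer == layer]
        kept = [m for m in current
                if not any(exons_overlap(m, a) for a in accepted)]
        accepted.extend(kept)
    return accepted


# Hypothetical candidates mirroring panels A to C of the figure.
candidates = [
    Model("targeted_1", 1, ((100, 200), (300, 400))),
    Model("targeted_2", 1, ((100, 200), (350, 400))),
    Model("rnaseq_1", 2, ((150, 200), (300, 450))),
    Model("rnaseq_2", 2, ((1000, 1100),)),
    Model("similarity_1", 3, ((1050, 1150),)),
    Model("similarity_2", 3, ((5000, 5100),)),
]

selected = layer_filter(candidates)
print([m.name for m in selected])
```

Both Layer 1 models survive because filtering only compares a model against higher-priority layers, which matches panel A, where two transcript models from the same layer are kept and later collapsed into one gene.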

**Figure 6.**
Merging gene and transcript models. For both Ensembl and HAVANA models, transcripts with overlapping exons are grouped together into genes. (A) If the intron–exon boundaries, excluding UTRs, of a transcript from HAVANA completely match those of one from Ensembl the result is a merged transcript model, which is always based on the HAVANA annotation. If the intron–exon boundaries do not completely match then the two models are treated as separate transcripts belonging to the same gene. (B) Exons for a HAVANA gene overlap with those for an Ensembl gene. All transcripts are grouped together in the same merged gene. The intron–exon boundaries for one HAVANA and one Ensembl transcript match perfectly so they are merged to create the merged transcript shown in yellow. (C) Exons for Ensembl and HAVANA transcripts overlap but there are no transcripts with complete matching intron–exon boundaries. We still group the transcripts together into a merged gene but no transcripts are merged.
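The transcript-level rule in this caption can be sketched for a single locus as below. The sketch is illustrative rather than the actual merge code: `Transcript`, `intron_chain`, `merge_locus` and all coordinates are hypothetical, transcripts are assumed to be already grouped into one gene by exon overlap, and single-exon transcripts (which have no introns to compare) are simply left unmerged.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # inclusive (start, end) coordinates


@dataclass(frozen=True)
class Transcript:
    name: str
    coding_exons: Tuple[Exon, ...]  # UTRs already excluded


def intron_chain(t: Transcript) -> Tuple[Exon, ...]:
    """Introns implied by consecutive coding exons."""
    exons = sorted(t.coding_exons)
    return tuple((exons[i][1] + 1, exons[i + 1][0] - 1)
                 for i in range(len(exons) - 1))


def merge_locus(havana: List[Transcript],
                ensembl: List[Transcript]) -> List[Tuple[str, str]]:
    """Label each transcript in the merged gene as merged, havana or ensembl."""
    ensembl_chains: Dict[Tuple[Exon, ...], List[str]] = {}
    for e in ensembl:
        ensembl_chains.setdefault(intron_chain(e), []).append(e.name)

    results: List[Tuple[str, str]] = []
    matched: Set[str] = set()
    for h in havana:
        chain = intron_chain(h)
        partners = ensembl_chains.get(chain, []) if chain else []
        if partners:
            # The merged model is based on the HAVANA annotation.
            matched.update(partners)
            results.append(("merged", h.name))
        else:
            results.append(("havana", h.name))
    # Ensembl transcripts with no HAVANA partner stay as separate transcripts.
    results.extend(("ensembl", e.name) for e in ensembl if e.name not in matched)
    return results


havana = [
    Transcript("h1", ((100, 200), (300, 400), (500, 600))),
    Transcript("h2", ((100, 200), (500, 600))),
]
ensembl = [
    Transcript("e1", ((150, 200), (300, 400), (500, 550))),
    Transcript("e2", ((100, 200), (300, 420), (500, 600))),
]

merged_gene = merge_locus(havana, ensembl)
print(merged_gene)
```

In this example `h1` and `e1` share every intron, so they yield one merged transcript, while `h2` and `e2` remain separate transcripts of the same merged gene, as in panel B.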

**Figure 7.**
Annotation of patches. (A) Currently, we have two different types of patches: fix patches and novel patches. Both types are anchored to the assembly by shared sequence. Fix patches become part of the next major version of the assembly while novel patches remain as alternative sequence. (B) When annotating a novel patch, we first project gene models from the reference assembly. In this example, the HAVANA (red) and merged (yellow) genes are copied to the patch sequence. The Ensembl gene (blue) is not copied because the underlying genomic DNA is too different between the chromosome and the patch to enable the projection process. After projection, a patch will be annotated fully using the Ensembl annotation pipeline. In this case, two new gene models (green) have been annotated on the novel patch.

