Comparative Study
Genome Res. 2004 May;14(5):942-50.
doi: 10.1101/gr.1858004.

The Ensembl automatic gene annotation system

Val Curwen et al. Genome Res. 2004 May.

Abstract

As more genomes are sequenced, there is an increasing need for automated first-pass annotation that allows timely access to important genomic information. The Ensembl gene-building system enables fast automated annotation of eukaryotic genomes. It annotates genes based on evidence derived from known protein, cDNA, and EST sequences. The gene-building system rests on top of the core Ensembl (MySQL) database schema and Perl Application Programming Interface (API), and the data generated are accessible through the Ensembl genome browser (http://www.ensembl.org). To date, Ensembl predicted gene sets are available for the A. gambiae, C. briggsae, zebrafish, mouse, rat, and human genomes, and they have been heavily relied upon in the published analyses of the human, mouse, rat, and A. gambiae genome sequences. Here we describe in detail the gene-building system and the algorithms involved. All code and data are freely available from http://www.ensembl.org.

Figures

Figure 1
Overview of Ensembl Gene Build. Most genes are predicted using the sequences of known proteins aligned to the genome using genewise (Targetted and Similarity builds). UTR sequences for these genes are derived from the alignment of cDNAs to the genomic sequence (Exonerate, cDNA Gene Build). Transcripts created in this manner are then clustered to form genes (GeneBuilder). Finally, novel genes supported solely by cDNA evidence are added to the gene set, which is written to the database.
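The clustering step in the overview can be illustrated with a small sketch. The caption does not state the clustering criterion, so the rule used here (two transcripts belong to the same gene when they lie on the same strand and share at least one overlapping exon) is an assumption, as are the function names and the input layout. It is written in Python for brevity and is not the Ensembl Perl code.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), half-open coordinates


def exons_overlap(a: Exon, b: Exon) -> bool:
    """True when two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def cluster_transcripts(
    transcripts: Dict[str, Tuple[str, List[Exon]]]
) -> List[List[str]]:
    """Group transcript names into gene clusters.

    `transcripts` maps a name to (strand, exons). The overlap criterion is
    an assumption made for this sketch, not a rule taken from the caption.
    """
    names = sorted(transcripts)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        strand_a, exons_a = transcripts[a]
        for b in names[i + 1:]:
            strand_b, exons_b = transcripts[b]
            if strand_a == strand_b and any(
                exons_overlap(x, y) for x in exons_a for y in exons_b
            ):
                parent[find(a)] = find(b)  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values())
```

Because the merge is transitive, two transcripts that share no exon directly still end up in the same gene if a third transcript overlaps both.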
Figure 2
The Miniseq: We use a miniseq representation of genomic sequence in various stages of the gene build in order to reduce search space and increase processing speed. We BLAST a sequence of interest against a genomic region and pad the resulting hits with 200 bp. We then join the padded hits together to form a “mini genomic” sequence containing only exon sequence plus a small amount of intron sequence.
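The miniseq construction described in this caption can be sketched as follows. The 200 bp padding comes from the caption. Everything else is an assumption made for illustration: hits are given as half-open (start, end) coordinates on a genomic string, overlapping padded hits are merged, and the names `build_miniseq` and `mini_to_genomic` are hypothetical. The sketch is in Python and does not reproduce the Ensembl Perl implementation.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end) in genomic coordinates, half-open

PADDING = 200  # bp added on each side of a hit, as stated in the caption


def build_miniseq(
    genomic: str, hits: List[Block], padding: int = PADDING
) -> Tuple[str, List[Block]]:
    """Pad each hit, merge padded hits that touch or overlap, and join them.

    Returns the "mini genomic" sequence and the genomic blocks it was
    assembled from, so that coordinates can be mapped back later.
    """
    padded = sorted(
        (max(0, start - padding), min(len(genomic), end + padding))
        for start, end in hits
    )
    blocks: List[Block] = []
    for start, end in padded:
        if blocks and start <= blocks[-1][1]:
            # Padded regions touch or overlap: extend the previous block.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    miniseq = "".join(genomic[start:end] for start, end in blocks)
    return miniseq, blocks


def mini_to_genomic(pos: int, blocks: List[Block]) -> int:
    """Convert a 0-based miniseq position back to a genomic position."""
    offset = 0
    for start, end in blocks:
        length = end - start
        if pos < offset + length:
            return start + (pos - offset)
        offset += length
    raise ValueError("position lies outside the miniseq")
```

Keeping the block list alongside the sequence is what lets a prediction made on the miniseq be reported in genomic coordinates.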
Figure 3
Rules for adding UTRs to genewise predictions: (A) Simplest case: Ends of exons A and D coincide, thus exon A is extended to include the UTR and the translation start is maintained. Starts of exons C and F coincide, thus UTR exons are added and the translation stop is maintained. The coordinates of genewise-derived exon B are used in preference over exon F. (B) cDNA prediction rejected: Neither the ends of exons G and I nor the starts of exons H and J coincide, so the genewise-predicted structure is unmodified. (C) cDNA prediction with short exons: The ends of the exons K and M and the starts of exons L and N coincide. Even though K is shorter than M, it is not the first exon of the cDNA prediction and is thus retained. However, N is shorter than L and there are no additional exons, so it is rejected.
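One reading of these rules is sketched below, restricted to the forward strand and to exon coordinates only. The function name `add_utrs` is hypothetical. The sketch assumes that each end of the prediction is evaluated independently, and that when a matching cDNA exon is accepted the boundary exon takes the cDNA exon's outer coordinate; the caption does not spell out either point. Translation start and stop are not modeled. It is written in Python, not taken from the Ensembl code base.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end) on the forward strand, start < end


def add_utrs(genewise: List[Exon], cdna: List[Exon]) -> List[Exon]:
    """Extend a genewise exon structure with UTR exons from a cDNA prediction."""
    first_start, first_end = genewise[0]
    last_start, last_end = genewise[-1]
    upstream: List[Exon] = []
    downstream: List[Exon] = []

    # 5' side: a cDNA exon must end where the first genewise exon ends.
    for i, (c_start, c_end) in enumerate(cdna):
        if c_end == first_end:
            extra = cdna[:i]
            # A shorter cDNA exon is used only if further cDNA exons
            # lie upstream of it (case C in the caption).
            if c_start < first_start or extra:
                first_start, upstream = c_start, extra
            break

    # 3' side: a cDNA exon must start where the last genewise exon starts.
    for i, (c_start, c_end) in enumerate(cdna):
        if c_start == last_start:
            extra = cdna[i + 1:]
            if c_end > last_end or extra:
                last_end, downstream = c_end, extra
            break

    core = list(genewise)  # internal genewise exons keep their coordinates
    core[0] = (first_start, core[0][1])
    core[-1] = (core[-1][0], last_end)
    return upstream + core + downstream
```

With a genewise prediction `[(100, 200), (300, 400)]` and a cDNA prediction `[(50, 190), (310, 450)]`, no exon boundaries coincide, so the genewise structure is returned unchanged, which corresponds to case B.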
