The Ensembl analysis pipeline

Simon C Potter¹, Laura Clarke, Val Curwen, Stephen Keenan, Emmanuel Mongin, Stephen M J Searle, Arne Stabenau, Roy Storey, Michele Clamp

PMID: 15123589
PMCID: PMC479123
DOI: 10.1101/gr.1859804

Simon C Potter et al. Genome Res. 2004 May;14(5):934-41. doi: 10.1101/gr.1859804.

Affiliation

¹ The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Abstract

The Ensembl pipeline is an extension to the Ensembl system which allows automated annotation of genomic sequence. The software comprises two parts. First, there is a set of Perl modules ("Runnables" and "RunnableDBs") which are 'wrappers' for a variety of commonly used analysis tools. These retrieve sequence data from a relational database, run the analysis, and write the results back to the database. They inherit from a common interface, which simplifies the writing of new wrapper modules. On top of this sits a job submission system (the "RuleManager") which allows efficient and reliable submission of large numbers of jobs to a compute farm. Here we describe the fundamental software components of the pipeline, and we also highlight some features of the Sanger installation which were necessary to enable the pipeline to scale to whole-genome analysis.

Figures

**Figure 1**
Ensembl pipeline system overview: The RuleManager uses LSF to submit analysis jobs to the compute farm. When an individual job starts executing on a remote node, the Runner script fetches the job information from the database and recreates the Job object. This in turn creates a RunnableDB and calls the appropriate methods (fetch_input, run, write_output, etc.) to run the analysis.
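The lifecycle named in this caption (fetch_input, run, write_output) can be illustrated with a minimal sketch. The real modules are written in Perl; the Python below is only a stand-in under stated assumptions. The in-memory dictionary replacing the relational database, the `GCContent` toy analysis, and the `run_job` helper are all hypothetical names introduced for illustration, not part of the Ensembl code base.

```python
from abc import ABC, abstractmethod


class RunnableDB(ABC):
    """Hypothetical Python stand-in for a pipeline wrapper module."""

    def __init__(self, db, input_id):
        self.db = db              # plain dict standing in for the relational database
        self.input_id = input_id
        self.sequence = None
        self.results = []

    def fetch_input(self):
        # Retrieve the sequence for this input ID from the "database".
        self.sequence = self.db["sequence"][self.input_id]

    @abstractmethod
    def run(self):
        """Run the wrapped analysis and store its output in self.results."""

    def write_output(self):
        # Write the results back to the "database".
        self.db["results"].setdefault(self.input_id, []).extend(self.results)


class GCContent(RunnableDB):
    """Toy analysis (assumed for illustration): fraction of G/C bases."""

    def run(self):
        seq = self.sequence.upper()
        gc = sum(1 for base in seq if base in "GC")
        self.results = [gc / len(seq)]


def run_job(db, runnable_class, input_id):
    # Roughly what happens on the remote node: build the wrapper, then
    # call the three lifecycle methods in order.
    runnable = runnable_class(db, input_id)
    runnable.fetch_input()
    runnable.run()
    runnable.write_output()
    return runnable.results


db = {"sequence": {"contig1": "ACGTGGCC"}, "results": {}}
run_job(db, GCContent, "contig1")
```

Because only `run` is abstract, a new wrapper in this sketch needs to supply just the analysis step, which mirrors the abstract's point that a common interface simplifies writing new wrapper modules.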

**Figure 2**
The central loop of the RuleManager, showing the procedure the code goes through while submitting jobs to the compute farm. (A) The main loop, during which the RuleManager processes all of the input IDs, submits jobs that can run, processes the accumulators, and marks analyses that have completely finished. (B) The process an individual input ID undergoes: it is checked against each of the rules, and jobs are submitted for those rules that can be executed. Accumulators are a specific type of analysis that the RuleManager script recognizes and handles appropriately. (C) Accumulator analyses mark other analyses that must be complete on every possible input ID of their type before any dependent analyses can be executed. This panel shows how the RuleManager processes the accumulators at the end of each loop: accumulators that were not marked as incomplete while the input IDs were being processed, and that are not already complete, are submitted to the system.
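The three panels can be approximated by a single pass function. This is a simplified Python sketch, not the RuleManager's actual Perl code: the `Rule` tuple, the function name, the analysis names (`mask`, `predict`, `merge`, `predict_done`), and the `"ACCUMULATOR"` placeholder input ID are all assumptions made for illustration. It also ignores jobs that are submitted but not yet finished, which a real system would have to track.

```python
from collections import namedtuple

# goal: analysis to run; conditions: analyses that must already be complete.
Rule = namedtuple("Rule", ["goal", "conditions"])


def rule_manager_pass(input_ids, rules, accumulators, completed, finished_accumulators):
    """One pass of a simplified submission loop.

    completed maps input ID -> set of analyses already done for that ID.
    accumulators maps accumulator name -> set of analyses it waits on.
    Returns the list of (analysis, input_id) pairs that would be submitted.
    """
    submitted = []
    incomplete = set()
    for input_id in input_ids:
        done = completed[input_id]
        # Panel B: check this input ID against every rule.
        for rule in rules:
            if rule.goal in done:
                continue
            if all(c in done or c in finished_accumulators for c in rule.conditions):
                submitted.append((rule.goal, input_id))
        # Mark accumulators whose prerequisites are missing for this input ID.
        for name, waits_on in accumulators.items():
            if not waits_on <= done:
                incomplete.add(name)
    # Panel C: submit accumulators that are neither incomplete nor already done.
    for name in accumulators:
        if name not in incomplete and name not in finished_accumulators:
            submitted.append((name, "ACCUMULATOR"))
    return submitted


rules = [Rule("mask", []), Rule("predict", ["mask"]), Rule("merge", ["predict_done"])]
accumulators = {"predict_done": {"predict"}}
completed = {"c1": {"mask", "predict"}, "c2": {"mask"}}

first = rule_manager_pass(["c1", "c2"], rules, accumulators, completed, set())
completed["c2"].add("predict")  # pretend the submitted job has finished
second = rule_manager_pass(["c1", "c2"], rules, accumulators, completed, set())
third = rule_manager_pass(["c1", "c2"], rules, accumulators, completed, {"predict_done"})
```

In this toy run the first pass submits only `predict` for `c2`, the second pass submits the accumulator once every input ID has `predict`, and the third pass releases the dependent `merge` jobs.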

**Figure 3**
Pipeline control flow. This figure provides an example of the dependencies that can exist within the system. Analyses that have one dependency must maintain their input ID type, but analyses that have multiple dependencies can alter their input ID type if an 'accumulator' analysis is used (see the move from Swall to Similarity_genewise).
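One reading of this constraint can be written as a small validity check. The sketch below is Python and purely illustrative: the `Analysis` class, the `rule_is_valid` function, the `Swall_done` accumulator, and the `CONTIG` and `SLICE` input ID types are assumptions, since the caption names only Swall and Similarity_genewise and does not state their types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    name: str
    input_id_type: str            # assumed label, e.g. "CONTIG" or "SLICE"
    is_accumulator: bool = False


def rule_is_valid(goal, conditions):
    """Check the input ID type constraint for one dependency rule."""
    changes_type = any(c.input_id_type != goal.input_id_type for c in conditions)
    if not changes_type:
        # The input ID type is maintained, so the rule is always allowed.
        return True
    # A type change needs multiple dependencies, one of them an accumulator.
    uses_accumulator = any(c.is_accumulator for c in conditions)
    return len(conditions) > 1 and uses_accumulator


swall = Analysis("Swall", "CONTIG")                                   # type assumed
swall_done = Analysis("Swall_done", "CONTIG", is_accumulator=True)    # hypothetical
genewise = Analysis("Similarity_genewise", "SLICE")                   # type assumed

direct = rule_is_valid(genewise, [swall])
via_accumulator = rule_is_valid(genewise, [swall, swall_done])
```

Under these assumptions, a direct dependency that changes the input ID type is rejected, while the same change routed through an accumulator is accepted.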

Grants and funding

WT_/Wellcome Trust/United Kingdom
