eHive: an artificial intelligence workflow system for genomic analysis

Jessica Severin¹, Kathryn Beal, Albert J Vilella, Stephen Fitzgerald, Michael Schuster, Leo Gordon, Abel Ureta-Vidal, Paul Flicek, Javier Herrero

PMID: 20459813
PMCID: PMC2885371
DOI: 10.1186/1471-2105-11-240

Jessica Severin et al. BMC Bioinformatics. 2010.

BMC Bioinformatics. 2010 May 11;11:240.

Affiliation

¹ European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, UK.

Abstract

Background: The Ensembl project produces updates to its comparative genomics resources with each of its several releases per year. During each release cycle approximately two weeks are allocated to generate all the genomic alignments and the protein homology predictions. The number of calculations required for this task grows approximately quadratically with the number of species. We currently support 50 species in Ensembl and we expect the number to continue to grow in the future.
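The quadratic growth can be illustrated with a short calculation. This is only a sketch: it assumes the dominant cost is one comparison per unordered pair of species, which the abstract does not state explicitly, and the function name is hypothetical.

```python
def pairwise_comparisons(n_species: int) -> int:
    """Number of unordered species pairs: n * (n - 1) / 2 (an assumed cost model)."""
    return n_species * (n_species - 1) // 2


# Doubling the number of species roughly quadruples the number of pairs.
for n in (10, 25, 50, 100):
    print(n, pairwise_comparisons(n))
```

Under this assumption, 50 species correspond to 1225 pairs and 100 species to 4950.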

Results: We present eHive, a new fault tolerant distributed processing system initially designed to support comparative genomic analysis, based on blackboard systems, network distributed autonomous agents, dataflow graphs and block-branch diagrams. In the eHive system a MySQL database serves as the central blackboard and the autonomous agent, a Perl script, queries the system and runs jobs as required. The system allows us to define dataflow and branching rules to suit all our production pipelines. We describe the implementation of three pipelines: (1) pairwise whole genome alignments, (2) multiple whole genome alignments and (3) gene trees with protein homology inference. Finally, we show the efficiency of the system in real case scenarios.
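The blackboard idea can be sketched in a few lines. This is not eHive's schema or code: it uses Python and an in-memory SQLite database as a stand-in for the MySQL blackboard and the Perl agent, and the table names, column names and the `align` and `chain` analyses are all hypothetical.

```python
import sqlite3


def make_blackboard():
    """Create a toy blackboard: a job list plus dataflow rules (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE job (
            job_id   INTEGER PRIMARY KEY,
            analysis TEXT NOT NULL,
            input    TEXT NOT NULL,
            status   TEXT NOT NULL DEFAULT 'READY'
        );
        CREATE TABLE dataflow_rule (
            from_analysis TEXT NOT NULL,
            to_analysis   TEXT NOT NULL
        );
    """)
    return db


def claim_job(db, analysis):
    """Pick the oldest READY job of an analysis and mark it RUNNING.

    Single-process sketch: a real multi-worker system needs an atomic claim.
    """
    row = db.execute(
        "SELECT job_id, input FROM job "
        "WHERE analysis = ? AND status = 'READY' ORDER BY job_id LIMIT 1",
        (analysis,),
    ).fetchone()
    if row is not None:
        db.execute("UPDATE job SET status = 'RUNNING' WHERE job_id = ?", (row[0],))
    return row


def finish_job(db, job_id, analysis, output):
    """Mark a job DONE and follow the dataflow rules to create downstream jobs."""
    db.execute("UPDATE job SET status = 'DONE' WHERE job_id = ?", (job_id,))
    targets = db.execute(
        "SELECT to_analysis FROM dataflow_rule WHERE from_analysis = ?", (analysis,)
    ).fetchall()
    for (target,) in targets:
        db.execute("INSERT INTO job (analysis, input) VALUES (?, ?)", (target, output))


def run_worker(db, analysis, run):
    """Agent loop: query the blackboard and run jobs until none are left."""
    done = 0
    while (job := claim_job(db, analysis)) is not None:
        job_id, job_input = job
        finish_job(db, job_id, analysis, run(job_input))
        done += 1
    return done


db = make_blackboard()
db.execute("INSERT INTO dataflow_rule VALUES ('align', 'chain')")
db.executemany(
    "INSERT INTO job (analysis, input) VALUES ('align', ?)", [("chunk1",), ("chunk2",)]
)
n_align = run_worker(db, "align", lambda x: x + ".aln")      # creates two 'chain' jobs
n_chain = run_worker(db, "chain", lambda x: x + ".chain")    # no rule follows 'chain'
```

The point of the design is that workers never talk to each other: all coordination happens through rows on the blackboard, so a crashed worker leaves a visible state that can be inspected or retried.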

Conclusions: eHive allows us to produce computationally demanding results in a reliable and efficient way with minimal supervision and high throughput. Further documentation is available at: http://www.ensembl.org/info/docs/eHive/.

Figures

**Figure 1**
**eHive system overview**. The eHive system is based on a Blackboard System implemented as a MySQL database. The blackboard contains a list of all the jobs to run as well as the dataflow and branching rules. The operator monitors and controls the system using a program called beekeeper, which connects to the blackboard and creates workers as required. Workers run in a queuing environment, typically LSF. Each worker runs jobs for a particular analysis until no more jobs are available or it reaches the end of its one-hour lifespan. eHive also keeps track of the throughput of the pipeline as it runs.
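The two behaviours described in this caption, a beekeeper that creates workers as required and workers that stop when jobs run out or their lifespan ends, might look roughly like the following. This is a hedged sketch, not the beekeeper's actual logic: the function names and the `jobs_per_worker` and `max_workers` parameters are assumptions introduced for illustration.

```python
import math
import time
from collections import deque


def workers_needed(ready_jobs, running_workers, jobs_per_worker, max_workers):
    """How many new workers to submit, given the current load (assumed policy)."""
    wanted = math.ceil(ready_jobs / jobs_per_worker)
    return max(0, min(wanted, max_workers) - running_workers)


def run_worker_with_lifespan(queue, run_job, lifespan_seconds, clock=time.monotonic):
    """Run jobs until the queue is empty or the worker's lifespan has elapsed."""
    started = clock()
    completed = 0
    while queue and clock() - started < lifespan_seconds:
        run_job(queue.popleft())
        completed += 1
    return completed


# Usage with a fake clock that advances one "second" per call.
ticks = iter(range(100))
queue = deque(range(10))
done = run_worker_with_lifespan(
    queue, lambda job: None, lifespan_seconds=3, clock=lambda: next(ticks)
)
```

A bounded lifespan means a worker occupies a queue slot only briefly, so the scheduler can rebalance resources between analyses as the pipeline progresses.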

**Figure 3**
**Pairwise Alignment Pipeline**. Each analysis is represented by a blue box. Blue arrows show the flow of information from one analysis to the next, either through dataflow rules (solid arrows) or through the massive creation of new jobs as part of the analysis (dashed arrows). Red arrows represent control rules, i.e. an analysis that cannot start until the previous one has finished. Black arrows show the creation of new analyses during the execution of the pipeline. Turquoise arrows show alternative paths taken when a particular job fails. Green arrows mark the initial jobs required to run the pipeline.

**(A)** First part of the Pairwise alignment pipeline, where we build the set of raw alignments. First, the ChunkAndGroupDna module creates one DNA Collection for each genome. CreatePairAlignerJobs and CreateFilterDuplicateJobs create the BlastZ, QueryFilterDuplicates and TargetFilterDuplicates jobs for these DNA Collections. The BlastZ analysis runs all the BLASTZ [16] jobs. To avoid border effects caused by the initial chunking of long chromosomes, we allow partially overlapping chunks. The QueryFilterDuplicates and TargetFilterDuplicates analyses remove the duplicates and resolve the inconsistencies in the overlap between these chunks of sequence. The UpdateMaxAlignmentLength analyses are needed to perform efficient "region queries" in a MySQL database.

**(B)** Second part of the Pairwise alignment pipeline, where raw alignments are chained and netted. The DumpLargeNibForChains module formats the input files for the axtChain program. The CreateAlignmentChainJobs process creates one AlignmentChains job per pair of genomic segments. The netting follows the same strategy: a single CreateAlignmentNetJobs job creates all the AlignmentNets jobs. Last, the PairwiseHealthCheck analysis runs a set of sanity tests on the resulting data.

**(C)** Timeline of this pipeline when aligning the human and the pika genomes.
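The overlapping-chunk strategy from panel A can be illustrated as follows. This is a minimal sketch under stated assumptions: the chunk size, the overlap and both function names are hypothetical, and the duplicate filter only removes exact duplicates after converting to chromosome coordinates, whereas the pipeline's filter analyses also resolve inconsistent overlaps.

```python
def chunk_with_overlap(seq_length, chunk_size, overlap):
    """Split [0, seq_length) into half-open chunks that share `overlap` bases."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if seq_length <= 0:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, seq_length)
        chunks.append((start, end))
        if end >= seq_length:
            break
        start += step
    return chunks


def filter_duplicates(hits_by_chunk):
    """Map chunk-local hits to chromosome coordinates and drop exact duplicates."""
    seen = set()
    unique = []
    for (chunk_start, _chunk_end), hits in sorted(hits_by_chunk.items()):
        for local_start, local_end in hits:
            hit = (chunk_start + local_start, chunk_start + local_end)
            if hit not in seen:
                seen.add(hit)
                unique.append(hit)
    return unique


chunks = chunk_with_overlap(seq_length=25, chunk_size=10, overlap=2)
# The hit at chromosome positions 8-10 lies in the overlap and is reported twice.
hits = filter_duplicates({(0, 10): [(8, 10)], (8, 18): [(0, 2), (5, 9)]})
```

Overlapping chunks ensure that an alignment spanning a chunk border is fully contained in at least one chunk, at the cost of reporting some hits twice, which is why a filtering step follows.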

**Figure 4**
**Multiple alignments pipeline**. Colours and conventions are the same as in Figure 3.

**(A)** This pipeline can be divided into four blocks. In the first block there is one job per species, which prepares all of the BLAST jobs. The second block (one job per coding exon) runs the BLAST jobs. In the third block, Mercator builds the orthology map using all the previous BLAST results. In the last block, each Mercator block is aligned with Pecan, and GERP defines the local conservation in each alignment. In this pipeline, the SubmitPep_X_Species and blast_X_Species analyses are created dynamically by the GenomeSubmitPep and GenomeDumpFasta jobs respectively. GenomeLoadExonMembers (one job per species) loads all the coding exons and creates one GenomeSubmitPep and one GenomeDumpFasta job for each genome. The GenomeSubmitPep analysis creates one SubmitPep_X_Species analysis per genome and all the jobs for each of these analyses. GenomeDumpFasta creates a BLAST database for each set of coding exons and the corresponding blast_X_Species analysis. CreateBlastRules creates all the dataflow rules between the SubmitPep_X_Species analysis of each species and the blast_X_Species analyses of all the other species. Mercator builds the orthology map using the results of all the previous BLAST jobs, and Pecan then aligns all the orthologous genomic segments in each Mercator block. Last, GERP is run on each Pecan alignment.

**(B)** Timeline for the 12-way Multiple alignment pipeline.
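The role of CreateBlastRules, connecting each species' SubmitPep analysis to the blast analyses of all the other species, can be sketched as a set of ordered pairs. The function below is an illustration only: its name, the species labels and the simplified analysis names `SubmitPep_<species>` and `blast_<species>` are assumptions modelled on the names in the caption.

```python
from itertools import permutations


def create_blast_rules(species):
    """One dataflow rule from each species' SubmitPep analysis to every
    other species' blast analysis (never to its own)."""
    return [
        (f"SubmitPep_{query}", f"blast_{target}")
        for query, target in permutations(species, 2)
    ]


rules = create_blast_rules(["human", "mouse", "dog"])
for source, destination in rules:
    print(source, "->", destination)
```

With n species this produces n * (n - 1) rules, so a 12-way alignment would need 132 of them under this reading of the caption.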

**Figure 5**
**The GeneTree pipeline**. Colours and conventions are the same as in Figure 3.

**(A)** The figure shows only the second half of the pipeline, as the first part is very similar to the first two blocks of the Multiple alignment pipeline (Figure 4A). The main difference is that we run BLAST on whole proteins instead of splitting them into coding exons. In short, the proteins are clustered and aligned, and a phylogenetic tree is built on top of each alignment. Then, the OrthoTree module calls orthologues and paralogues, and the last three modules handle the calculation of dN/dS values for pairs of proteins. This pipeline contains alternative routes, depicted in turquoise, that are used when particular exceptions are thrown, namely when Muscle is unable to align all the proteins in a cluster or when TreeBeST cannot infer the phylogenetic tree. This can happen when the cluster of proteins is too large. We use the BreakPAFCluster module to split these clusters into sub-groups and restart the alignment.

**(B)** Timeline for the GeneTree pipeline. This figure shows the progress of the GeneTree pipeline for Ensembl release 49 (39 species). The pipeline is monitored approximately every 2 minutes. BLAST and SubmitPep jobs co-occur in one phase of the pipeline. In another phase, Muscle, TreeBeST and OrthoTree also run at the same time.
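The alternative route taken when a cluster is too large can be sketched as a retry loop. This is a toy model, not the pipeline's code: `align_cluster` stands in for the Muscle step and simply fails above an assumed size limit, and `break_cluster` halves the cluster, which is a placeholder for whatever splitting BreakPAFCluster actually performs. All names and the `max_size` parameter are hypothetical.

```python
class ClusterTooLarge(Exception):
    """Raised by the stand-in aligner when a cluster exceeds the size limit."""


def align_cluster(proteins, max_size):
    """Stand-in for the alignment step: fails on clusters above max_size."""
    if len(proteins) > max_size:
        raise ClusterTooLarge(len(proteins))
    return tuple(sorted(proteins))


def break_cluster(proteins):
    """Placeholder split: cut the cluster into two halves."""
    mid = len(proteins) // 2
    return proteins[:mid], proteins[mid:]


def align_with_fallback(proteins, max_size):
    """Align a cluster, following the alternative route whenever alignment fails."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    pending = [list(proteins)]
    alignments = []
    while pending:
        cluster = pending.pop()
        try:
            alignments.append(align_cluster(cluster, max_size))
        except ClusterTooLarge:
            # Failure branch: split the cluster and queue both parts for a retry.
            pending.extend(break_cluster(cluster))
    return alignments


alignments = align_with_fallback(list("abcdefg"), max_size=3)
```

Treating the failure as a branch in the workflow, rather than as a fatal error, lets the pipeline continue without operator intervention.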

References

1. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, Meidl P, Overduin B, Parker A, Pritchard B, Rios D, Schuster M, Slater G, Smedley D, Spooner W, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wilder S, Zadissa A, Birney E, Cunningham F, Curwen V, Durbin R, Fernandez-Suarez XM, Herrero J, Kasprzyk A, Proctor G, Smith J, Searle S, Flicek P. Ensembl 2009. Nucleic Acids Res. 2009;37:D690–D697. doi: 10.1093/nar/gkn828.
2. Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A. BioMart--biological queries made easy. BMC Genomics. 2009;10:22. doi: 10.1186/1471-2164-10-22.
3. Reynolds CW. Flocks, herds and schools: A distributed behavioral model. Proceedings of the 14th annual conference on Computer graphics and interactive techniques. 1987. pp. 25–34.
4. Nii HP. The blackboard model of problem solving and the evolution of blackboard architectures. AI Magazine. 1986;7:38–53.
5. Nwana HS. Software agents: An overview. Knowledge Engineering Review. 1996;11:205–244. doi: 10.1017/S026988890000789X.
