Genome Res. 2004 May;14(5):971-5.
doi: 10.1101/gr.1866304.

The Ensembl computing architecture


James A Cuff et al. Genome Res. 2004 May.

Abstract

Ensembl is a software project to automatically annotate large eukaryotic genomes and release them freely into the public domain. The project currently automatically annotates 10 complete genomes. This makes very large demands on compute resources, due to the vast number of sequence comparisons that need to be executed. To circumvent the financial outlay often associated with classical supercomputing environments, farms of multiple, lower-cost machines have now become the norm and have been deployed successfully with this project. The architecture and design of farms containing hundreds of compute nodes is complex and nontrivial to implement. This study will define and explain some of the essential elements to consider when designing such systems. Server architecture and network infrastructure are discussed with a particular emphasis on solutions that worked and those that did not (often with fairly spectacular consequences). The aim of the study is to give the reader, who may be implementing a large-scale biocompute project, an insight into some of the pitfalls that may be waiting ahead.


Figures

Figure 1
Ensembl Computing infrastructure.
Figure 2
Database layout on an 8-node cluster. To enable distribution of computational load, remote devices can access the cluster alias to read from the replicated database. If write access is required, individual nodes must be specified. The cluster alias access is very efficient for large SQL select statements where speed is required.
Figure 3
Client code in wait state. Note, the CPU time is pitiful, the process state is in WAIT, as it is waiting for I/O operations on the NFS server. This can be seen from the kernel messages file above, which also shows the server (master) to be unresponsive.
Figure 4
Server code in kernel thread state. The kernel idle process here is the NFS kernel thread in the server trying desperately to serve NFS requests.
Figure 5
Traditional vs. SAN storage for clusters.

