Petabyte-scale innovations at the European Nucleotide Archive

PMID: 18978013
PMCID: PMC2686451
DOI: 10.1093/nar/gkn765

Guy Cochrane et al. Nucleic Acids Res. 2009 Jan.

Nucleic Acids Res. 2009 Jan;37(Database issue):D19-25.

doi: 10.1093/nar/gkn765. Epub 2008 Oct 31.

Affiliation

¹ EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. cochrane@ebi.ac.uk

Abstract

Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next generation sequence data and a brief summary of the contents of the ENA, and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.

Figures

**Figure 1.**
ENA structure. The figure shows how nucleotide sequencing information is partitioned according to class; ENA-Reads treats raw sequencing information, ENA-Assembly treats information on how fragmented sequences have been assembled into higher order structures and ENA-Annotation treats functional annotation based on assembled sequence. The three components are integrated in the ENA.

**Figure 2.**
Webin. The figure shows a selection of screenshots from Webin; (a) launcher page, (b) submissions page, (c) source feature page and (d) new feature addition page.

**Figure 3.**
Throughput for validated ENA-Annotation entries. The figure shows cumulative counts of ENA-Annotation entries that have been processed by ENA biologists.

**Figure 4.**
Structure of ENA-Reads. A relational data model has been developed for next generation sequencing data. It relates a study to the samples used for that study and to the runs executed as part of the experiments that make up the study, which describe how samples have been configured in runs. Underlying this data model is an API that abstracts away the nature of the data file system: it returns read data on request based on read identifiers (and groupings of these identifiers), rather than on specified files within the file system.
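
The relationships described in the Figure 4 caption can be illustrated with a minimal sketch. Everything below is hypothetical: the class names (`Study`, `Sample`, `Experiment`, `Run`, `ReadStore`), the field names, and the in-memory dictionary standing in for the storage layer are assumptions made for illustration and are not the ENA-Reads schema or API. The sketch only shows the idea of requesting reads by identifier, or by a grouping of identifiers such as a run, without the caller knowing which files hold them.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Sample:
    sample_id: str
    description: str = ""


@dataclass
class Run:
    run_id: str
    sample_ids: List[str]  # samples configured in this run
    read_ids: List[str] = field(default_factory=list)


@dataclass
class Experiment:
    experiment_id: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Study:
    study_id: str
    samples: Dict[str, Sample] = field(default_factory=dict)
    experiments: List[Experiment] = field(default_factory=list)

    def runs(self) -> List[Run]:
        # A study reaches its runs through the experiments that make it up.
        return [run for exp in self.experiments for run in exp.runs]


class ReadStore:
    """Hypothetical read-access layer keyed by read identifier.

    A dictionary stands in for the real storage back end; callers never
    see file names or paths.
    """

    def __init__(self, reads_by_id: Dict[str, str]) -> None:
        self._reads = dict(reads_by_id)

    def get_reads(self, read_ids: Iterable[str]) -> Dict[str, str]:
        # Look up individual reads by identifier.
        return {rid: self._reads[rid] for rid in read_ids}

    def get_reads_for_run(self, run: Run) -> Dict[str, str]:
        # A run acts as a grouping of read identifiers.
        return self.get_reads(run.read_ids)


def build_example():
    # Made-up identifiers and sequences, for illustration only.
    sample = Sample("sample1", "illustrative sample")
    run = Run("run1", sample_ids=["sample1"], read_ids=["read1", "read2"])
    study = Study(
        "study1",
        samples={sample.sample_id: sample},
        experiments=[Experiment("exp1", runs=[run])],
    )
    store = ReadStore({"read1": "ACGT", "read2": "TTGACA"})
    return study, store


if __name__ == "__main__":
    study, store = build_example()
    for run in study.runs():
        print(run.run_id, store.get_reads_for_run(run))
```

Because callers address reads only by identifier, the storage layout behind a class like `ReadStore` could change without affecting them, which is the kind of abstraction the caption describes.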
