PLoS One. 2017 Apr 12;12(4):e0175310.
doi: 10.1371/journal.pone.0175310. eCollection 2017.

SnoVault and encodeD: A novel object-based storage system and applications to ENCODE metadata

Benjamin C Hitz et al. PLoS One.

Abstract

The Encyclopedia of DNA Elements (ENCODE) project is an ongoing collaborative effort, initiated shortly after the completion of the Human Genome Project, to create a comprehensive catalog of functional elements. The current database exceeds 6500 experiments across more than 450 cell lines and tissues, using a wide array of experimental techniques to study the chromatin structure and the regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details become increasingly intricate and demand careful curation. The ENCODE DCC has created a general-purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata, and a robust API for querying the metadata. The software is fully open-source; code and installation instructions can be found at http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ (for storing genomic data in the manner of ENCODE). The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data), has been released as a separate Python package.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Fig 1. Object graph and embedding.
The normalized object graph (linked set of documents) stored in PostgreSQL is framed as an expanded JSON-LD document (embedded set of documents) before indexing into Elasticsearch and rendering in JavaScript.
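The framing step described in this caption can be illustrated with a minimal sketch. It is not SnoVault's implementation: the in-memory OBJECTS store, the embed function, and the field names are all hypothetical, and a plain dictionary stands in for PostgreSQL. It only shows how a linked set of documents might be expanded into one embedded document before indexing.

```python
from typing import Any, Dict, FrozenSet, Set

# Hypothetical normalized store: each document is keyed by its path and
# refers to other documents by path (the "linked set of documents").
OBJECTS: Dict[str, Dict[str, Any]] = {
    "/experiments/EXP1/": {
        "@id": "/experiments/EXP1/",
        "assay": "ChIP-seq",
        "biosample": "/biosamples/BS1/",
    },
    "/biosamples/BS1/": {
        "@id": "/biosamples/BS1/",
        "term": "keratinocyte",
        "donor": "/donors/D1/",
    },
    "/donors/D1/": {"@id": "/donors/D1/", "organism": "human"},
}


def embed(
    path: str,
    objects: Dict[str, Dict[str, Any]],
    embedded_fields: Set[str],
    _seen: FrozenSet[str] = frozenset(),
) -> Dict[str, Any]:
    """Return a copy of the document at `path` with selected links expanded.

    Links are followed only for fields named in `embedded_fields`, and a
    path already on the current branch is left as a link to avoid cycles.
    """
    doc = dict(objects[path])  # shallow copy; the store is not mutated
    seen = _seen | {path}
    for field in embedded_fields:
        target = doc.get(field)
        if isinstance(target, str) and target in objects and target not in seen:
            doc[field] = embed(target, objects, embedded_fields, seen)
    return doc


# Expand the experiment into a single embedded document.
framed = embed("/experiments/EXP1/", OBJECTS, {"biosample", "donor"})
```

In this sketch, framed["biosample"]["donor"] is the full donor document rather than a path, while the stored documents keep their links unchanged.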
Fig 2. Software Stack.
Schematic diagram of software stack showing different paths for page rendering. The most efficient rendering is with the HTML rendered on the server and the embedded documents indexed in Elasticsearch.
Fig 3. Rendering Overview.
Initial page loads are rendered to HTML on the server for immediate display on the client. Once the JavaScript is fully loaded, subsequent page loads can be rendered on the client.
Fig 4. Example of the “missing donor” audit.
This audit appears on relevant web pages (and in the JSON returned from the API), alerting submitters and DCC data wranglers that this epidermal keratinocyte biosample is missing the (human) donor object from whom it was derived.
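An audit of this kind can be thought of as a function that inspects one metadata object and returns zero or more findings. The sketch below is only an illustration under stated assumptions: the function name audit_missing_donor, the field names, and the "ERROR" level are hypothetical and are not taken from the encodeD code base.

```python
from typing import Any, Dict, List


def audit_missing_donor(biosample: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flag a human biosample that has no linked donor object.

    Field names ("organism", "donor", "@id") and the audit record layout
    are assumptions made for this example.
    """
    if biosample.get("organism") == "human" and not biosample.get("donor"):
        return [
            {
                "category": "missing donor",
                "level": "ERROR",  # hypothetical severity label
                "detail": "Biosample {} has no donor.".format(biosample["@id"]),
            }
        ]
    return []


# A biosample without a donor yields one finding; one with a donor yields none.
flagged = audit_missing_donor(
    {"@id": "/biosamples/BS1/", "organism": "human", "term": "keratinocyte"}
)
clean = audit_missing_donor(
    {"@id": "/biosamples/BS2/", "organism": "human", "donor": "/donors/D1/"}
)
```

Returning a list rather than raising an exception lets the findings travel with the object, which matches the caption's description of audits appearing both on web pages and in the JSON output.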
Fig 5. The ENCODE portal data matrix.
This page shows all experiments released at the ENCODE Portal, including those from the Roadmap for Epigenomic Mapping Consortium (REMC). Experiments are organized by their biosample (tissue, cell or cell line) on the Y-axis and by assay type on the X-axis. (A) Facets select specific properties, such as target (histone, transcription factor) or experimental type (ChIP-seq, RNA-seq, etc.). (B) Facets apply specifically to the biosample, including organism (human, mouse, fly, worm), type (tissue, immortalized cell line, stem cell, etc.) or organ system (as inferred from ontological relations).
Fig 6. The ENCODE portal report page.
This page shows where metadata can be downloaded in spreadsheet format (CSV). Report views exist for all collection searches, for example, Experiment, Biosample or Antibody. (A) Standard ENCODE Assay facets filter the rows of interest (in this case Experiments). (B) Toggle between the report, matrix, and standard search output. (C) Select the individual properties that will appear as columns of your spreadsheet.
