The modENCODE Data Coordination Center: lessons in harvesting comprehensive experimental details

Nicole L Washington¹, E O Stinson, Marc D Perry, Peter Ruzanov, Sergio Contrino, Richard Smith, Zheng Zha, Rachel Lyne, Adrian Carr, Paul Lloyd, Ellen Kephart, Sheldon J McKay, Gos Micklem, Lincoln D Stein, Suzanna E Lewis


PMID: 21856757
PMCID: PMC3170170
DOI: 10.1093/database/bar023


Nicole L Washington et al. Database (Oxford). 2011 Aug 19;2011:bar023. doi: 10.1093/database/bar023. Print 2011.


Affiliation

¹ Lawrence Berkeley National Laboratory, Genomics Division, 1 Cyclotron Road MS64-121, Berkeley, CA 94720, USA.


Abstract

The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to characterize the genomes of Drosophila melanogaster and Caenorhabditis elegans. A Data Coordination Center (DCC) was created to collect, store and catalog modENCODE data. An effective DCC must gather, organize and provide all primary, interpreted and analyzed data, and ensure the community is supplied with the knowledge of the experimental conditions, protocols and verification checks used to generate each primary data set. We present here the design principles of the modENCODE DCC, and describe the ramifications of collecting thorough and deep metadata for describing experiments, including the use of a wiki for capturing protocol and reagent information, and the BIR-TAB specification for linking biological samples to experimental results. modENCODE data can be found at http://www.modencode.org.


Figures

**Figure 1.**
DCC workflow. Submitting data to the modENCODE DCC can be divided into four parts. It begins with discussions between a data provider and a DCC curator to determine the required metadata and data formats for a given category of submission. Once the submission template is made, the data provider can prepare and submit a data set to the DCC. The data set undergoes a series of automated and manual QC checks. If the submission does not pass these steps, it is returned to the data provider and/or the DCC curator for modification. Once a submission satisfies all requirements, and is approved by the DCC and data submitter, it is distributed to the community through the GBrowse genome browser, modMine query interface, graphical submission filtering tool and the public repositories of FB, WB and GEO.
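The vetting loop described in this caption can be sketched as a small function. This is a minimal illustration only, not the DCC's actual pipeline: the `Status` enum, the `vet` function and every check name below are hypothetical, and the two approvals are modeled as manual checks for simplicity.

```python
from enum import Enum


class Status(Enum):
    RETURNED = "returned for modification"
    RELEASED = "released"


def vet(submission, automated_checks, manual_checks):
    """Run automated checks, then manual checks.

    Returns (status, failed_check_names). The first group of checks
    that has any failure sends the submission back for modification.
    """
    for checks in (automated_checks, manual_checks):
        failed = [name for name, check in checks if not check(submission)]
        if failed:
            return Status.RETURNED, failed
    return Status.RELEASED, []


# Hypothetical checks, chosen only to mirror the caption's wording.
AUTOMATED_CHECKS = [
    ("has_idf", lambda s: "idf" in s["files"]),
    ("has_sdrf", lambda s: "sdrf" in s["files"]),
]
MANUAL_CHECKS = [
    ("dcc_approved", lambda s: s["dcc_approved"]),
    ("submitter_approved", lambda s: s["submitter_approved"]),
]

if __name__ == "__main__":
    incomplete = {"files": {"idf"}, "dcc_approved": True, "submitter_approved": True}
    print(vet(incomplete, AUTOMATED_CHECKS, MANUAL_CHECKS))
```

Returning the names of the failed checks, rather than a bare pass/fail flag, reflects the caption's point that a failing submission goes back to the data provider or curator with something to fix.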

**Figure 2.**
A model experiment submitted to the modENCODE DCC and its mapping to metadata components BIR-TAB SDRF and the wiki. The top half is a diagram of experimental steps for a model ChIP-seq experiment: a worm culture is prepared, the genomic DNA associated with chromatin is extracted, followed by division of the extraction into two biological replicates. These are further subdivided, with half of each DNA sample used as a control, while the other is exposed to a specific TF antibody in a ChIP step. The resulting materials are prepared for sequencing, and the data processed to identify the set of binding sites occupied by the TF tested. The corresponding BIR-TAB SDRF is shown in the bottom half, and mirrors the flow of experimental steps as indicated by the green (output) and blue (input) arrows. The inputs and outputs are the arcs connecting each protocol node of an experiment represented in the database. Each cell in a protocol column of the BIR-TAB file maps to a specific wiki page where the inputs and outputs of that protocol have been indicated. Most experimental parameters, such as strain and antibody, are also specified in the wiki. A reference to the wiki for these experimental parameters or results is indicated with a Term Source REF column immediately following the parameter column.
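The idea that inputs and outputs form the arcs between protocol nodes can be sketched as follows. This is a simplified illustration under stated assumptions: the step names, datum names and the `link_protocols` helper are all hypothetical, a single replicate is shown, and the real BIR-TAB SDRF is a tab-delimited table rather than a list of dictionaries.

```python
# Hypothetical, simplified protocol applications for one replicate of a
# ChIP-seq experiment. Names are illustrative, not taken from BIR-TAB.
STEPS = [
    {"protocol": "worm_culture", "inputs": [], "outputs": ["culture"]},
    {"protocol": "chromatin_extraction", "inputs": ["culture"], "outputs": ["dna_rep1"]},
    {"protocol": "chip", "inputs": ["dna_rep1", "antibody"], "outputs": ["chip_dna_rep1"]},
    {"protocol": "sequencing", "inputs": ["chip_dna_rep1"], "outputs": ["reads_rep1"]},
    {"protocol": "peak_calling", "inputs": ["reads_rep1"], "outputs": ["binding_sites"]},
]


def link_protocols(steps):
    """Return (producer, datum, consumer) arcs between protocol nodes.

    A datum that no step produces (here, the antibody) is an external
    parameter and yields no arc.
    """
    producers = {}
    for step in steps:
        for datum in step["outputs"]:
            producers[datum] = step["protocol"]

    arcs = []
    for step in steps:
        for datum in step["inputs"]:
            if datum in producers:
                arcs.append((producers[datum], datum, step["protocol"]))
    return arcs


if __name__ == "__main__":
    for producer, datum, consumer in link_protocols(STEPS):
        print(f"{producer} --[{datum}]--> {consumer}")
```

In this sketch the antibody has no producing step, which loosely corresponds to the caption's experimental parameters that are described on a wiki page rather than generated by an earlier protocol.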

**Figure 3.**
Screenshot of a modENCODE wiki tissue page using the DBFields template. In this example, WormBase cell and anatomy ontology (24) terms are selected to describe unc-4 expressing neurons in the L3 stage. The DBFields template for tissues was configured to include fields for a colloquial name, species, sex, tissue, contributing lab and related external URLs. The tissue field allows for multiple selections from the configured ontology; as the user starts to type a phrase (such as AVF), partial matches are displayed for selection and the corresponding definition is displayed on the right. After the user ‘Updates’ the form to accept the changes, an updated URL is displayed that refers specifically to this version of the wiki page. This URL is used in the BIR-TAB metadata documents to describe the sample, and the vetting software retrieves the field values during processing.
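The partial-match behavior of the tissue field can be sketched as a case-insensitive substring search. This is an assumption about the matching rule, not a description of the DBFields implementation; the `suggest` function is hypothetical and the definitions below are placeholders, not real WormBase ontology text.

```python
# Hypothetical term table: names stand in for ontology terms and the
# definitions are placeholders, not real ontology records.
TERMS = {
    "AVFL": "placeholder definition for AVFL",
    "AVFR": "placeholder definition for AVFR",
    "AVG": "placeholder definition for AVG",
    "VC4": "placeholder definition for VC4",
}


def suggest(typed, terms=TERMS):
    """Return sorted (name, definition) pairs whose name contains the typed text."""
    needle = typed.strip().lower()
    if not needle:
        return []
    return sorted(
        (name, definition)
        for name, definition in terms.items()
        if needle in name.lower()
    )


if __name__ == "__main__":
    for name, definition in suggest("avf"):
        print(name, "-", definition)
```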

**Figure 4.**
modENCODE data submission statistics. (A) Distribution of wiki page types. Number of wiki pages used in released submissions (dark gray) out of the total set entered in the wiki. The unused wiki pages may be used in future submissions. Data are from released data sets only, excluding those superseded, deprecated or rejected. (B) Distribution of submission package sizes. A scatterplot of individual package sizes (in GB, scale on left) is overlaid with the cumulative size of all modENCODE data (in TB, scale on right) over the course of the project. Black indicates the size of the files uploaded into the system by data providers, and is the minimal set required for backup; red indicates the total size of a processed submission, including GBrowse tracks, chadoxml and all versions of uploaded data, and is the maximum size required to maintain a complete history. (C) Composition of modENCODE data types. These are based on the cumulative submission file sizes in each category, including data sets that have been superseded, replaced and rejected. (D) Number of submissions over time. The plot reveals spikes in data submission. Dotted lines indicate when submissions were initially created; solid lines indicate when submissions were released in the pipeline. Red lines show cumulative counts; black lines show counts per week. Events such as scientific meetings or data freezes are indicated with blue circles. Project quarters are indicated (Year 1 Quarter 4 is abbreviated Y1Q4). All data, including superseded, replaced and rejected submissions, are shown. (E) Pipeline processing times grouped by data type. Average processing times (in minutes) for the three pipeline steps (validation, database loading and track finding) are shown for each type of data in released data sets.

**Figure 5.**
modENCODE submission interface. (A) The primary page for an example individual submission is shown. (B) New submissions are created by entering a name for the submission and selecting the appropriate laboratory and PI. (C) Once a submission is created, the current details are listed on the upper left side of the page. (D) The step-by-step series of tasks that are being executed by the pipeline can be monitored in real time, and the corresponding output from each module can be viewed. (E) Progress is indicated as the submission moves through each step of automated QC processing. In this example, all that remains to be done is configuring the tracks for the browser, final manual checklist and public release. (F) All of the primary files making up the submission package are listed on this page: the IDF, SDRF, wig and GFF3. Individual files may be replaced, if desired, by the submitting laboratory. (G) A list of active submissions can be displayed separately, providing the user with a snapshot of the vetting status of their submissions.


