Database (Oxford). 2016 Mar 15:2016:baw001.
doi: 10.1093/database/baw001. Print 2016.

Principles of metadata organization at the ENCODE data coordination center

Eurie L Hong et al. Database (Oxford). 2016.

Abstract

The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center (DCC) is responsible for organizing, describing and providing access to the diverse data generated by the ENCODE project. The description of these data, known as metadata, includes the biological sample used as input, the protocols and assays performed on these samples, the data files generated from the results and the computational methods used to analyze the data. Here, we outline the principles and philosophy used to define the ENCODE metadata in order to create a metadata standard that can be applied to diverse assays and multiple genomic projects. In addition, we present how the data are validated and used by the ENCODE DCC in creating the ENCODE Portal (https://www.encodeproject.org/). Database URL: www.encodeproject.org.

Figures

Figure 1
Major categories of metadata. The metadata captured for ENCODE can be grouped into the following major areas: biosamples and donors/strains (formerly ‘cell types’), libraries, antibodies, data files and pipelines and software. These categories are then grouped into an experiment with replicates. Only a subset of metadata is listed in the figure to provide an overview of the breadth and depth of metadata collected for an assay. The full set of metadata can be viewed at https://github.com/ENCODE-DCC/encoded/tree/master/src/encoded/schemas.
Figure 2
Accessions listed for an experiment on the ENCODE Portal. (A) An experiment page will contain accessions for the experiment referring to the full set of metadata describing how the assay was performed and the data generated by the assay, for the specific antibody lot used in that experiment, for the library that was generated, for the biosample that was used as input to the experiment, and for each data file generated by sequencing. (B) A biosample page will contain accessions for the biosample that was used as well as the unique donor (or strain) that provided the sample.
Figure 3
Schematic of the metadata model. The metadata model reflects how researchers perform laboratory and computational experiments. A single experiment can contain one or more replicates (see text). These replicates generate raw data files, which are then used in software and data processing pipelines to generate processed data files. Control experiments can be modeled similarly to experiments. Files from multiple experiments can be used as input for a single pipeline run.
Figure 4
Categories in the metadata model are linked to each other. Categories of metadata are linked to each other and can be described by relationships between the categories. Each individual category can be referred to multiple times. For example, a liver and a brain can be obtained from the same donor. In addition, a single biosample, like the liver, can be used as input for multiple assays. Because each donor and biosample is accessioned, they can be referred to uniquely.
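
The linking described in this caption can be sketched with plain Python dictionaries. This is only an illustration under stated assumptions: the accessions (`DONOR-1`, `BIOSAMPLE-1`, ...), the field names and the assay labels are hypothetical placeholders, not real ENCODE identifiers or schema properties.

```python
# Hypothetical records keyed by made-up accessions; not real ENCODE entries.
donors = {"DONOR-1": {"species": "human"}}

# Two biosamples (a liver and a brain) refer to the same donor by accession.
biosamples = {
    "BIOSAMPLE-1": {"tissue": "liver", "donor": "DONOR-1"},
    "BIOSAMPLE-2": {"tissue": "brain", "donor": "DONOR-1"},
}

# Two assays use the same liver biosample as input.
experiments = {
    "EXPERIMENT-1": {"assay": "assay A", "biosample": "BIOSAMPLE-1"},
    "EXPERIMENT-2": {"assay": "assay B", "biosample": "BIOSAMPLE-1"},
}


def biosamples_from_donor(donor_accession):
    """Return the accessions of all biosamples that point at one donor."""
    return sorted(
        acc for acc, sample in biosamples.items()
        if sample["donor"] == donor_accession
    )


def experiments_using(biosample_accession):
    """Return the accessions of all experiments that used one biosample."""
    return sorted(
        acc for acc, exp in experiments.items()
        if exp["biosample"] == biosample_accession
    )
```

Because every record is referred to only by its accession, the same donor or biosample can be reused any number of times without being copied.
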
Figure 5
Example of an enumerated list in the schema. The metadata model is represented as a JSON object (this computational structure is the metadata data model) containing properties of specific metadata fields. An enumerated list is a list of allowed values for that property. It prevents typos or multiple spellings of a single item to maintain consistent data. Values added for this property are checked against the list when the data are added.
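
A minimal sketch of how an enumerated list might be enforced when a record is added. The schema fragment, the property name `biosample_type` and its allowed values are assumptions chosen for illustration; they are not copied from the ENCODE schemas.

```python
# Hypothetical schema fragment; property name and allowed values are
# illustrative only.
SCHEMA = {
    "properties": {
        "biosample_type": {"enum": ["tissue", "cell line", "primary cell"]},
    }
}


def check_enums(record, schema=SCHEMA):
    """Return one message per property whose value is not in its enum list."""
    errors = []
    for name, rules in schema["properties"].items():
        allowed = rules.get("enum")
        # Only check properties that have an enum and are present in the record.
        if allowed is not None and name in record and record[name] not in allowed:
            errors.append(f"{name}: {record[name]!r} is not one of {allowed}")
    return errors
```

The comparison is exact, so a variant spelling such as `"Tissue"` is rejected rather than stored alongside `"tissue"`.
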
Figure 6
Metadata validation in the schema. The schema supports dependencies, which define conditions on which set of data should be submitted. In this example, the dependency states that the paired files from paired-end sequencing runs need to be explicitly defined. This prevents paired-end files from being separated from each other as the data are submitted.
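
The dependency idea can be sketched as a small hand-written check, loosely modeled on the property-dependency form found in JSON Schema drafts. The property names `paired_end` and `paired_with` are assumptions made for this example, as is the file accession `FILE-1`.

```python
# Hypothetical schema fragment: if "paired_end" is present on a file record,
# then "paired_with" must be present as well.
SCHEMA = {"dependencies": {"paired_end": ["paired_with"]}}


def check_dependencies(record, schema=SCHEMA):
    """Return one message per required property that is missing."""
    errors = []
    for trigger, required in schema.get("dependencies", {}).items():
        if trigger in record:
            for name in required:
                if name not in record:
                    errors.append(
                        f"{name!r} is required when {trigger!r} is present"
                    )
    return errors
```

A single-end file record passes untouched, while a paired-end record submitted without naming its mate is rejected at submission time.
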
Figure 7
Validation of metadata using audits. The top half of the panel is a screenshot of the metadata-driven facets that can be used for browsing data. The bottom half is a screenshot of a data audit that is visible to data submitters and the DCC. It includes a list of queries that are performed to detect inconsistent or incorrect metadata. These audits ensure that the metadata are accurate before data release.
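
An audit of the kind described in this caption can be thought of as a query run across already-submitted records. The sketch below is an assumption-laden illustration, not the DCC's implementation: the file accessions, the field names and the two audit rules are hypothetical.

```python
# Hypothetical file records keyed by made-up accessions.
files = {
    "FILE-1": {"paired_end": "1", "paired_with": "FILE-2"},
    "FILE-2": {"paired_end": "2", "paired_with": "FILE-1"},
    "FILE-3": {"paired_end": "2", "paired_with": "FILE-9"},  # mate is absent
}


def audit_missing_mate(accession, record, all_files):
    """Flag a file whose declared mate is not among the submitted files."""
    mate = record.get("paired_with")
    if mate is not None and mate not in all_files:
        yield f"{accession}: paired file {mate} not found"


def audit_mate_mismatch(accession, record, all_files):
    """Flag a file whose mate exists but does not point back to it."""
    mate = all_files.get(record.get("paired_with"))
    if mate is not None and mate.get("paired_with") != accession:
        yield f"{accession}: paired file does not refer back to it"


# Each audit is an independent query; new ones can be appended to this list.
AUDITS = [audit_missing_mate, audit_mate_mismatch]


def run_audits(all_files):
    """Run every audit on every record and collect the messages."""
    return [
        message
        for accession, record in sorted(all_files.items())
        for audit in AUDITS
        for message in audit(accession, record, all_files)
    ]
```

Unlike the schema checks, which look at one record at a time, these audits compare records against each other, which is why they run after submission rather than during it.
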

