Database (Oxford). 2018 Jan 1:2018:bay008.
doi: 10.1093/database/bay008.

Prevention of data duplication for high throughput sequencing repositories

Idan Gabdank et al. Database (Oxford).

Abstract

https://www.encodeproject.org/.

Figures

Figure 1.
The metadata captured for ENCODE fall into the following main object types: donors/strains, biosamples, genetic modifications, sequencing libraries, antibodies, data files and pipelines. Experiment objects (representing replicates of an assay) are constructed from these object types. Each object type represents a category of experimental entities and stores information about entities in that category. For example, the library object represents a sequencing library and includes information such as the nucleic acid type or the fragmentation method used to construct the library. Similarly, the file object (which is distinct from the actual data file) stores information about a data file submitted to the portal, for example the sequencing platform used to produce a FASTQ file or the reference genome assembly used for the alignment that produced a BAM file. Some objects are unique to an experiment (e.g. a sequencing library or a raw data file), while others can be shared between experiment objects (e.g. the donor or the biosample objects). The figure includes both kinds of objects: the library (blue header) and the files (yellow header) are unique and are associated with a single experiment (amber header), while the biosample (green header) and the pipeline (pink header) objects can be shared between multiple experiments. Other experiments that could share the biosample and the pipeline objects are depicted by rectangles with dashed borders. Only a subset of object types is shown in the figure, to give an overview of the breadth and depth of the metadata collected. The full set of metadata can be viewed at https://github.com/ENCODE-DCC/encoded/tree/master/src/encoded/schemas.
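The distinction between per-experiment and shared objects described in this caption can be sketched with a few Python dataclasses. This is only an illustration: the class names, field names and accessions below are hypothetical and are not taken from the ENCODE schemas linked above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Biosample:
    """Shared object: several experiments may point at the same biosample."""
    accession: str


@dataclass
class Library:
    """Per-experiment object describing one sequencing library."""
    accession: str
    nucleic_acid_type: str
    biosample: Biosample


@dataclass
class FileObject:
    """Metadata record about a data file (not the data file itself)."""
    accession: str
    file_format: str
    library: Library


@dataclass
class Experiment:
    accession: str
    files: List[FileObject] = field(default_factory=list)


# Hypothetical accessions: one biosample reused by two experiments,
# each with its own library and file object.
shared_biosample = Biosample("BS-1")
exp_a = Experiment("EXP-A", [FileObject("F-1", "fastq", Library("LIB-1", "DNA", shared_biosample))])
exp_b = Experiment("EXP-B", [FileObject("F-2", "fastq", Library("LIB-2", "RNA", shared_biosample))])
```

Both experiments reference the same `Biosample` instance, while their `Library` and `FileObject` instances are distinct, mirroring the shared and unique object types in the figure.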
Figure 2.
Example of an antibody page for an antibody lot (ENCAB823XVS). The page displays various aspects of the metadata, including the properties used to validate antibody uniqueness: the lot identifier, the product identifier and the source (vendor) name.
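A uniqueness check over the three properties named in this caption might look like the following minimal sketch. The dictionary keys (`source`, `product_id`, `lot_id`, `accession`), the sample records and the case/whitespace normalization are all assumptions made for illustration; the portal's actual validation rules are not described here.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str]


def uniqueness_key(antibody: dict) -> Key:
    """Build a (source, product id, lot id) key.

    Lowercasing and stripping are assumed normalizations, not documented behavior.
    """
    return tuple(
        antibody[name].strip().lower()
        for name in ("source", "product_id", "lot_id")
    )


def find_duplicate_lots(antibodies: Iterable[dict]) -> Dict[Key, List[str]]:
    """Group accessions by uniqueness key and keep keys seen more than once."""
    groups: Dict[Key, List[str]] = {}
    for antibody in antibodies:
        groups.setdefault(uniqueness_key(antibody), []).append(antibody["accession"])
    return {key: accs for key, accs in groups.items() if len(accs) > 1}


# Hypothetical records; the accessions are placeholders, not real ENCODE ones.
lots = [
    {"accession": "AB-1", "source": "VendorX", "product_id": "P100", "lot_id": "L7"},
    {"accession": "AB-2", "source": "vendorx ", "product_id": "p100", "lot_id": "l7"},
    {"accession": "AB-3", "source": "VendorX", "product_id": "P100", "lot_id": "L8"},
]
duplicates = find_duplicate_lots(lots)
```

Here `AB-1` and `AB-2` collapse to the same key, while `AB-3` differs by lot identifier and is treated as a distinct antibody lot.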
Figure 3.
The chart breaks down the total number (2941 as of 10/13/2017) of FASTQ file duplication events detected in our database by detection method. Each successive bar shows the split between the files that could be detected with the methods from the previous bar (shown in green) and those that could only be detected with the additional method (shown in orange). The final 285 FASTQ files were considered duplicates based on manual curation.
Figure 4.
Types of FASTQ file duplication detectable using the FASTQ signature heuristic. Note that none of the cases presented here is detectable using the MD5 hash or content MD5 hash approaches, because for every listed file those functions return a result different from that of the original FASTQ file. File A represents a case where the reads from the original file are out of order; the FASTQ signature heuristic detects duplication of this type. File B is identical to the original file except for a small change in the read names, which makes detection of content duplication challenging. Since FASTQ signatures are constructed using only parts of the read name, the ability of the heuristic to detect the duplication depends on exactly where the read names were modified. File C contains a subset of the reads from the original file and is detected by the FASTQ signature heuristic. File D contains reads not present in the original file; however, it is reported as a potential duplicate because it also contains reads identical to the content of the original FASTQ file. File E is reported as a duplicate of the original FASTQ file, since Read_1 and Read_2 appear in both files. However, File E contains both internal duplication and external duplication of the original FASTQ file, and the internal duplication is not detectable using our current FASTQ signature approach.
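The contrast between whole-file hashing and signature overlap for the File A case (same reads, different order) can be illustrated with a short sketch. The read names and the signature strings are hand-written placeholders, and the overlap rule (flag two files that share at least one signature) is an assumption based on the caption rather than the portal's actual implementation.

```python
import hashlib
from typing import Set


def md5_hex(data: bytes) -> str:
    """MD5 of the raw bytes; any byte-level change yields a different digest."""
    return hashlib.md5(data).hexdigest()


def may_be_duplicates(sigs_a: Set[str], sigs_b: Set[str]) -> bool:
    """Flag two files as potential duplicates when they share a signature."""
    return not sigs_a.isdisjoint(sigs_b)


# Two made-up FASTQ records (header, sequence, separator, qualities).
read_1 = b"@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCACG\nACGT\n+\nIIII\n"
read_2 = b"@SIM:1:FCX:1:15:6330:1046 1:N:0:ATCACG\nTTGA\n+\nIIII\n"

original = read_1 + read_2
file_a = read_2 + read_1  # same reads as the original, reversed order

# Placeholder signature sets, written out by hand for this example.
original_sigs = {"FCX:1:1:ATCACG"}
file_a_sigs = {"FCX:1:1:ATCACG"}
unrelated_sigs = {"FCY:3:1:GGCTAC"}

hashes_differ = md5_hex(original) != md5_hex(file_a)
flagged = may_be_duplicates(original_sigs, file_a_sigs)
```

Reordering the reads changes the MD5 digest, so a hash comparison misses the duplication, while the two files still share a signature and are flagged. A file with a different flowcell, lane and index shares no signature and is not flagged.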
Figure 5.
Representation of FASTQ file content by a FASTQ signature. A FASTQ signature is constructed from the read name parts that are common to multiple reads within a single FASTQ file. The read name parts used for signature construction are color coded in the figure: flowcell identifier (yellow), flowcell lane number (green), read 1 or read 2 (turquoise) and index sequence (grey). This condensation allows multiple reads in a FASTQ file to be represented by a single FASTQ signature, as exemplified in the figure.
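A minimal sketch of this condensation is shown below. It assumes read names in the common Illumina layout `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`; the function names and the `flowcell:lane:read:index` output format are hypothetical, and read names in other layouts would need different parsing.

```python
from typing import Iterable, Set


def read_signature(read_name: str) -> str:
    """Condense one read name into 'flowcell:lane:read:index'.

    Assumes the Illumina-style layout described in the lead-in.
    """
    location, _, description = read_name.lstrip("@").partition(" ")
    loc_fields = location.split(":")
    desc_fields = description.split(":")
    if len(loc_fields) < 4 or len(desc_fields) < 4:
        raise ValueError(f"unrecognized read name: {read_name!r}")
    flowcell, lane = loc_fields[2], loc_fields[3]
    read_number, index = desc_fields[0], desc_fields[3]
    return f"{flowcell}:{lane}:{read_number}:{index}"


def fastq_signatures(lines: Iterable[str]) -> Set[str]:
    """Collect the distinct signatures of a FASTQ file given as text lines.

    Every fourth line, starting with the first, is taken to be a read name.
    """
    return {
        read_signature(line.strip())
        for i, line in enumerate(lines)
        if i % 4 == 0
    }


# Three made-up reads: two from lane 1 and one from lane 2 of the same flowcell.
fastq_lines = [
    "@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCACG", "ACGT", "+", "IIII",
    "@SIM:1:FCX:1:15:6330:1046 1:N:0:ATCACG", "TTGA", "+", "IIII",
    "@SIM:1:FCX:2:16:1000:2000 1:N:0:ATCACG", "GGCC", "+", "IIII",
]
signatures = fastq_signatures(fastq_lines)
# sorted(signatures) -> ['FCX:1:1:ATCACG', 'FCX:2:1:ATCACG']
```

The three reads condense to two signatures because the tile and coordinate fields, which differ for every read, are left out of the signature.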
Figure 6.
Example of our de-duplication mechanism. The BAM file objects ENCFF223NOW and ENCFF774KAG were found to represent the same file. To resolve this file object duplication, ENCFF223NOW was deprecated (status changed to ‘replaced’) and the accession ENCFF223NOW was added to the list of alternate accessions of the file ENCFF774KAG. Searches for ENCFF223NOW are automatically redirected to the file ENCFF774KAG.
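The deprecate-and-redirect step in this caption can be sketched as follows. The two accessions and the ‘replaced’ status come from the caption; the dictionary layout, the `released` status value, the `alternate_accessions` key and both function names are assumptions for illustration only.

```python
from typing import List, Optional


def merge_duplicate(kept: dict, deprecated: dict) -> None:
    """Mark one object as replaced and record its accession on the kept object."""
    deprecated["status"] = "replaced"
    alternates = kept.setdefault("alternate_accessions", [])
    if deprecated["accession"] not in alternates:
        alternates.append(deprecated["accession"])


def resolve(accession: str, file_objects: List[dict]) -> Optional[str]:
    """Return the current accession for a primary or alternate accession."""
    for obj in file_objects:
        if obj["status"] == "replaced":
            continue  # replaced objects are only reachable through their successor
        if accession == obj["accession"] or accession in obj.get("alternate_accessions", []):
            return obj["accession"]
    return None


# Status values other than 'replaced' are placeholders.
files = [
    {"accession": "ENCFF223NOW", "status": "released"},
    {"accession": "ENCFF774KAG", "status": "released"},
]
merge_duplicate(kept=files[1], deprecated=files[0])
redirected = resolve("ENCFF223NOW", files)
```

After the merge, a lookup of ENCFF223NOW resolves to ENCFF774KAG, and a lookup of ENCFF774KAG still resolves to itself.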
