Data Brief. 2022 Dec 14:46:108827.
doi: 10.1016/j.dib.2022.108827. eCollection 2023 Feb.

Comprehensive 100-bp resolution genome-wide epigenomic profiling data for the hg38 human reference genome

Ronnie Y Li et al. Data Brief.

Abstract

This manuscript presents a comprehensive collection of diverse epigenomic profiling data for the human genome at 100-bp resolution with full genome-wide coverage. The datasets were processed from raw read count data for five types of sequencing-based assays collected by the Encyclopedia of DNA Elements consortium (ENCODE, http://www.encodeproject.org): DNase-seq, histone ChIP-seq, TF ChIP-seq, ATAC-seq, and Poly(A) RNA-seq. Data from these high-throughput sequencing assays were processed and consolidated into a total of 6,305 genome-wide profiles. To ensure the quality of the features, we filtered out assays with low read depth, inconsistent read counts, or poor data quality. Processed data were merged by averaging read counts across technical replicates to obtain signals in about 30 million predefined 100-bp bins that tile the entire genome. As an example, we fetch read counts at disease-related risk variants from the GWAS Catalog. We have also created a tabix index that enables fast retrieval of read counts for given coordinates in the human genome. The data processing pipeline can be replicated for users' own purposes and for other experimental assays. The processed data can be found on Zenodo at https://zenodo.org/record/7015783. These data can be used as features for statistical and machine learning models to predict or infer a wide range of variables of biological interest. They can also be applied to generate novel insights into gene expression, chromatin accessibility, and epigenetic modifications across the human genome. Finally, the processing pipeline can be easily applied to data from any other genome-wide profiling assay, expanding the amount of available data.

Keywords: ATAC-seq, assay for transposase-accessible chromatin with sequencing; Bioinformatics; ChIP-seq, chromatin immunoprecipitation followed by sequencing; DNase-seq, DNase I hypersensitive site assay with sequencing; ENCODE; ENCODE, Encyclopedia of DNA Elements; EWAS, epigenome-wide association study; Epigenomics; GWAS, genome-wide association study; Genomics; High-throughput sequencing; TF, transcription factor; gnomAD, Genome Aggregation Database.
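
The binning and replicate averaging described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `bin_for_position` and `average_replicates` are hypothetical, and the convention that bins are 0-based, half-open intervals starting at position 0 of each chromosome is an assumption made here for illustration.

```python
BIN_SIZE = 100  # bin width in bp, as stated in the abstract


def bin_for_position(pos_1based, bin_size=BIN_SIZE):
    """Map a 1-based genomic position to its fixed-width bin.

    Returns (bin_index, start, end), where [start, end) is a 0-based,
    half-open interval. This coordinate convention is an assumption.
    """
    if pos_1based < 1:
        raise ValueError("position must be >= 1")
    bin_index = (pos_1based - 1) // bin_size
    start = bin_index * bin_size
    return bin_index, start, start + bin_size


def average_replicates(replicate_counts):
    """Average per-bin read counts across replicates.

    replicate_counts is a list with one list of per-bin counts per
    replicate; all replicates must cover the same bins.
    """
    if not replicate_counts:
        raise ValueError("need at least one replicate")
    n_bins = len(replicate_counts[0])
    if any(len(rep) != n_bins for rep in replicate_counts):
        raise ValueError("replicates must cover the same bins")
    # zip(*...) groups the counts of each bin across replicates
    return [sum(col) / len(col) for col in zip(*replicate_counts)]


# Toy usage: a variant at position 250 and two replicates over three bins
variant_bin = bin_for_position(250)
merged = average_replicates([[2, 4, 0], [4, 4, 1]])
```

With real data, the bin returned for a variant would serve as the query interval for an indexed lookup of the processed counts, rather than indexing an in-memory list as in this toy example.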

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Fig. 1
Schematic of the data collection process and data format. (a) Raw reads from sequencing-based assays are imported as .bam files. (b) Each .bam file contains many reads covering specific genomic intervals. We calculated the number of reads that overlap each pre-defined 100-bp window and saved these counts as compressed .tsv files. (c) Processed read counts are in tabular format, with rows representing the genomic intervals and columns representing the experimental accession numbers. Each accession number represents the sequencing experiment for a biological target done in a specific cell line.
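
The counting step in panel (b) can be sketched in plain Python as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the function name `count_reads_per_window` is hypothetical, reads are given as 0-based half-open `(start, end)` intervals on a single chromosome, and a read that spans a window boundary is counted once in every window it overlaps.

```python
def count_reads_per_window(reads, region_length, window=100):
    """Count reads overlapping each fixed-width window of a region.

    reads: iterable of (start, end) pairs, 0-based and half-open (assumed).
    region_length: length of the region (e.g. a chromosome) in bp.
    Returns a list with one count per window, in genomic order.
    """
    n_windows = -(-region_length // window)  # ceiling division
    counts = [0] * n_windows
    for start, end in reads:
        # Clip the read to the region and skip empty intervals
        start, end = max(start, 0), min(end, region_length)
        if start >= end:
            continue
        first = start // window
        last = (end - 1) // window  # window holding the read's last base
        for w in range(first, last + 1):
            counts[w] += 1
    return counts


# Toy usage: four reads on a 400-bp region, giving four 100-bp windows
toy_reads = [(10, 60), (90, 130), (250, 300), (299, 301)]
toy_counts = count_reads_per_window(toy_reads, region_length=400)
```

In practice the read intervals would be parsed from the .bam files rather than listed by hand, and the resulting per-window counts would form one column of the table shown in panel (c).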

References

    1. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi: 10.1038/nature11247.
    2. R. Li, Y. Huang, Z.S. Qin, Comprehensive 100-bp resolution genome-wide epigenomic profiling data for the hg38 human reference genome, V1.0, 2022 [dataset]. doi: 10.5281/zenodo.7015783.
    3. Chen L., Jin P., Qin Z.S. DIVAN: accurate identification of non-coding disease-specific risk variants using multi-omics profiles. Genome Biol. 2016;17(1):252. doi: 10.1186/s13059-016-1112-z.
    4. Cao Z., Huang Y., Duan R., Jin P., Qin Z.S., Zhang S. Disease category-specific annotation of variants using an ensemble learning framework. Brief Bioinform. 2022;23(1). doi: 10.1093/bib/bbab438.
    5. Huang Y., Sun X., Jiang H., Yu S., Robins C., Armstrong M.J., Li R., Mei Z., Shi X., Gerasimov E.S., De Jager P.L., Bennett D.A., Wingo A.P., Jin P., Wingo T.S., Qin Z.S. A machine learning approach to brain epigenetic analysis reveals kinases associated with Alzheimer's disease. Nat. Commun. 2021;12(1):4472. doi: 10.1038/s41467-021-24710-8.