Comparative Study
Genome Biol. 2020 May 25;21(1):124.
doi: 10.1186/s13059-020-02038-8.

Personalized and graph genomes reveal missing signal in epigenomic data

Cristian Groza et al. Genome Biol.

Abstract

Background: Epigenomic studies that use next-generation sequencing experiments typically rely on the alignment of reads to a reference sequence. However, because of genetic diversity and the diploid nature of the human genome, we hypothesize that using a generic reference could lead to incorrectly mapped reads and bias downstream results.

Results: We show that accounting for genetic variation using a modified reference genome or a de novo assembled genome can alter histone H3K4me1 and H3K27ac ChIP-seq peak calls, either by creating new personal peaks or by the loss of reference peaks. Using permissive cutoffs, modified reference genomes are found to alter approximately 1% of peak calls, while de novo assembled genomes alter up to 5% of peaks. We also show statistically significant differences in the number of reads observed in regions associated with the new, altered, and unchanged peaks. We report that short insertions and deletions (indels), followed by single nucleotide variants (SNVs), have the highest probability of modifying peak calls. We show that using a graph personalized genome represents a reasonable compromise between modified reference genomes and de novo assembled genomes. We demonstrate that altered peaks have a genomic distribution typical of other peaks.

Conclusions: Analyzing epigenomic datasets with personalized and graph genomes allows the recovery of new peaks enriched for indels and SNVs. These altered peaks are more likely to differ between individuals and, as such, could be relevant in the study of various human phenotypes.

Keywords: ChIP-seq; De novo assembly; Epigenomics; Genome graphs; Modified reference; Personalized genomes; Reference bias.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
a Two instances of reference bias that could be corrected by a personalized genome. One read is mapped to the incorrect location in the reference genome. The other read is unmapped in the reference genome, but becomes mapped in the personalized genome. b Phased personalized genomes can be implemented in several ways. The reference can be patched with called variants to create a pair of modified personal genomes (MPGs). Alternatively, a sequence graph genome could be augmented with an individual’s alleles (GPG). Finally, the entire personal genomic sequence can be assembled de novo (DPG)
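The MPG construction described in panel b (patching the reference with called variants to obtain one sequence per haplotype) can be illustrated with a minimal sketch. The variant tuple layout, the function names `patch_haplotype` and `build_mpg`, and the toy sequence are assumptions made for illustration; this is not the authors' pipeline, and a real workflow would read phased calls from a VCF and handle overlapping records more carefully.

```python
from typing import List, Tuple

# Hypothetical variant record: (0-based position, ref allele, alt allele,
# phased genotype as (haplotype 1, haplotype 2) with 0 = ref and 1 = alt).
Variant = Tuple[int, str, str, Tuple[int, int]]


def patch_haplotype(reference: str, variants: List[Variant], hap: int) -> str:
    """Return the reference with the alt alleles carried by one haplotype applied."""
    pieces = []
    cursor = 0  # next reference position not yet copied
    for pos, ref, alt, genotype in sorted(variants):
        if genotype[hap] == 0:
            continue  # this haplotype keeps the reference allele
        if pos < cursor:
            raise ValueError(f"variant at {pos} overlaps a previous variant")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"ref allele mismatch at {pos}")
        pieces.append(reference[cursor:pos])  # unchanged sequence before the variant
        pieces.append(alt)                    # substituted allele
        cursor = pos + len(ref)               # skip the replaced reference bases
    pieces.append(reference[cursor:])
    return "".join(pieces)


def build_mpg(reference: str, variants: List[Variant]) -> Tuple[str, str]:
    """Build the pair of modified personal sequences, one per haplotype."""
    return (patch_haplotype(reference, variants, 0),
            patch_haplotype(reference, variants, 1))


if __name__ == "__main__":
    toy_reference = "ACGTACGTAC"
    toy_variants = [
        (2, "G", "T", (1, 0)),    # SNV on haplotype 1 only
        (5, "CG", "C", (0, 1)),   # 1 bp deletion on haplotype 2 only
        (8, "A", "ATT", (1, 1)),  # 2 bp insertion on both haplotypes
    ]
    hap1, hap2 = build_mpg(toy_reference, toy_variants)
    print(hap1)
    print(hap2)
```

Because indels shift coordinates, peaks called on such patched sequences would have to be lifted back to reference coordinates before they can be compared with reference peak calls; that step is omitted here.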
Fig. 2
a A comparison of the coverage of H3K4me1 peak called regions in hg19 and the maternal MPG. b Identification of peak called regions that have a significant difference in coverage. c Q value distributions of the same H3K4me1 peaks. d NA12878 MPG estimate of the probability that each combination of variation calls present in a region may cause a personal-only peak call compared to their average widths
Fig. 3
a Proportion of peaks that are called only in personalized MPGs. b Number of peaks with higher coverage in the personalized MPG than in the reference. c Blueprint MPG estimates of the probability that each combination of variation calls present in a region may cause a personal-only peak call compared to their relative average widths. d The probability that a variant affects a peak called on full reads is lower compared to trimmed reads
Fig. 4
a A comparison of the coverage of peak called regions in the reference and the Hap1 DPG. The smear represents ref-only peaks with no coverage in Hap1. b Identification of peak called regions that have a significant difference in coverage. c Summary of the overlap between altered peaks, confident peaks, repeats, and segmental duplications [58]. d The repeats that overlap altered peaks are enriched in Alu elements relative to their frequency in the RepeatMasker. The categories are chosen by grouping repeats by name prefix, summing their frequencies per group, and taking the largest groups. Remaining groups are labeled as “other.” The control regions are random genomic intervals with a width distribution identical to altered peaks
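The category selection described for panel d (group repeats by name prefix, sum frequencies per group, keep the largest groups, label the rest "other") can be sketched as follows. The caption does not state the prefix length or the number of groups kept, so `prefix_len`, `top_n`, the function name `summarize_repeats`, and the toy counts are all assumptions for illustration only.

```python
from collections import Counter
from typing import Dict


def summarize_repeats(repeat_counts: Dict[str, int],
                      prefix_len: int = 3,
                      top_n: int = 2) -> Dict[str, int]:
    """Group repeat names by prefix, keep the top_n largest groups, pool the rest."""
    groups: Counter = Counter()
    for name, count in repeat_counts.items():
        groups[name[:prefix_len]] += count  # sum frequencies per prefix group

    ranked = groups.most_common()           # largest groups first
    summary = dict(ranked[:top_n])
    other = sum(count for _, count in ranked[top_n:])
    if other:
        summary["other"] = other            # everything outside the top groups
    return summary


if __name__ == "__main__":
    # Made-up counts of repeats overlapping a set of peaks (not data from the paper).
    toy_counts = {"AluSx": 40, "AluY": 25, "AluJb": 10,
                  "L1PA3": 12, "L1M5": 8, "MIRb": 6, "MER20": 4}
    print(summarize_repeats(toy_counts))
```

Running the same summary on the control intervals and on the altered peaks would give the two frequency profiles that the panel compares.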
Fig. 5
a A comparison of the coverage of H3K4me1 peak called regions in the reference and the graph genome. Pairwise overlaps between MPG, DPG, and GPG H3K4me1 peak tracks. b Identification of peak called regions that have a significant difference in coverage. c Overlap of all peak calls. d Overlap of altered personal-only peak calls. e Overlap of ref-only peak calls. f Empirical null distributions for the overlap of personal-only peaks between personal genome implementations
Fig. 6
a Comparison of altered peak q values between MPG, GPG, and DPG implementations by rank. The top n peak subset was increased by 5 peak increments. b Distribution of gene relative positions of personal-only peaks among all genomes. Personal-only and common peaks replicated in at least two genomes are also featured. c The pileup of a GPG-only peak projected to the hg19 linear reference. d The true graph rendering of the above AP in the NA12878 GPG and reference genome graph

References

    1. Bourgey M, Dali R, Eveleigh R, Chen KC, Letourneau L, Fillon J, et al. GenPipes: an open-source framework for distributed and scalable genomic analyses. GigaScience. 2019;8(6). doi: 10.1093/gigascience/giz037.
    2. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306(5696):636. Available from: http://science.sciencemag.org/content/306/5696/636.abstract.
    3. The 1000 Genomes Project Consortium, Auton A, Abecasis GR, Altshuler (Co-Chair) DM, Durbin (Co-Chair) RM, Abecasis GR, et al. A global reference for human genetic variation. Nature. 2015;526:68. doi: 10.1038/nature15393.
    4. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60. doi: 10.1093/bioinformatics/btp324.
    5. Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018;36:875. doi: 10.1038/nbt.4227.
