BMC Bioinformatics. 2018 Dec 14;19(1):481.
doi: 10.1186/s12859-018-2438-1.

Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis

Diana Domanska et al. BMC Bioinformatics.

Abstract

Background: The current versions of reference genome assemblies still contain gaps represented by stretches of Ns. Because high-throughput sequencing reads cannot be mapped to those gap regions, the regions are depleted of experimental data. Moreover, several technology platforms assay only a targeted portion of the genomic sequence, meaning that regions from the unassayed portion cannot be detected in those experiments. Here we refer to all such regions as inaccessible regions, and hypothesize that ignoring these regions in the null model may increase false findings in statistical testing of colocalization of genomic features.
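As an illustration of what "gaps represented by stretches of Ns" means in practice, the following minimal sketch locates runs of N in a sequence string and reports them as half-open intervals. The function name `find_gap_regions`, the `min_length` parameter, and the toy sequence are assumptions made for this example; they are not taken from the paper.

```python
import re


def find_gap_regions(sequence, min_length=1):
    """Return half-open (start, end) intervals covering runs of N.

    Hypothetical helper for illustration: both upper- and lower-case N
    are treated as gap characters.
    """
    return [
        (match.start(), match.end())
        for match in re.finditer(r"[Nn]+", sequence)
        if match.end() - match.start() >= min_length
    ]


# Toy sequence (an assumption, not real genome data).
toy_sequence = "ACGTNNNNNACGTNNAC"
gaps = find_gap_regions(toy_sequence)

# Fraction of the toy sequence that is inaccessible to read mapping.
gap_fraction = sum(end - start for start, end in gaps) / len(toy_sequence)
```

For a real assembly, the same intervals are usually available as a ready-made gap annotation track, so scanning the sequence is only needed when no such track exists.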

Results: Our exploratory analyses confirm that the genomic regions in public genomic tracks intersect very little with the assembly gaps of the human reference genomes (hg19 and hg38). What little intersection there was occurred only at the beginning and end portions of the gap regions. Further, we simulated a set of synthetic tracks by matching the properties of real genomic tracks in a way that nullified any true association between them. This allowed us to test our hypothesis that failing to avoid inaccessible regions (as represented by assembly gaps) in the null model would result in spurious inflation of statistical significance. We contrasted the distributions of test statistics and p-values of Monte Carlo-based permutation tests that either avoided or did not avoid assembly gaps in the null model when testing colocalization between a pair of tracks. We observed that the statistical tests that did not account for assembly gaps in the null model resulted in a distribution of the test statistic that is shifted to the right and a distribution of p-values that is shifted to the left (indicating inflated significance). We observed a similar level of inflated significance in hg19 and hg38, despite assembly gaps covering a smaller proportion of the latter reference genome.
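The contrast between the two null models can be sketched with a toy Monte Carlo permutation test. This is not the authors' implementation: the base-pair overlap statistic, the uniform re-placement of segments, the function names (`to_mask`, `shuffle_segments`, `colocalization_test`), and all parameter defaults are assumptions chosen to keep the example short. Passing `gaps=None` gives the null model that ignores inaccessible regions; passing a list of gap intervals gives the one that avoids them.

```python
import numpy as np


def to_mask(segments, genome_length):
    """Boolean per-base mask for a list of half-open (start, end) segments."""
    mask = np.zeros(genome_length, dtype=bool)
    for start, end in segments:
        mask[start:end] = True
    return mask


def shuffle_segments(segments, genome_length, rng, gap_mask=None, max_tries=1000):
    """Re-place each segment uniformly at random, preserving its length.

    If gap_mask is given, positions touching a gap are redrawn.
    Shuffled segments may overlap each other (a simplification).
    """
    placed = []
    for start, end in segments:
        length = end - start
        for _ in range(max_tries):
            new_start = int(rng.integers(0, genome_length - length + 1))
            if gap_mask is None or not gap_mask[new_start:new_start + length].any():
                placed.append((new_start, new_start + length))
                break
        else:
            raise RuntimeError("could not place a segment outside the gaps")
    return placed


def colocalization_test(track_a, track_b, genome_length, gaps=None, n_perm=200, seed=0):
    """Return (observed overlap in bp, mean null overlap, Monte Carlo p-value)."""
    rng = np.random.default_rng(seed)
    mask_a = to_mask(track_a, genome_length)
    gap_mask = to_mask(gaps, genome_length) if gaps else None
    observed = int((mask_a & to_mask(track_b, genome_length)).sum())
    null = np.array([
        (mask_a & to_mask(
            shuffle_segments(track_b, genome_length, rng, gap_mask), genome_length
        )).sum()
        for _ in range(n_perm)
    ])
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (1 + int((null >= observed).sum())) / (1 + n_perm)
    return observed, float(null.mean()), p_value


# Toy data (assumed): a 1000 bp genome with one gap; both tracks avoid the gap.
toy_gaps = [(400, 600)]
toy_a = [(10, 60), (700, 760)]
toy_b = [(40, 90), (800, 850)]
ignoring_gaps = colocalization_test(toy_a, toy_b, 1000, gaps=None)
avoiding_gaps = colocalization_test(toy_a, toy_b, 1000, gaps=toy_gaps)
```

In this toy model, shuffling over the whole genome spreads the same segments across more positions than are actually accessible, so the expected overlap under the null is lower and an observed overlap looks more extreme than it should. That is the direction of bias the Results describe.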

Conclusion: We provide empirical evidence demonstrating that inaccessible regions, even when covering only a few percent of the genome, can lead to a substantial number of false findings if not accounted for in statistical colocalization analysis.

Keywords: Assembly gaps; BED format; Co-occurrence analysis; Colocalization analysis; Genomic overlap analysis; Reference genome; Region set enrichment analysis; Statistical genome analysis.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Figures

Fig. 1
Distribution of the test statistic and p-values of colocalization analysis for a collection of 477 genomic tracks of histone modifications (hg19) with an average segment length of 2113.82 bp. [a and b show the distribution of p-values of the colocalization analysis with (left) and without (right) exclusion of assembly gap regions under the null model. c and d show the observed test statistic and the average test statistic of the same tracks with (left) and without (right) exclusion of assembly gap regions under the null model. Note: Both values are higher than 1 because the computations were performed relative to the whole genome size.]
Fig. 2
Relation between the p-values of colocalization analysis for a collection of N genomic tracks and the number of elements within each track (hg19): a, histone modifications (N=477); b, TFBS in K562 (N=568).
Fig. 3
Schematic showing the study design. [To demonstrate the bias that arises when assembly gaps are not accounted for in statistical testing, we used two null model definitions that differed only in whether or not they avoided assembly gaps. For the pairwise colocalization analysis, we deliberately used a combination of real and synthetic track pairs to nullify any true biological association between them. The synthetic tracks were generated in such a way that they mimic the real tracks in terms of the genomic distributional properties shown in the ellipse. The distributions of p-values, observed colocalization measures, and average colocalization measures under the null models were examined to see whether there is a bias.]
