. 2022 Jun 24;38(Suppl 1):i203-i211.

doi: 10.1093/bioinformatics/btac255.

Markov chains improve the significance computation of overlapping genome annotations

Askar Gafurov¹, Broňa Brejová¹, Paul Medvedev^{2

3

4}

Affiliations

¹ Department of Computer Science, Comenius University, Bratislava 84248, Slovakia.
² Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA.
³ Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA.
⁴ Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA.

PMID: 35758770
PMCID: PMC9235476
DOI: 10.1093/bioinformatics/btac255

Markov chains improve the significance computation of overlapping genome annotations

Askar Gafurov et al. Bioinformatics. 2022.

. 2022 Jun 24;38(Suppl 1):i203-i211.

doi: 10.1093/bioinformatics/btac255.

Authors

Askar Gafurov¹, Broňa Brejová¹, Paul Medvedev^{2

3

4}

Affiliations

¹ Department of Computer Science, Comenius University, Bratislava 84248, Slovakia.
² Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA.
³ Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA.
⁴ Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA.

PMID: 35758770
PMCID: PMC9235476
DOI: 10.1093/bioinformatics/btac255

Abstract

Motivation: Genome annotations are a common way to represent genomic features such as genes, regulatory elements or epigenetic modifications. The amount of overlap between two annotations is often used to ascertain if there is an underlying biological connection between them. In order to distinguish between true biological association and overlap by pure chance, a robust measure of significance is required. One common way to do this is to determine if the number of intervals in the reference annotation that intersect the query annotation is statistically significant. However, currently employed statistical frameworks are often either inefficient or inaccurate when computing P-values on the scale of the whole human genome.

Results: We show that finding the P-values under the typically used 'gold' null hypothesis is NP-hard. This motivates us to reformulate the null hypothesis using Markov chains. To be able to measure the fidelity of our Markovian null hypothesis, we develop a fast direct sampling algorithm to estimate the P-value under the gold null hypothesis. We then present an open-source software tool MCDP that computes the P-values under the Markovian null hypothesis in O(m2+n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations, respectively. Notably, MCDP runtime and memory usage are independent from the genome length, allowing it to outperform previous approaches in runtime and memory usage by orders of magnitude on human genome annotations, while maintaining the same level of accuracy.

Availability and implementation: The software is available at https://github.com/fmfi-compbio/mc-overlaps. All data for reproducibility are available at https://github.com/fmfi-compbio/mc-overlaps-reproducibility.

Supplementary information: Supplementary data are available at Bioinformatics online.

PubMed Disclaimer

Figures

**Fig. 1.**
An example illustrating the reduction in Theorem 1 from an instance of the multiprocessor scheduling problem to an instance of the P-value problem under $H_{0}^{G S}$ . The scheduling instance has $\vec{a} = {2, 2, 2, 2, 4, 7}$ , N = 6, b = 3 and w = 7. The resulting instance of the P-value problem has the length of the genome as L = 22, the reference interval set as $R = {[6, 8), [14, 16)}$ , and the multiset of query interval lengths as $lens (Q) = {1, 1, 1, 1, 3, 6}$ . The figure shows R and an annotation $Q'$ with $lens (Q') = lens (Q)$ and no overlap with the reference (i.e. $K (R, Q') = 0$ ). The annotation $Q'$ corresponds to partitioning $\vec{a}$ as ${{4, 2}, {7}, {2, 2, 2}}$ , which is a valid solution to the scheduling problem instance

**Fig. 2.**
P-values for the real datasets from Sarmashghi and Bafna (2019), shown on a negative log scale. On all datasets except EC, the sampled $H_{0}^{G S}$ is not shown because every one of the 10 000 samples had less than K(R, Q) overlaps; the *rule of three* (Burns, 1983) implies that in this case, with 10 000 samples, the true P-value can only be said to be less than $P < 3 \times 10^{- 4}$ with 95% probability

**Fig. 3.**
Probability mass functions for the overlap statistics in the four datasets from Sarmashghi and Bafna (2019). The vertical lines show the critical values for significance level 0.05. Note that the x-axis is zoomed in around the main probability mass. The critical value is the smallest value of k for which $Pr [K (R, Q) \geq k] \leq 0.05$ . Supplementary Table S3 contains the raw critical values

**Fig. 4.**
Replication of Zarrei *et al.* (2015; Fig. 3b) analysis, where we use copy number losses as a reference and each of the columns as a separate query. The columns correspond to various subset of exons from RefSeq human genes; the exact descriptions are in Supplementary Table S4. The two panels show P-values for enrichment and depletion of copy number losses in the corresponding gene column. The blue dots represent the P-values produced by MCDP, the green boxes represent the P-value range reported by Zarrei *et al.* (2015). We clip the y-axis at the top, and the points and ranges at those boundaries actually have more extreme values. The orange dashed lines (lower) correspond to significance level 0.05, and the red dashed lines (upper) correspond to Bonferroni-adjusted significance level (i.e. $\frac{0.05}{13}$ , since there are 13 experiments in this case). Note that MCDP can report the P-value for both enrichment and depletion because it computes the probability mass function for all values of K(R, Q) (A color version of this figure appears in the online version of this article)

See this image and copyright information in PMC

References

1. Bartel D.P. (2009). MicroRNAs: target recognition and regulatory functions. Cell, 136(2), 215–233. - PMC - PubMed
1. Burns J. (1983). If nothing goes wrong, is everything all right? Why we should be wary of zero numerators. J. Am. Med. Assoc., 249(13), 1743–1745. - PubMed
1. Chikina M.D., Troyanskaya O.G. (2012). An effective statistical evaluation of ChipSeq dataset similarity. Bioinformatics, 28(5), 607–613. - PMC - PubMed
1. Coarfa C. et al. (2014). Analysis of interactions between the epigenome and structural mutability of the genome using Genboree workbench tools. BMC Bioinformatics, 15(Suppl 7), 1–12. - PMC - PubMed
1. Devroye L. (1986). Non-Uniform Random Variate Generation. Springer, New York.

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

LinkOut - more resources

Full Text Sources
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Markov chains improve the significance computation of overlapping genome annotations

Affiliations

Markov chains improve the significance computation of overlapping genome annotations

Authors

Affiliations

Abstract

Figures

References

Publication types

MeSH terms

Substances

LinkOut - more resources

Full Text Sources

Miscellaneous