. 2024 May 16;20(5):e1012095.

doi: 10.1371/journal.pcbi.1012095. eCollection 2024 May.

Interpretable online network dictionary learning for inferring long-range chromatin interactions

Vishal Rana¹, Jianhao Peng¹, Chao Pan¹, Hanbaek Lyu², Albert Cheng³, Minji Kim⁴, Olgica Milenkovic¹

Affiliations

¹ Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America.
² Department of Mathematics, University of Wisconsin - Madison, Madison, Wisconsin, United States of America.
³ School of Biological and Health Systems Engineering, Arizona State University, Phoenix, Arizona, United States of America.
⁴ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America.

PMID: 38753877
PMCID: PMC11135774
DOI: 10.1371/journal.pcbi.1012095

Interpretable online network dictionary learning for inferring long-range chromatin interactions

Vishal Rana et al. PLoS Comput Biol. 2024.

. 2024 May 16;20(5):e1012095.

doi: 10.1371/journal.pcbi.1012095. eCollection 2024 May.

Authors

Vishal Rana¹, Jianhao Peng¹, Chao Pan¹, Hanbaek Lyu², Albert Cheng³, Minji Kim⁴, Olgica Milenkovic¹

Affiliations

¹ Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America.
² Department of Mathematics, University of Wisconsin - Madison, Madison, Wisconsin, United States of America.
³ School of Biological and Health Systems Engineering, Arizona State University, Phoenix, Arizona, United States of America.
⁴ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America.

PMID: 38753877
PMCID: PMC11135774
DOI: 10.1371/journal.pcbi.1012095

Abstract

Dictionary learning (DL), implemented via matrix factorization (MF), is commonly used in computational biology to tackle ubiquitous clustering problems. The method is favored due to its conceptual simplicity and relatively low computational complexity. However, DL algorithms produce results that lack interpretability in terms of real biological data. Additionally, they are not optimized for graph-structured data and hence often fail to handle them in a scalable manner. In order to address these limitations, we propose a novel DL algorithm called online convex network dictionary learning (online cvxNDL). Unlike classical DL algorithms, online cvxNDL is implemented via MF and designed to handle extremely large datasets by virtue of its online nature. Importantly, it enables the interpretation of dictionary elements, which serve as cluster representatives, through convex combinations of real measurements. Moreover, the algorithm can be applied to data with a network structure by incorporating specialized subnetwork sampling techniques. To demonstrate the utility of our approach, we apply cvxNDL on 3D-genome RNAPII ChIA-Drop data with the goal of identifying important long-range interaction patterns (long-range dictionary elements). ChIA-Drop probes higher-order interactions, and produces data in the form of hypergraphs whose nodes represent genomic fragments. The hyperedges represent observed physical contacts. Our hypergraph model analysis has the objective of creating an interpretable dictionary of long-range interaction patterns that accurately represent global chromatin physical contact maps. Through the use of dictionary information, one can also associate the contact maps with RNA transcripts and infer cellular functions. To accomplish the task at hand, we focus on RNAPII-enriched ChIA-Drop data from Drosophila Melanogaster S2 cell lines. Our results offer two key insights. First, we demonstrate that online cvxNDL retains the accuracy of classical DL (MF) methods while simultaneously ensuring unique interpretability and scalability. Second, we identify distinct collections of proximal and distal interaction patterns involving chromatin elements shared by related processes across different chromosomes, as well as patterns unique to specific chromosomes. To associate the dictionary elements with biological properties of the corresponding chromatin regions, we employ Gene Ontology (GO) enrichment analysis and perform multiple RNA coexpression studies.

Copyright: © 2024 Rana et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

PubMed Disclaimer

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1**
(a) Workflow of the dictionary learning method. Multiway (multiplexed) chromatin interactions represented as hyperedges are *clique expanded* into standard networks and combined to create input networks for the algorithm. MCMC subnetwork sampling is then used to generate samples for initialization and online updates during iterative optimization of the objective function, resulting in convex dictionary elements. (b) Illustration of the clique expansion process. Hyperedges are subsets of indexed nodes shaded with the same color. (c) Illustration of clique expansion distortion. There is no hyperedge including nodes 3, 5, and 8 (colored red), and this 3-clique only exists due to shared nodes/edges of “real” hyperedges. Such distortion is negligible when the number of large hyperedges is limited. (d) Subnetwork sampling and the notion of a *motif homomorphism*. These correspond to subnetworks of the input network induced by a fixed number of nodes that contain a template motif topology. The set of homomorphisms $Hom (F, G)$ for a network $G$ and the template network F are defined in the Methods Section (Eq 7). Also depicted are an example homomorphism $\underline{x} \in Hom (F, G)$ and its induced adjacency matrix $A_{\underline{x}}$ for an input network $G$ with 9 nodes. The template F is a star network on 4 nodes. In the adjacency matrix, a black field indicates 1, while a white field indicates 0. (e) Workflow of the MCMC sampling algorithm for path homomorphisms. Given a sample ${\underline{x}}_{t}$ at time t, obtained via a directed random walk from an initial state in the input network, ${\underline{x}}_{t} [1]$ , we generate a sample ${\underline{x}}_{t + 1}$ at time t + 1 by choosing uniformly at random a node v from the neighborhood of ${\underline{x}}_{t} [1]$ (marked in green) and calculating a probability of acceptance β. If node v is accepted, we initiate a new directed random walk from v, otherwise, we restart a directed random walk from ${\underline{x}}_{t} [1]$ to generate a new sample.

**Fig 2**
(a) Organization of a dictionary comprising K dictionary elements that are convex combinations of real representative subnetworks. Each dictionary element itself is a sparse *convex combination* of a set of representatives which are small subnetworks of the input real-world network. In the example, there are 6 options for the representatives, and inclusion of a representative into a dictionary element is indicated by a colored entry in a 6-dimensional indicator column-vector. Each of the 6 representatives corresponds to a subnetwork of the input network with a fixed number of nodes (3 for our example). The dictionary element is generated by a convex combination of the corresponding adjacency matrices of its corresponding representative subnetworks. For the example, the resulting dictionary elements are 9 × 9 matrices. (b) Illustration of the representative region update. When an online data sample is observed, the distance of the sample to each of the current dictionary elements is computed and the sample is assigned to the representative region of the nearest dictionary element. From this expanded set of representatives, one representative is carefully selected for removal to improve the objective. The new dictionary element is then obtained as an optimized convex combination of the updated set of representatives.

**Fig 3. Generation of ChIA-Drop data.**
ChIA-Drop [17] adopts a droplet-based barcode-linked technique to reveal multiway chromatin interactions at a single molecule level. Chromatin samples are crosslinked and fragmented without a proximity ligation step. The samples are enriched for informative fragments through antibody pull-down.

**Fig 4. A dictionary element, represented as a matrix, consists of both proximal and distal interacting genomic regions.**
The elements on the diagonal are not necessarily indexed by adjacent (consecutive) genomic fragments, as explained by the example in the second row. There, off-diagonal long-range interactions in the 9 × 9 matrix are included in a 3 × 3 dictionary element whose diagonal elements are not in consecutive order.

**Fig 5. Dictionary elements for *Drosophila* chromosomes 2L, 2R, 3L and 3R obtained using online cvxNDL.**
Each subplot contains 25 dictionary elements for the corresponding chromosome and each block in the subplots corresponds to one dictionary element. The elements are ordered by their importance score. Note that the “diagonals” in the dictionary elements do not exclusively represent localized topologically associated domains (TADs) as in standard chromatin contact maps; instead, they can also capture long-range interactions. This is due to the fact that the indices of the dictionary element matrices represent genomic regions that may be far apart in the genome. In contrast, standard contact maps have indices that correspond to continuously ordered genomic regions, so that the diagonals truly represent TADs (see Fig 4). The color-code captures the actual locations of the genomic regions involved in the representatives and their dictionary elements. The most interesting dictionary elements are those that contain both dark blue, light blue/green, and red colors (since they involve long-range interactions). This is especially the case for chr3L and chr3R.

**Fig 6. Dictionary elements for *Drosophila* chromosome chr2L generated by (A) NMF, (B) online NDL, (C) CMF and (D) online cvxNDL.**
NMF and CMF are learned off-line, using a total of 20, 000 samples. Note that these algorithms do not scale and cannot work with larger number of samples such as those used in online cvxNDL. The color-coding is performed in the same manner as for the accompanying online cvxNDL results. Columns of the dictionary elements in the second row are color-coded based on the genome locations of the representatives. As biologically meaningful locations can be determined only via convex methods, the top row corresponding to NMF and online NDL results is black-and-white.

**Fig 7. Original adjacency matrix and reconstructed adjacency matrices based on different DL methods, for an example Stochastic Block Model (SBM), including randomly selected dictionary elements.**
Both the x and y axes in the figures index the nodes of the synthetic network generated by the stochastic block model (SBM). The nodes are reorganized to highlight the underlying community structure. For a more quantitative analytical accuracy comparisons, see Table 1.

**Fig 8. The 5 most enriched GO terms for genes covered by dictionary elements from chr3R.**
Column ‘#’ indicates the number of dictionary elements that show enrichment for the given GO term. Also reported are up to 3 dictionary elements with the largest importance score in the dictionary, along with the “density” ρ of interactions in the dictionary element (defined in the Methods section) and median distance d_med of all adjacent pairs of nodes in its representatives.

Fig 9. Empirical cumulative distribution functions (ECDF) of mean pairwise coexpressions of genes covered by random and online cvxNDL dictionary elements ((a) for chr2L, (b) for chr2R, (c) for chr3L and (d) for chr3R).
The results are based on the two-sample Kolmogorov-Smirnov test, and the null hypothesis described in the main text.

Fig 10. Pairwise coexpression of genes covered by (a) the R1-R4 genomic regions, (b) the T1-T4 genomic regions, (c) an online cvxNDL dictionary element, and (d) a randomly constructed dictionary element.
We calculated the mean and standard deviation of absolute pairwise coexpression values, and the mean and standard deviation of coexpression values specifically for all positively correlated gene pairs. The mean coexpression values within TADs and dictionary elements are similar to each other and generally higher than those of randomly constructed dictionary elements. The x and y axis index genes that belong to the respective TAD regions or a specific dictionary element. Note that the plot (b) is of coarser resolution due to the small number of genes covered when compared to the cases (a), (c), (d).

See this image and copyright information in PMC

References

1. Elad M, Aharon M. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image processing. 2006;15(12):3736–3745. doi: 10.1109/TIP.2006.881969 - DOI - PubMed
1. Mairal J, Elad M, Sapiro G. Sparse representation for color image restoration. IEEE Transactions on image processing. 2007;17(1):53–69. doi: 10.1109/TIP.2007.911828 - DOI - PubMed
1. Cichocki A, Lee H, Kim YD, Choi S. Non-negative matrix factorization with α-divergence. Pattern Recognition Letters. 2008;29(9):1433–1440. doi: 10.1016/j.patrec.2008.02.016 - DOI
1. Ye M, Qian Y, Zhou J. Multitask sparse nonnegative matrix factorization for joint spectral–spatial hyperspectral imagery denoising. IEEE Transactions on Geoscience and Remote Sensing. 2014;53(5):2621–2639. doi: 10.1109/TGRS.2014.2363101 - DOI
1. Lu H, Sang X, Zhao Q, Lu J. Community detection algorithm based on nonnegative matrix factorization and pairwise constraints. Physica A: Statistical Mechanics and its Applications. 2020;545:123491. doi: 10.1016/j.physa.2019.123491 - DOI

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

LinkOut - more resources

Full Text Sources
Molecular Biology Databases
- FlyBase

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Interpretable online network dictionary learning for inferring long-range chromatin interactions

Affiliations

Interpretable online network dictionary learning for inferring long-range chromatin interactions

Authors

Affiliations

Abstract

Conflict of interest statement

Figures

References

MeSH terms

Substances

LinkOut - more resources

Full Text Sources

Molecular Biology Databases