[Preprint]. 2023 Nov 10:2023.11.08.566275.
doi: 10.1101/2023.11.08.566275.

Improving Hi-C contact matrices using genome graphs

Yihang Shen et al. bioRxiv.

Abstract

Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our understanding of 3D chromosome structures. The first step of the Hi-C analysis pipeline maps Hi-C sequencing reads to a linear reference genome. However, a linear reference genome does not incorporate genetic variation, which can lead to incorrect read alignments, especially for samples that differ substantially from the reference, such as cancer samples. Using a genome graph as the reference enables more accurate read mapping. However, new algorithms are required to infer linear genomes from Hi-C reads mapped to genome graphs and to construct the corresponding Hi-C contact matrices, a step that is a prerequisite for subsequent Hi-C analysis such as identifying topologically associated domains and calling chromatin loops. We introduce the problem of genome sequence inference from Hi-C data mediated by genome graphs. We formalize this problem, show that it is hard to solve, and introduce a novel heuristic algorithm tailored to it. We provide a theoretical analysis of the algorithm's efficacy. Finally, our experiments indicate that linear genomes inferred by our method yield improved Hi-C contact matrices. These matrices show fewer erroneous patterns caused by structural variations and capture the structures of topologically associated domains more accurately.
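To make the role of the contact matrix concrete, the following is a minimal sketch of how read-pair positions on a linear genome can be binned into a contact matrix, and of how a large deletion changes the coordinates that the matrix is built on. It is an illustration under stated assumptions, not the algorithm of the preprint: the function names `lift_position` and `build_contact_matrix`, the representation of deletions as half-open `(start, end)` intervals, and the toy coordinates are all hypothetical.

```python
def lift_position(pos, deletions):
    """Map a reference coordinate onto a linear genome lacking the given deletions.

    `deletions` is a list of non-overlapping half-open (start, end) intervals
    in reference coordinates (a hypothetical representation for this sketch).
    Returns None if `pos` falls inside a deleted interval.
    """
    shift = 0
    for start, end in sorted(deletions):
        if pos >= end:
            shift += end - start   # the whole deletion lies before pos
        elif pos >= start:
            return None            # pos is inside the deleted segment
        else:
            break                  # remaining deletions lie after pos
    return pos - shift


def build_contact_matrix(read_pairs, region_start, region_end, bin_size):
    """Count read pairs per pair of fixed-size bins in [region_start, region_end).

    `read_pairs` is an iterable of (pos_a, pos_b) coordinates, one per read pair.
    Returns a symmetric matrix as a list of lists of counts.
    """
    n_bins = (region_end - region_start + bin_size - 1) // bin_size
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        in_region = (region_start <= pos_a < region_end
                     and region_start <= pos_b < region_end)
        if not in_region:
            continue
        i = (pos_a - region_start) // bin_size
        j = (pos_b - region_start) // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1      # keep the matrix symmetric
    return matrix


# Toy example: the sample carries a deletion of reference interval [10, 20).
deletions = [(10, 20)]
reference_pairs = [(5, 25), (5, 8), (25, 35)]

# Re-express each pair in the coordinates of the genome without the deletion,
# dropping pairs that touch the deleted segment.
lifted_pairs = []
for a, b in reference_pairs:
    la, lb = lift_position(a, deletions), lift_position(b, deletions)
    if la is not None and lb is not None:
        lifted_pairs.append((la, lb))

reference_matrix = build_contact_matrix(reference_pairs, 0, 40, 10)
inferred_matrix = build_contact_matrix(lifted_pairs, 0, 30, 10)
```

In the reference coordinates the pair `(5, 25)` appears as a contact between bins that are two apart, while in the coordinates without the deletion the same pair joins adjacent bins. This is the kind of distance distortion that a matrix built on an unsuitable linear reference can show around a structural variant.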

Figures

Figure S1: Four components of our graph pruning algorithm. Each red line represents one end of a read pair, and each blue line represents the other end.
Figure S2: An example of converting G (a) to G (b) in the proof of Theorem 1.
Figure S3: An example of generating a DAG (b) from a cubic graph instance (a) in the proof of Theorem 5.
Figure S4: A worst-case example of the heuristic algorithm.
Figure S5: The region between 77,000,000 bp and 102,000,000 bp in chromosome 13. The genome graph shows a large deletion (s2) with an approximate length of 9,000,000 bp.
Figure S6: The region between 87,000,000 bp and 112,000,000 bp in chromosome 13. The genome graph shows a large deletion (s2) with an approximate length of 15,000,000 bp.
Figure S7: The region between 0 bp and 23,000,000 bp in chromosome 18. The genome graph shows a large deletion (s2) with an approximate length of 20,000,000 bp. The empty stripe is the centromere region.
Figure S8: TADs in the same region as Figure 2. TADs are called by Armatus with hyperparameter γ_Armatus = 0.5.
Figure S9: TADs in the same region as Figure S5. TADs are called by Armatus with hyperparameter γ_Armatus = 0.5.
Figure S10: TADs in the same region as Figure 3. TADs are called by Armatus with hyperparameter γ_Armatus = 0.5.
Figure S11: Three regions of contact matrices generated by the longest M-weighted path algorithm, corresponding to the regions shown in Figure 3, Figure 2, and Figure S5.
Figure 1: The workflow of our graph-based Hi-C processing pipeline. Each red line represents one end of a read pair, and each blue line represents the other end.
Figure 2: The region between 155,000,000 bp and 170,000,000 bp in chromosome 4. The genome graph shows a large deletion (s2) with an approximate length of 3,000,000 bp. Besides this large deletion, the genome graph also contains numerous other structural variations within this region. These are omitted from the plot for clarity.
Figure 3: The region between 57,000,000 bp and 62,000,000 bp in chromosome 3. The genome graph shows two deletions (s2 and s4) with approximate lengths of 300,000 bp and 150,000 bp, respectively.
Figure 4: (a), (b) CTCF peak signals around TAD boundaries from the linear reference genome (a) and the inferred linear genome (b). (c), (d) SMC3 peak signals around TAD boundaries from the linear reference genome (c) and the inferred linear genome (d). TADs are called by Armatus with hyperparameter γ = 0.5.
