Review
Genomics Proteomics Bioinformatics. 2024 Oct 15;22(4):qzae048.
doi: 10.1093/gpbjnl/qzae048.

The Bioinformatic Applications of Hi-C and Linked Reads

Libo Jiang et al. Genomics Proteomics Bioinformatics.

Abstract

Long-range sequencing grants insight into additional genetic information beyond what can be accessed by both short reads and modern long-read technology. Several new sequencing technologies, such as "Hi-C" and "Linked Reads", produce long-range datasets for high-throughput and high-resolution genome analyses, which are rapidly advancing the field of genome assembly, genome scaffolding, and more comprehensive variant identification. In this review, we focused on five major long-range sequencing technologies: high-throughput chromosome conformation capture (Hi-C), 10X Genomics Linked Reads, haplotagging, transposase enzyme linked long-read sequencing (TELL-seq), and single-tube long fragment read (stLFR). We detailed the mechanisms and data products of the five platforms and their important applications, evaluated the quality of sequencing data from different platforms, and discussed the currently available bioinformatics tools. This work will benefit the selection of appropriate long-range technology for specific biological studies.

Keywords: Genome assembly; Hi-C; Linked Reads; Long-range NGS reads; Quality assessment.

Conflict of interest statement

The authors have declared no competing interests.

Figures

Figure 1
Workflow of library preparation for five long-range platforms. A. Hi-C. B. 10X Genomics Linked Reads. C. Haplotagging, TELL-seq, and stLFR. Hi-C, high-throughput chromosome conformation capture; TELL-seq, transposase enzyme linked long-read sequencing; stLFR, single-tube long fragment read; GEM, gel bead in emulsion; HMW-DNA, high molecular weight DNA.
Figure 2
Hi-C maps for three human datasets. A. Arima V2 NA12878-CEU. B. Arima V2 NA24385-AJ. C. Arima V1 NA12878-CEU. Each square block of the map represents an individual human chromosome, and darker regions indicate higher contact density. These three datasets are all from female samples, as there are hardly any contact interactions in chromosome Y.
Figure 3
Characteristics of Hi-C reads. A. Link-separation distance distribution: the distance in the linear genome between two reads that are coupled together by the Hi-C protocol, grouped into bins of 100 bp and expressed as a percentage frequency. The Arima V2 Oak dataset is included; it demonstrates the breakdown of the power-law relationship of Equation 1, highlighting the desirable features present in the human datasets. The peaks that appear in all three human datasets at an LSD of 1.085 × 10^7 are of unknown origin, though they are suspected to be artefacts of the alignment methods used. B. ICI rate: the percentage of paired reads mapped to different chromosomes. The existence of inter-chromosomal pairs is not a desired feature for our purposes. In our quality control experiments, we set a threshold of 30%, above which a dataset is marked as a failure. C. Base coverage distribution. Although all datasets are covered to approximately the same depth (30×), they show very different distributions around this value; a non-Hi-C dataset (Illumina) is included for comparison. LSD, link-separation distance; ICI, inter-chromosomal interaction.
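The two read-pair statistics described in panels A and B can be illustrated with a minimal sketch. It assumes each mapped pair is already available as a `(chrom1, pos1, chrom2, pos2)` tuple; that layout, the function names, and the `passes_ici_qc` helper are illustrative assumptions, not the authors' pipeline. Only the 100 bp bin size and the 30% threshold come from the caption.

```python
from collections import Counter

# Threshold from the caption: datasets with an ICI rate above 30% fail QC.
ICI_FAIL_THRESHOLD = 30.0


def ici_rate(pairs):
    """Percentage of read pairs whose mates map to different chromosomes.

    `pairs` holds (chrom1, pos1, chrom2, pos2) tuples (assumed layout).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no read pairs supplied")
    inter = sum(1 for c1, _, c2, _ in pairs if c1 != c2)
    return 100.0 * inter / len(pairs)


def passes_ici_qc(pairs, threshold=ICI_FAIL_THRESHOLD):
    """Hypothetical helper: True unless the ICI rate exceeds the threshold."""
    return ici_rate(pairs) <= threshold


def lsd_histogram(pairs, bin_size=100):
    """Percentage frequency of link-separation distances per bin.

    Only intra-chromosomal pairs have a linear separation, so pairs on
    different chromosomes are skipped. Keys are bin start positions in bp.
    """
    counts = Counter(
        abs(p2 - p1) // bin_size * bin_size
        for c1, p1, c2, p2 in pairs
        if c1 == c2
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {start: 100.0 * n / total for start, n in sorted(counts.items())}


example_pairs = [
    ("chr1", 100, "chr1", 350),
    ("chr1", 1000, "chr2", 50),
    ("chr2", 500, "chr2", 520),
    ("chr3", 10, "chr3", 260),
]
print(ici_rate(example_pairs))
print(lsd_histogram(example_pairs))
```

In practice the tuples would be parsed from an alignment file, and the histogram would be plotted on log-log axes to inspect the power-law behaviour the caption mentions.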
Figure 4
Length distributions for various 10X and haplotagging datasets. Reads are grouped into fragments by barcodes, with shared barcodes identified and removed. The length of a fragment is the region covered by mapping coordinates from the Linked Reads which share the same barcode.
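The fragment-length definition above can be sketched as follows. The input layout, `(barcode, chrom, start, end)` tuples with half-open coordinates, and the function name are assumptions made for illustration. The sketch groups reads per barcode and chromosome and does not attempt the shared-barcode identification and removal step mentioned in the caption.

```python
def fragment_lengths(reads):
    """Return {(barcode, chrom): length in bp} for Linked Reads.

    `reads` holds (barcode, chrom, start, end) tuples (assumed layout).
    The length is the span from the smallest start to the largest end
    among reads that share a barcode on the same chromosome.
    """
    spans = {}
    for barcode, chrom, start, end in reads:
        key = (barcode, chrom)
        if key in spans:
            lo, hi = spans[key]
            spans[key] = (min(lo, start), max(hi, end))
        else:
            spans[key] = (start, end)
    return {key: hi - lo for key, (lo, hi) in spans.items()}


example_reads = [
    ("AAAC", "chr1", 1000, 1150),
    ("AAAC", "chr1", 41000, 41150),
    ("GGTT", "chr1", 500, 650),
]
print(fragment_lengths(example_reads))
```

A barcode that tags more than one molecule would inflate the span computed here, which is presumably why the authors remove shared barcodes before measuring lengths.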
Figure 5
Base coverage profiles for various 10X and haplotagging datasets. The 10X and haplotagging datasets downsampled to ∼30× are used to remove the effects of differing coverage depths. In terms of coverage evenness, the rat and oak datasets are not as smooth as the other samples.
Figure 6
Hi-C maps on contigs and scaffolded assemblies. A. Contig fragmentation is clearly observed in the Hi-C map. B. Assembly with Arima V1 data. A chromosome-level assembly takes shape, while some small contigs can still be observed in the lower-right corner. C. Assembly with Arima V2 data. A much improved chromosome-level assembly is observed.
