Bioinformatics. 2020 Jul 1;36(13):3966-3974.
doi: 10.1093/bioinformatics/btaa288.

HiC-Hiker: a probabilistic model to determine contig orientation in chromosome-length scaffolds with Hi-C

Ryo Nakabayashi et al. Bioinformatics. 2020.

Abstract

Motivation: De novo assembly of reference-quality genomes used to require enormously laborious tasks. In particular, building genome markers for ordering assembled contigs along chromosomes is extremely time-consuming; thus, such markers are only available for well-established model organisms. To resolve this issue, recent studies demonstrated that Hi-C could be a powerful and cost-effective means to output chromosome-length scaffolds for non-model species with no genome marker resources, because the Hi-C contact frequency between two loci can be a good estimator of their genomic distance, even if there is a large gap between them. Indeed, state-of-the-art methods such as 3D-DNA are now widely used for locating contigs in chromosomes. However, it remains challenging to reduce errors in contig orientation because shorter contigs have fewer contacts with their neighboring contigs. These orientation errors lower the accuracy of gene prediction, read alignment, and synteny block estimation in comparative genomics.

Results: To reduce these contig orientation errors, we propose a new algorithm, named HiC-Hiker, which has a firm grounding in probabilistic theory, rigorously models Hi-C contacts across contigs, and effectively infers the most probable orientations via the Viterbi algorithm. We compared HiC-Hiker and 3D-DNA using human and worm genome contigs generated from short reads, evaluated their performances, and observed a remarkable reduction in the contig orientation error rate from 4.3% (3D-DNA) to 1.7% (HiC-Hiker). Our algorithm can consider long-range information between distal contigs and precisely estimates Hi-C read contact probabilities among contigs, which may also be useful for determining the ordering of contigs.

Availability and implementation: HiC-Hiker is freely available at: https://github.com/ryought/hic_hiker.

Figures

Fig. 1.
(1) A typical workflow of de novo assembly using 3D-DNA and HiC-Hiker. HiC-Hiker is used as a post-processing step of the current 3D-DNA pipeline to correct local orientation errors. (2) (a), (b) and (c) show schematic dot plots (reference on the x-axis; three scaffolds on the y-axis) in the assembling procedure of an example scaffold composed of six mock contigs. The red and blue lines in the dot plots represent forward- and reverse-complement alignments of scaffolds, respectively. The arrays of arrows on both axes represent the ordering and orientation of contigs in the reference and scaffolds. The leftmost dot plot (a) illustrates 3D-DNA chromosome-length scaffolds with two major errors: the large inversion at the top right that involves contigs B1–B3, and the misoriented short contig labeled A2. The large inversion can be corrected using Juicebox Assembly Tools (JBAT), since such large errors are typically apparent in plots of the Hi-C contact frequency matrix (b). In contrast, the minor error is often difficult to detect in the contact matrix, so we propose the use of HiC-Hiker to fix such small misorientations; (c) shows revised scaffolds
Fig. 2.
(a) A schematic representation of a scaffold and a Hi-C contact in our formalization. (b) There are four cases for orienting two contigs, i.e. the ith and jth contigs; in each case, the distance between the contact points of r ∈ R_{i,j} is fixed depending on θ_i and θ_j, as illustrated by the lengths of the red dotted lines
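
The four orientation cases in panel (b) can be illustrated with a minimal Python sketch. The coordinate convention is an assumption made here for illustration and is not taken from the paper: the ith contig lies to the left of the jth contig, each contact offset is measured from the start of its contig's forward strand, and `gap` is the total length of whatever lies between the two contigs. The function names are hypothetical.

```python
def oriented_offset(offset, length, orientation):
    """Position of a contact point measured from the left end of its contig
    as placed in the scaffold. Reversing the contig ('-') mirrors the offset."""
    return offset if orientation == "+" else length - offset


def contact_distance(x, y, len_i, len_j, gap, theta_i, theta_j):
    """Scaffold distance between a contact point at offset x on contig i and
    offset y on contig j, for orientations theta_i and theta_j ('+' or '-').

    Assumes contig i precedes contig j, with `gap` bases between them.
    """
    pos_i = oriented_offset(x, len_i, theta_i)
    pos_j = oriented_offset(y, len_j, theta_j)
    # Remaining part of contig i, then the gap, then the leading part of contig j.
    return (len_i - pos_i) + gap + pos_j


if __name__ == "__main__":
    # One contact, four orientation cases: the implied distance differs in each.
    for ti in "+-":
        for tj in "+-":
            print(ti, tj, contact_distance(90, 10, 100, 200, 50, ti, tj))
```

Under these assumptions the same read pair implies a different distance in each of the four cases, which is what allows a distance-dependent contact probability to discriminate between orientations.
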
Fig. 3.
A sketch of the hidden Markov model proposed in this study, for the case where k = 3 and n = 5. The hidden states and all transitions are shown on the right, and the corresponding observations are shown on the left. For example, (θ1,θ2,θ3) = (+,−,−) is chosen because its corresponding observation is the most probable, with the minimum sum of distances between pairs of contacts; therefore, its emission probability is the highest; that is, the product of probabilities of Hi-C contact pairs is the highest among the eight contig orientation patterns. For the other two hidden states, suppose that (θ2,θ3,θ4) = (−,−,+) and (θ3,θ4,θ5) = (−,+,−) are selected. The most likely path in this model, which is shown in red, represents the most probable orientations of the contigs (+,−,−,+,−)
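
The windowed hidden Markov model described in this caption can be sketched in Python as follows. This is a minimal illustration, not the HiC-Hiker implementation: each hidden state is a tuple of k orientations, consecutive states must agree on their k − 1 overlapping contigs, all allowed transitions are assumed equally likely, and the emission score is supplied by the caller through a hypothetical `log_emission(t, state)` function. The toy scorer below simply rewards the three windows named in the caption and stands in for a real product of Hi-C contact probabilities. ASCII '-' is used for the reverse orientation.

```python
from itertools import product


def viterbi_orientations(n, k, log_emission):
    """Most probable orientations of n contigs under a window-k HMM.

    log_emission(t, state) returns the log-probability of the contacts
    observed in window t (0-based) given `state`, a tuple of k '+'/'-'.
    Requires n >= k.
    """
    states = list(product("+-", repeat=k))
    score = {s: log_emission(0, s) for s in states}
    backpointers = []
    for t in range(1, n - k + 1):
        new_score, pointers = {}, {}
        for s in states:
            # A predecessor must share its last k-1 orientations with s.
            preds = [(o,) + s[:-1] for o in "+-"]
            best_prev = max(preds, key=lambda p: score[p])
            new_score[s] = score[best_prev] + log_emission(t, s)
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    # First window contributes k orientations, each later window one more.
    return list(path[0]) + [s[-1] for s in path[1:]]


# Toy emission scores (hypothetical): favor the windows named in the caption.
FAVORED = {0: ("+", "-", "-"), 1: ("-", "-", "+"), 2: ("-", "+", "-")}


def toy_log_emission(t, state):
    return 0.0 if state == FAVORED[t] else -5.0


if __name__ == "__main__":
    print(viterbi_orientations(5, 3, toy_log_emission))
```

With k = 3 there are eight states per window, so the dynamic program examines two predecessors for each of eight states at every step, rather than enumerating all 2^n orientation assignments.
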
Fig. 4.
Comparison of contact probability distributions among the longest contig (blue), the second longest contig (yellow) and the third longest contig (green). The smoothed distribution calculated from the longest contig is shown by the red line, which is actually used as P(d). This distribution appears to represent a good approximation of the top three distributions. P(d) was set to a constant probability when d > 75 kb, shown as a gray-colored region
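
The constant tail of P(d) beyond 75 kb can be sketched as a thin wrapper around any smoothed distribution. The caption does not state which constant is used, so this sketch assumes the value of the smoothed curve at the cutoff; the function name and the power-law stand-in for the smoothed curve are both hypothetical.

```python
def make_contact_prob(smoothed, cutoff=75_000):
    """Wrap a smoothed contact distribution so it is constant beyond `cutoff`.

    `smoothed(d)` is any callable giving the estimated contact probability at
    distance d (in bp). Assumption: the tail takes the value at the cutoff.
    """
    tail = smoothed(cutoff)

    def contact_prob(d):
        return smoothed(d) if d <= cutoff else tail

    return contact_prob


if __name__ == "__main__":
    # Stand-in for the smoothed curve: a simple 1/d decay (illustrative only).
    p = make_contact_prob(lambda d: 1.0 / max(d, 1))
    print(p(1_000), p(75_000), p(500_000))
```
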
Fig. 5.
Local error rates of the human scaffolds generated by 3D-DNA and refined by HiC-Hiker. The error rate was remarkably reduced from 4.3% (3D-DNA) to 1.7% (HiC-Hiker)
Fig. 6.
Four dot plots of scaffolds along human chromosome 8. The top row shows dot plots where the reference genome is on the x-axis and scaffolds output by 3D-DNA (left) and HiC-Hiker (right) are on the y-axis. The red-colored dots indicate correct orientations of contigs (forward alignment with the reference) while the blue-colored dots show erroneous orientations (reverse-complement alignments). The dot plots show that the reference is mostly covered by contigs. The total length of contigs in the scaffolds is 123 011 808 bp, which is close to 145 138 636 bp, i.e. the length of chromosome 8 in hg38. In the bottom-left portions of both upper dot plots, we see large reverse-complement alignments of the reference and scaffolds. In the lower dot plots, we enlarged parts of the upper two plots to show the reference genomic region from 65 to 74 Mb. The six orientation errors of short contigs, shown as blue dots in the lower left plot, are corrected in the HiC-Hiker scaffold shown in the lower right plot
Fig. 7.
Plot of local error rates of contigs according to their lengths. Overall, shorter contigs are difficult to orient due to insufficient Hi-C contacts, but this plot indicates that HiC-Hiker outperformed 3D-DNA in repairing the orientations not only of shorter contigs, but also of longer contigs
Fig. 8.
Relative frequency matrices M_{i,j} for four typical cases. The identifiers of individual contigs, i and j, are shown beside the vertical and horizontal axes, respectively. The red-blue color of each cell shows the relative probability of its orientation. The schematic figure at the top illustrates that a red-colored cell (M_{i,j} = 1) means that contacts between the ith and jth contigs determine the correct orientation of the ith contig, while a blue-colored cell (M_{i,j} = 0) shows that contacts between the two contigs disagree with the correct orientation of the ith contig and erroneously support the wrong orientation. The top left matrix (a) shows the ideal situation, in which each contig is long enough to have sufficient contacts with its neighbors to allow its orientation to be determined before using HiC-Hiker. Since almost all of the contigs in this region are longer than the threshold of K = 75 kb in the probabilistic model, their orientations could be determined based only on their adjacent contigs; contacts with distant contigs were not required. In the top right (b), the row labeled 806 shows a case where the orientation of contig 806 cannot be correctly determined from its contacts with contig 805, as depicted by the blue-colored cell of 805 and 806. HiC-Hiker corrected the misoriented contig 806 by considering its contacts with long contigs 804 and 807. In the bottom left (c), the row labeled 1248 illustrates a situation where the misoriented contig 1248 is difficult to fix if only its neighboring contig 1249 is taken into account, but can be fixed using 1247 and 1250, all of which are small contigs (<50 kb). In the bottom right (d), the row labeled 1338 shows a case in which HiC-Hiker fails to correct the misoriented contig 1338 labeled with ‘x’, which has short contigs on its right side. The relative probability for contig 1338 is close to 1/2 (row 1338), indicating insufficient information regarding its orientation; it is difficult to determine its orientation based on neighboring contigs

