2021 May 27;22(1):282.
doi: 10.1186/s12859-021-04149-w.

A tri-tuple coordinate system derived for fast and accurate analysis of the colored de Bruijn graph-based pangenomes

Jindan Guo et al. BMC Bioinformatics.

Abstract

Background: With the rapid development of accurate sequencing and assembly technologies, an increasing number of high-quality chromosome-level and haplotype-resolved genome assemblies have been produced, creating great opportunities for computational pangenomics. Although genome graphs are among the most useful models for pangenome representation, their structural complexity makes it difficult to present genome information as intuitively as a linear reference genome does. Thus, efficiently and accurately analyzing the spatial structure of a genome graph and assigning coordinates to its information remains a substantial challenge.

Results: We developed a new method, a colored superbubble (cSupB), that can overcome the complexity of graphs and organize a set of species- or population-specific haplotype sequences of interest. Based on this model, we propose a tri-tuple coordinate system that combines an offset value, topological structure and sample information. Additionally, cSupB provides a novel method that utilizes complete topological information and efficiently detects small indels (< 50 bp) for highly similar samples, which can be validated by simulated datasets. Moreover, we demonstrated that cSupB can adapt to the complex cycle structure.

Conclusions: Although the solution is made suitable for increasingly complex genome graphs by relaxing the directed-acyclic-graph constraint, the cSupB motif and the cSupB method can be extended to any colored directed acyclic graph. We anticipate that our method will facilitate the analysis of individual haplotype variants and population genomic diversity. We have developed a C++ program implementing our method, available at https://github.com/eggleader/cSupB.

Keywords: Coordinate system; Genome graph; Variant detection.
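
The tri-tuple coordinate named in the Results combines an offset value, a topological structure and sample information. The following is a minimal sketch of how those three parts might be held together in one record. The field names, the `samples_of` helper and the offset value are assumptions made for illustration only; they are not taken from the authors' C++ program. The bubble notation <source, sink, color> follows the Fig. 1 caption, and reading the color string left to right as sample 0, 1, 2 is also an assumption.

```python
from typing import List, NamedTuple


class Bubble(NamedTuple):
    """One cSupB, written <source, sink, color> as in the Fig. 1 caption."""
    source: str  # k-mer label of the entry node
    sink: str    # k-mer label of the exit node
    color: str   # one character per sample; "1" means the sample is present


class Coordinate(NamedTuple):
    """Hypothetical tri-tuple: offset, topology, sample information."""
    offset: int     # position value of the node (illustrative number below)
    bubble: Bubble  # the cSupB that encloses the node
    color: str      # samples whose sequence contains the node


def samples_of(color: str) -> List[int]:
    """Return the indices of the samples flagged in a color string."""
    return [i for i, bit in enumerate(color) if bit == "1"]


# bub2 is taken from the Fig. 1 caption; the offset 5 is made up.
bub2 = Bubble("GGG", "GTA", "110")
coord = Coordinate(offset=5, bubble=bub2, color="100")

print(samples_of(coord.bubble.color))  # samples that pass through the bubble
print(samples_of(coord.color))         # samples that contain this node
```

Keeping the color as a plain string mirrors the notation in the caption; a real implementation would more likely use a bit set.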

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Schematic diagram of the construction, decomposition and reorganization of the colored de Bruijn graph. a The three original sequences. b The colored de Bruijn graph with k equal to 3. Each colored circle represents a node, and the black arrows represent edges. The characters above each node give its label, visit order and color. The number below each node is its theoretical offset value. c The result of visiting against the direction of the edges. The numbers below each node are the final offset value and the preoffset value. d The cSupB tree structure. Five cSupBs are found, in the following order: bub1.<TAT,ACC,111>; bub2.<GGG,GTA,110>; bub3.<CAC,GTA,011>; bub4.<TCA,GGG,110>; and bub5.<TCA,GTA,111>. e The final decomposition and representation of the colored de Bruijn graph.
Fig. 2
Half-visited nodes in two types of cycles. Each circle represents a node, and the red dots and edges form a cycle. In the circle, the number before the bracket indicates the node ID, and the number in the bracket represents the node color. The color of the number indicates the visiting state of the node: blue indicates fully visited, green indicates half-visited, and black indicates unvisited. a Graph composed of two samples, where the cycle belongs to type I. When the traversal stops, two half-visited nodes, 3 and 7, are generated. Among them, 3 is of the first type and 7 is of the second type. b Graph composed of four samples, where the cycle is a type II cycle. When the traversal stops, three half-visited nodes (3, 8, and 13) are generated. Among them, 3 and 8 are of the first type, and 13 is of the second type. c The revisiting result of a. When the traversal is finished in advance, two half-visited nodes (3 and 7) are obtained, and the intersection of the colors of the two nodes is not empty. At this time, pos(1)=1,pos(3)=pos(2)=2,pos(7)=3, and node 3 is selected as cycle_start_node, cycle_start_pos=2. Then, by continuing to visit, we can obtain pos(4)=3,pos(5)=4,pos(6)=pos(7)=5 and know that cycle_end_node is 6, cycle_end_pos=5. Finally, the interval of the cycle is [2, 5]
Fig. 3
Variant identification and variable transformation
Fig. 4
Different gap distributions near the source node
Fig. 5
Precision and recall for different simulation parameters
Fig. 6
Final cutting region and reference cut position
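
The Fig. 1 caption describes a colored de Bruijn graph built from three sequences with k equal to 3, in which each node carries a color such as 111 or 110. The sketch below shows one common way to build such a graph: every k-mer becomes a node, consecutive k-mers in a sequence are joined by an edge, and the color records which input sequences contain the k-mer. The function name `build_colored_dbg` and the three input sequences are hypothetical (the sequences in Fig. 1 are not reproduced here), and treating the leftmost color character as the first sample is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def build_colored_dbg(
    sequences: Sequence[str], k: int = 3
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Build a node-centric colored de Bruijn graph.

    Returns (colors, edges): colors maps each k-mer to a string with one
    character per input sequence, and edges maps each k-mer to the sorted
    list of k-mers that directly follow it in at least one sequence.
    """
    n = len(sequences)
    flags = defaultdict(lambda: [False] * n)
    successors = defaultdict(set)
    for i, seq in enumerate(sequences):
        kmers = [seq[j:j + k] for j in range(len(seq) - k + 1)]
        for kmer in kmers:
            flags[kmer][i] = True  # sequence i contains this k-mer
        for a, b in zip(kmers, kmers[1:]):
            successors[a].add(b)   # consecutive k-mers overlap by k - 1
    colors = {
        kmer: "".join("1" if f else "0" for f in fl)
        for kmer, fl in flags.items()
    }
    edges = {a: sorted(bs) for a, bs in successors.items()}
    return colors, edges


# Hypothetical input; not the sequences shown in Fig. 1.
colors, edges = build_colored_dbg(["ACGTA", "ACGGA", "CGTAC"], k=3)
for kmer in sorted(colors):
    print(kmer, colors[kmer], edges.get(kmer, []))
```

In this toy input the node ACG has color 110 and two successors, CGG and CGT, which is the kind of branch point that a bubble-based decomposition would start from.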
