Genomics Proteomics Bioinformatics. 2022 Feb;20(1):205-218.
doi: 10.1016/j.gpb.2021.03.007. Epub 2021 Jul 3.

Mako: A Graph-based Pattern Growth Approach to Detect Complex Structural Variants

Jiadong Lin et al. Genomics Proteomics Bioinformatics. 2022 Feb.

Abstract

Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are considered as the simultaneous occurrence of simple structural variants. However, detecting the compounded mutational signals of CSVs is challenging through a commonly used model-match strategy. As a result, there has been limited progress for CSV discovery compared with simple structural variants. Here, we systematically analyzed the multi-breakpoint connection feature of CSVs, and proposed Mako, utilizing a bottom-up guided model-free strategy, to detect CSVs from paired-end short-read sequencing. Specifically, we implemented a graph-based pattern growth approach, where the graph depicts potential breakpoint connections, and pattern growth enables CSV detection without pre-defined models. Comprehensive evaluations on both simulated and real datasets revealed that Mako outperformed other algorithms. Notably, validation rates of CSVs on real data based on experimental and computational validations as well as manual inspections are around 70%, where the medians of experimental and computational breakpoint shift are 13 bp and 26 bp, respectively. Moreover, the Mako CSV subgraph effectively characterized the breakpoint connections of a CSV event and uncovered a total of 15 CSV types, including two novel types of adjacent segment swap and tandem dispersed duplication. Further analysis of these CSVs also revealed the impact of sequence homology on the formation of CSVs. Mako is publicly available at https://github.com/xjtu-omics/Mako.

Keywords: Complex structural variant; Formation mechanism; Graph mining; Next-generation sequencing; Pattern growth.
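
The graph idea described in the abstract can be illustrated with a minimal sketch. This is not Mako's implementation: it reduces pattern growth to repeatedly appending any node that shares an edge with the current subgraph. The function names `build_signal_graph` and `grow_subgraph` are hypothetical, and the example edges are assumptions. The Figure 2 caption states that nodes A, B, C, and D form the subgraph and that F has no edge, but it does not state which pairs of nodes each edge joins.

```python
from collections import defaultdict


def build_signal_graph(edges):
    """Build an undirected adjacency map from (node_a, node_b, edge_type) triples."""
    graph = defaultdict(dict)
    for node_a, node_b, edge_type in edges:
        graph[node_a][node_b] = edge_type
        graph[node_b][node_a] = edge_type
    return graph


def grow_subgraph(graph, seed):
    """Grow a subgraph from `seed` until no connected node is left to append."""
    seen = {seed}
    frontier = [seed]
    while frontier:
        node = frontier.pop()
        # .get avoids creating entries for nodes that have no edges at all
        for neighbor in graph.get(node, {}):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return sorted(seen)


# Assumed edge layout, for illustration only: two Del edges and one Inv edge.
edges = [("A", "B", "Del"), ("B", "C", "Inv"), ("C", "D", "Del")]
graph = build_signal_graph(edges)

subgraph = grow_subgraph(graph, "A")   # F is never reached: it has no edge
isolated = grow_subgraph(graph, "F")   # a node without edges stays alone
```

Growing from any of A, B, C, or D yields the same four-node subgraph in this simplified model, which is why no pre-defined SV model is needed to group the signals.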

Conflict of interest statement

Competing Interests: The authors have declared no competing interests.

Figures

Figure 1
Explanation of simple and complex SV alignment models derived from abnormal read-pairs. A. Three common simple SVs and their corresponding abnormal read-pair alignments on the reference genome. B. The alignment signature of two CSVs. Each involves two types of signatures that can be matched by a simple SV alignment model. SV, structural variant; CSV, complex structural variant; Ref, reference; Dup, duplication; Inv, inversion; Del, deletion.
Figure 2
Overview of Mako. Mako first builds a signal graph by collecting abnormally aligned reads as nodes, with edge connections provided by paired-end alignment and split alignment. Afterward, Mako uses the pattern growth approach to find a maximal subgraph as a potential CSV site. In the example output, the maximal subgraph G contains nodes A, B, C, and D, whereas F cannot be appended because no edge connects it to the subgraph (dashed line). The CSV is derived from this subgraph with estimated breakpoints and CXS, where the discovered CSV subgraph contains four different nodes, one Eae edge of type Inv, and two Epe edges of type Del. CXS, complexity score.
Figure 3
Performance comparison on simulated CSVs with different match criteria. A. The sensitivity of detecting breakpoints of heterozygous CSVs. B. The sensitivity of detecting breakpoints of homozygous CSVs. C. Evaluation of reported heterozygous CSV simulation. D. Evaluation of reported homozygous CSV simulation. E. Evaluation of randomized heterozygous CSV simulation. F. Evaluation of randomized homozygous CSV simulation. The performances of selected tools for detecting simulated CSVs are evaluated according to the all-breakpoint match (A and B) and unique-interval match (C–F) criteria. In C–F, the performance is evaluated by recall (y-axis), precision (x-axis), and F1-score (dotted lines). The top right corner of the plot indicates better performance. The labels c5–c30 indicate coverage, e.g., c5 indicates 5× coverage.
Figure 4
Overview of performance of Mako, GRIDSS, SVelter, and TARDIS on NA19240 and SKBR3. A. Venn diagram of callsets detected from NA19240 by four selected tools. B. Venn diagram of callsets detected from SKBR3 by four selected tools as well as MergedSet. The Venn diagrams are created by 50% reciprocal overlap via a publicly available tool Intervene with "--bedtools-options" enabled. The MergedSet is obtained from the original publication. C. The percentages of completely and uniquely discovered CSVs from the NA19240 and SKBR3 data, respectively. The results of Mako are shown according to different CXS thresholds.
Figure 5
Two representative CSV subgraphs identified by Mako. A. and B. Top: IGV views of the two representative CSV events. The alignments are grouped by read-pair orientation. Bottom: subgraph structures discovered by Mako. The colored circles and solid lines are nodes and edges in the subgraph. C. The alignment model of two deletions with an inverted spacer. D. The alignment model of deletion associated with dispersed duplication. In (C) and (D), short arrows are paired-end reads that span breakpoint junctions, and their alignments are shown on the Ref genome with the corresponding ID in the circle. Note that a single ID may have more than one corresponding abnormal alignment type on the Ref genome. IGV, Integrative Genomics Viewer.
Figure 6
Overview of Mako’s CSV discoveries from three healthy samples and proposed CSV formation mechanisms. A. Summary of discovered CSV types. These types are reconstructed by PacBio HiFi reads, where a type with fewer than 10 events is summarized as RareType. B. Diagrams of two novel and rare CSV types discovered by Mako. In particular, Mako finds three Tantrans events and only one TanDisdup event. C.–E. Different replication diagrams explaining the impact of homology pattern for MMBIR-produced CSVs. In these diagrams, sequence abc has been replicated before the replication fork collapse (flash symbol). The single-strand DNA at the DNA DSB starts searching for homology sequence (purple and green triangles) to repair. The aforementioned procedure is explicitly explained as a replication graph, where nodes are homology sequences and edges keep track of TS (dotted arrow lines) as well as the normal replication at different strands (red lines). If there are two red lines between two nodes, the sequence between these two nodes will be replicated twice, as shown in (D). InsDup, insertion associated with duplication; Disdup, dispersed duplication; Invdup, inverted duplication; DelInvdup, deletion associated with inverted duplication; InsInvdup, insertion associated with inverted duplication; DelDisdup, deletion associated with dispersed duplication; DelInv, deletion associated with inversion; Tantrans, adjacent segment swap; TanDisdup, tandem dispersed duplication; MMBIR, microhomology-mediated break-induced replication; DSB, double-strand break; TS, template switch.
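
The evaluation terms used in the captions of Figures 3 and 4 (recall, precision, F1-score, and 50% reciprocal overlap) follow standard definitions, sketched below. The helper names, the half-open coordinate convention, and the example numbers are assumptions for illustration. They are not taken from the paper's evaluation scripts or from Intervene.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end, min_fraction=0.5):
    """Return True if the overlap covers at least `min_fraction` of BOTH intervals.

    Intervals are assumed half-open, [start, end), with end > start.
    """
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False
    return (overlap >= min_fraction * (a_end - a_start)
            and overlap >= min_fraction * (b_end - b_start))


def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and their harmonic mean (F1-score)."""
    called = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / called if called else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two 100 bp calls sharing 50 bp meet the 50% criterion on both sides.
same_event = reciprocal_overlap(100, 200, 150, 250)
# A 100 bp call sharing 40 bp with a 240 bp call does not.
different_event = reciprocal_overlap(100, 200, 160, 400)

# Made-up counts: 6 correct calls, 2 spurious calls, 6 missed events.
precision, recall, f1 = precision_recall_f1(6, 2, 6)
```

Requiring the overlap fraction on both intervals keeps a small call nested inside a much larger one from counting as a match, which a one-sided overlap test would allow.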
