Characterization of structural variants with single molecule and hybrid sequencing approaches

Anna Ritz¹, Ali Bashir², Suzanne Sindi¹, David Hsu¹, Iman Hajirasouliha¹, Benjamin J Raphael²

Affiliations

¹ Department of Computer Science, Brown University, RI; Department of Genetics and Genomic Sciences, Icahn School of Medicine, Mount Sinai, NY; Institute for Genomics and Multiscale Biology, Icahn School of Medicine, Mount Sinai, NY; School of Natural Sciences, University of California, Merced, CA; Pacific Biosciences, Menlo Park, CA; Center for Computational Molecular Biology, Brown University, RI.
² Department of Computer Science, Brown University, RI; Department of Genetics and Genomic Sciences, Icahn School of Medicine, Mount Sinai, NY; Institute for Genomics and Multiscale Biology, Icahn School of Medicine, Mount Sinai, NY; School of Natural Sciences, University of California, Merced, CA; Pacific Biosciences, Menlo Park, CA; Center for Computational Molecular Biology, Brown University, RI.

PMID: 25355789
PMCID: PMC4253835
DOI: 10.1093/bioinformatics/btu714

Anna Ritz et al. Bioinformatics. 2014.

Bioinformatics. 2014 Dec 15;30(24):3458-66.

doi: 10.1093/bioinformatics/btu714. Epub 2014 Oct 28.

Abstract

Motivation: Structural variation is common in human and cancer genomes. High-throughput DNA sequencing has enabled genome-scale surveys of structural variation. However, the short reads produced by these technologies limit the study of complex variants, particularly those involving repetitive regions. Recent 'third-generation' sequencing technologies provide single-molecule templates and longer sequencing reads, but at the cost of higher per-nucleotide error rates.

Results: We present MultiBreak-SV, an algorithm to detect structural variants (SVs) from single molecule sequencing data, paired read sequencing data, or a combination of sequencing data from different platforms. We demonstrate that combining low-coverage third-generation data from Pacific Biosciences (PacBio) with high-coverage paired read data is advantageous on simulated chromosomes. We apply MultiBreak-SV to PacBio data from four human fosmids and show that it detects known SVs with high sensitivity and specificity. Finally, we perform a whole-genome analysis on PacBio data from a complete hydatidiform mole cell line and predict 1002 high-probability SVs, over half of which are confirmed by an Illumina-based assembly.

Figures

**Fig. 1.**
Overview of MultiBreak-SV. (1) Five long reads are sequenced from an individual genome. (2) The reads are aligned to the reference genome, producing seven distinct multi-breakpoint-mappings. When clustered, the multi-breakpoint-mappings indicate four novel adjacencies ($D_{1}, D_{2}, D_{3}, D_{4}$). (3a) The quality of the read alignments (e.g. the edit distance) is noted for each multi-breakpoint-mapping. (3b) The set of all possible novel adjacencies $\{D_{1}, D_{2}, D_{3}, D_{4}\}$ is represented as a cluster diagram $G$, where the nodes are novel adjacencies and the directed edges represent overlapping novel adjacencies. (4) The cluster diagram and alignment qualities are input to MultiBreak-SV. (5a) MultiBreak-SV assigns probabilities to each multi-breakpoint-mapping. (5b) From these mappings, the probability of each novel adjacency is computed. A solution to the Multi-Read Mapping Problem is a selection of at most one alignment for each multi-read and a selection of at most one novel adjacency for each connected component in $G$ (bold).
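
The two selection constraints stated at the end of the Fig. 1 caption can be illustrated with a minimal Python sketch. This is not the MultiBreak-SV implementation: the read names, mapping identifiers, the example edge between `D1` and `D2`, and the helper names `connected_components` and `is_feasible` are all hypothetical, and edge direction is ignored when finding connected components (an assumption made for simplicity).

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Label each node with a representative of its connected component.

    Edge direction is ignored here (an assumption for this sketch).
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    component = {}
    for start in nodes:
        if start in component:
            continue
        component[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component[nxt] = start
                    stack.append(nxt)
    return component

def is_feasible(selected, mappings, nodes, edges):
    """Check the two constraints from the caption for a candidate selection.

    selected: set of mapping ids chosen as the solution.
    mappings: dict of mapping id -> (read name, set of adjacencies supported).
    """
    # Constraint 1: at most one selected alignment per read.
    reads = [mappings[m][0] for m in selected]
    if len(reads) != len(set(reads)):
        return False
    # Constraint 2: at most one selected adjacency per connected component.
    component = connected_components(nodes, edges)
    chosen = set()
    for m in selected:
        chosen |= set(mappings[m][1])
    labels = [component[d] for d in chosen]
    return len(labels) == len(set(labels))

# Hypothetical toy input: four adjacencies, D1 and D2 overlap.
nodes = ["D1", "D2", "D3", "D4"]
edges = [("D1", "D2")]
mappings = {
    "m1": ("read1", {"D1"}),
    "m2": ("read1", {"D2"}),
    "m3": ("read2", {"D2", "D3"}),
    "m4": ("read3", {"D4"}),
}

print(is_feasible({"m2", "m3", "m4"}, mappings, nodes, edges))
print(is_feasible({"m1", "m3"}, mappings, nodes, edges))
```

In this toy input the first selection satisfies both constraints, while the second does not because `D1` and `D2` lie in the same connected component. The sketch only checks feasibility; it does not assign the probabilities described in steps (5a) and (5b).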

**Fig. 2.**
(Left) ROC curve of the variant calling accuracy and (Right) precision-recall curve of the mapping accuracy for the Venter simulation. For both plots, solid lines are MultiBreak-SV predictions (denoted MBSV), the dotted line is an algorithm designed for multi-breakpoint reads (Ritz *et al.*, 2010), and dashed lines are algorithms designed for paired-end reads: Hydra (Quinlan *et al.*, 2010), GASV (Sindi *et al.*, 2009), GASVPro (Sindi *et al.*, 2012), VariationHunter (VH) (Hormozdiari *et al.*, 2009), and Delly (Rausch *et al.*, 2012).
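
As a reminder of what the two curves in Fig. 2 plot, the following minimal Python sketch computes one ROC point and one precision-recall point from thresholded scores. The scores, item names, and truth set are invented for illustration and are not data from the paper; the function names are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen for this sketch.

```python
def confusion_counts(scores, truth, threshold):
    """Count TP, FP, FN, TN when calling every item with score >= threshold."""
    tp = fp = fn = tn = 0
    for item, score in scores.items():
        predicted = score >= threshold
        actual = item in truth
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    # True items that were never scored count as missed calls.
    fn += len(set(truth) - set(scores))
    return tp, fp, fn, tn

def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the ratio is undefined."""
    return numerator / denominator if denominator else 0.0

def roc_point(scores, truth, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp, fp, fn, tn = confusion_counts(scores, truth, threshold)
    return ratio(fp, fp + tn), ratio(tp, tp + fn)

def pr_point(scores, truth, threshold):
    """Return (precision, recall) at one threshold."""
    tp, fp, fn, tn = confusion_counts(scores, truth, threshold)
    return ratio(tp, tp + fp), ratio(tp, tp + fn)

# Hypothetical toy scores; A and C are the true items.
scores = {"A": 0.9, "B": 0.8, "C": 0.4, "D": 0.1}
truth = {"A", "C"}

for threshold in (0.5, 0.3):
    print(threshold, roc_point(scores, truth, threshold), pr_point(scores, truth, threshold))
```

Sweeping the threshold over all observed scores and connecting the resulting points traces out each curve.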

**Fig. 3.**
Distribution of CHM1TERT novel adjacencies predicted by MultiBreak-SV. (Left) Novel adjacency probabilities supported by at least one multi-breakpoint-mapping. Horizontal colored bands show the distribution of novel adjacencies by SV type. (Right) Novel adjacency probabilities supported by at least five multi-breakpoint-mappings

**Fig. 4.**
Examples of CHM1TERT novel adjacencies predicted by MultiBreak-SV. (A) Example of a high-probability translocation (P = 1.0 for k = 5) and an inversion (P = 0.999 for k = 1). Multi-breakpoint-mapping probabilities from MultiBreak-SV are shown next to each multi-breakpoint-mapping. (B) Confirmed deletion in the Illumina assembly. (C) Proposed deletion in the Illumina assembly.

References

1. 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073.
2. Abyzov A, Gerstein M. Age: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision. Bioinformatics. 2011;27:595.
3. Alkan C, et al. Genome structural variation discovery and genotyping. Nat. Rev. Genet. 2011;12:363–376.
4. Antonacci F, et al. Characterization of six human disease-associated inversion polymorphisms. Hum. Mol. Genet. 2009;18:2555–2566.
5. Brown C. Single molecule strand sequencing using protein nanopores and scalable electronic devices. 2012 AGBT Conference.
