BMC Bioinformatics. 2019 Jan 3;20(1):1.

doi: 10.1186/s12859-018-2565-8.

DBS: a fast and informative segmentation algorithm for DNA copy number analysis

Jun Ruan¹, Zhen Liu¹, Ming Sun¹, Yue Wang², Junqiu Yue³, Guoqiang Yu⁴

Affiliations

¹ School of Information Engineering, Wuhan University of Technology, Wuhan, Hubei, 430070, China.
² Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, 22203, USA.
³ Department of Pathology, Hubei Cancer Hospital, Wuhan, Hubei, 430079, China.
⁴ Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, 22203, USA. yug@vt.edu.

PMID: 30606105
PMCID: PMC6318921
DOI: 10.1186/s12859-018-2565-8

Jun Ruan et al. BMC Bioinformatics. 2019.

Abstract

Background: Genome-wide DNA copy number changes are hallmark events in the initiation and progression of cancers. Quantitative analysis of somatic copy number alterations (CNAs) has broad applications in cancer research. With the increasing capacity of high-throughput sequencing technologies, fast and efficient segmentation algorithms are required to characterize high-density CNA data.

Results: A fast and informative segmentation algorithm, DBS (Deviation Binary Segmentation), is developed and discussed. DBS is based on the least absolute error principle and is inspired by the segmentation method rooted in the circular binary segmentation procedure. It uses point-by-point model calculation to ensure segmentation accuracy and combines a binary search algorithm with heuristics derived from the Central Limit Theorem. DBS is efficient, with a computational complexity of O(n log n), and is faster than its predecessors. Moreover, DBS measures the change-point amplitude, that is, the difference between the mean values of the two adjacent segments at a breakpoint; the significant degree of this amplitude is determined by the weighted average deviation at the breakpoint. Using the resulting binary tree of significant degrees, DBS indicates whether a segmentation result is over- or under-segmented.

Conclusion: DBS is implemented in ToolSeg, a platform-independent, open-source Java application that includes a graphical user interface, simulation data generation, and various other segmentation methods implemented in native Java.
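The general idea of recursive binary segmentation driven by the change-point amplitude can be illustrated with a minimal Python sketch. This is not the DBS algorithm or the ToolSeg code: it scans every candidate position exhaustively instead of using the binary search and Central Limit Theorem heuristics, and it picks the split position with a standard CUSUM-style weighted mean difference that is swapped in here as an assumption. The names `best_split`, `segment`, `threshold`, and `min_len` are hypothetical.

```python
import math
from itertools import accumulate


def best_split(values, lo, hi, min_len=2):
    """Find the most plausible breakpoint inside values[lo:hi].

    Returns (index, amplitude), where amplitude is the absolute difference
    between the means of the left and right parts. The position is chosen
    by a CUSUM-style score (an assumption, not the DBS statistic).
    Returns (None, 0.0) if no candidate split exists.
    """
    n = hi - lo
    # prefix[k] holds the sum of the first k values of the segment
    prefix = [0.0] + list(accumulate(values[lo:hi]))
    best_idx, best_amp, best_score = None, 0.0, 0.0
    for k in range(min_len, n - min_len + 1):
        left_mean = prefix[k] / k
        right_mean = (prefix[n] - prefix[k]) / (n - k)
        amp = abs(left_mean - right_mean)
        # weight the amplitude so that very short flanks are not favoured
        score = amp * math.sqrt(k * (n - k) / n)
        if score > best_score:
            best_idx, best_amp, best_score = lo + k, amp, score
    return best_idx, best_amp


def segment(values, threshold, min_len=2):
    """Recursively split `values`; keep breakpoints whose amplitude exceeds
    `threshold` (a hypothetical, user-chosen cut-off)."""
    breakpoints = []
    stack = [(0, len(values))]  # explicit stack instead of recursion
    while stack:
        lo, hi = stack.pop()
        idx, amp = best_split(values, lo, hi, min_len)
        if idx is not None and amp > threshold:
            breakpoints.append(idx)
            stack.append((lo, idx))
            stack.append((idx, hi))
    return sorted(breakpoints)


# A noiseless toy profile with one gained region in the middle
signal = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
print(segment(signal, threshold=0.3))
```

In this sketch a higher `threshold` yields fewer, larger-amplitude breakpoints, which loosely mirrors how a threshold on the significant degree separates over-segmented from under-segmented results.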

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

**Fig. 1**
Segmentation process and binary tree of ℤ_p in DBS. (a) An assumed segmentation process with two breakpoints. Row [0] is the initial sequence to be split. Row [1] shows that the first breakpoint is found at locus b₁; Row [2] is analogous. (b) The corresponding binary tree of ℤ_p generated by (a). Here the identifier of every node (Node ID) is also the Segment ID

**Fig. 2**
Segmentation process with simulation data in DBS. (a) The segmentation process, carried out by splitting multiple times. Notably, DBS uses a recursive algorithm: after Nodes 1, 3, 4, 5, and 7 were found one by one, Node 11 and the other nodes in the right part were discovered. The red lines over the gray data points are the segmentation curves; they are the results of segmentation and indicate the range and average of each sub-segment. (b) The corresponding binary tree of ℤ_p generated by panel (a). The red dotted line marks the position of the estimated standard deviation $\hat{\sigma}$, and the red solid line marks the position of the threshold $\hat{\sigma}'$ for the significant degree ℤ_p of the breakpoints

**Fig. 3**
Segmentation process with an actual data sample in DBS (using half copy numbers). (a) The segmentation process in the binary tree of ℤ_p. (b) The copy number of an actual sample, with the position and ℤ_p of the 12 true breakpoints, which correspond to the yellow nodes in panel (a). In (b), the observed copy number signals are the ratios of the measured intensities of the tumor-normal matched sample

**Fig. 4**
ROC curves of five segmentation methods. The curves show the sensitivity and specificity for a sequence of thresholds, calculated by comparing aberration calls to the classifications made in an MLPA analysis on the test dataset. (a) and (b) show that the classification accuracy is not affected much over a wide range of λ and γ; γ is equal to 0.02 in (a), and λ is equal to 0.02 in (b). (c) The effect of different combinations of window sizes. Curve W1 is the result using window sizes generated by an arithmetic progression with a common difference of 1. Curves W2, W4 and W8 correspond to window sizes from geometric sequences with common ratios of 2, 4 and 8, respectively. λ and γ are set to the default value (0.02). (d) Calls based on the segmentations found by DNAcopy v1.52.0 (CBS), copynumber v1.18 (PCF), the method in BACOM, and DBS with raw data

**Fig. 5**
Time complexity of the four algorithms. The solid lines represent conventional linear regression models fitted to the data points of the same color. The x-axis is the logarithm of the length of the test samples (sequences), and the y-axis is the logarithm of the computation time
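A log-log regression of the kind described in the Fig. 5 caption can be sketched as follows: the slope of log(time) against log(length) estimates the empirical growth exponent. The function name `loglog_slope` and the timings below are hypothetical; the timings are synthetic values generated from n·log₂(n), not measurements from the paper.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares fit of log10(time) against log10(size).

    Returns (slope, intercept). A slope near 1 suggests roughly linear or
    n*log(n) growth; a slope near 2 suggests quadratic growth.
    """
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic timings (arbitrary units), assumed for illustration only
sizes = [10**3, 10**4, 10**5, 10**6]
nlogn_times = [n * math.log2(n) for n in sizes]
quadratic_times = [n**2 for n in sizes]

print(loglog_slope(sizes, nlogn_times)[0])      # slightly above 1
print(loglog_slope(sizes, quadratic_times)[0])  # close to 2
```

Because the log factor grows slowly, an O(n log n) method shows a fitted slope only slightly above 1 over a finite size range, so such a plot mainly distinguishes near-linear from quadratic behaviour.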

