Multiscale DNA partitioning: statistical evidence for segments
- PMID: 24753487
- DOI: 10.1093/bioinformatics/btu180
Abstract
Motivation: DNA segmentation, i.e. the partitioning of DNA into compositionally homogeneous segments, is a basic task in bioinformatics. Different algorithms have been proposed for various partitioning criteria such as Guanine/Cytosine (GC) content, local ancestry in population genetics or copy number variation. A critical component of any such method is the choice of an appropriate number of segments. Some methods use model selection criteria and do not provide suitable error control. Other methods, which are based on simulating a statistic under a null model, provide suitable error control only if the correct null model is chosen.
Results: Here, we focus on partitioning with respect to GC content and propose a new approach that provides statistical error control: as in statistical hypothesis testing, it guarantees with a user-specified probability [Formula: see text] that the number of identified segments does not exceed the number of actually present segments. The method is based on a statistical multiscale criterion, which makes it a segmentation method that searches for segments of any length (on all scales) simultaneously. It is also accurate in localizing segments: under benchmark scenarios, our approach leads to a segmentation that is more accurate than the approaches discussed in the comparative review of Elhaik et al. In our real data examples, we find segments that often correspond well to features taken from standard University of California at Santa Cruz (UCSC) genome annotation tracks.
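To make the idea of a multiscale criterion concrete, the following is a minimal Python sketch, not the authors' implementation and not the smuceR API. It treats each base as a GC indicator (1 for G or C, 0 otherwise), fits a constant GC fraction per candidate segment, and then checks every sub-interval inside each segment for deviation from that fit. The function names, the toy sequence, the Bernoulli likelihood-ratio statistic, and the scale penalty sqrt(2 log(e n / m)) are all assumptions chosen for illustration; the abstract does not specify them, and no calibrated threshold is computed here.

```python
import math


def gc_indicator(seq):
    """Map a DNA string to 0/1 indicators (1 = G or C). Hypothetical helper."""
    return [1 if base in "GC" else 0 for base in seq.upper()]


def local_lr(ones, m, p):
    """Bernoulli log-likelihood ratio of an interval with `ones` successes
    in m trials against a fitted success probability p."""
    q = ones / m
    t = 0.0
    if q > 0.0:
        t += ones * math.log(q / p)
    if q < 1.0:
        t += (m - ones) * math.log((1.0 - q) / (1.0 - p))
    return max(t, 0.0)  # guard against tiny negative rounding errors


def multiscale_stat(bits, boundaries):
    """Largest scale-penalized local statistic over all sub-intervals that lie
    inside one segment of the candidate partition given by `boundaries`.
    The penalty term is an assumed, illustrative choice."""
    n = len(bits)
    prefix = [0]
    for b in bits:
        prefix.append(prefix[-1] + b)
    worst = -math.inf
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        p = (prefix[end] - prefix[start]) / (end - start)  # fitted GC fraction
        for i in range(start, end):
            for j in range(i + 1, end + 1):
                m = j - i
                t = local_lr(prefix[j] - prefix[i], m, p)
                penalty = math.sqrt(2.0 * math.log(math.e * n / m))
                worst = max(worst, math.sqrt(2.0 * t) - penalty)
    return worst


# Toy sequence: 200 bases at 30% GC followed by 200 bases at 80% GC.
seq = "ATGATCATGA" * 20 + "GCGACCGTGC" * 20
bits = gc_indicator(seq)

stat_one = multiscale_stat(bits, [0, len(bits)])       # single-segment fit
stat_two = multiscale_stat(bits, [0, 200, len(bits)])  # two-segment fit
```

On this toy input the single-segment fit leaves a much larger statistic than the two-segment fit, because some sub-interval (here, either half of the sequence) disagrees strongly with the pooled GC fraction. A method with error control would compare such a statistic against a threshold calibrated to the user-specified probability and choose the partition with the fewest segments that passes; that calibration step is omitted in this sketch.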
Availability and implementation: Our method is implemented in the function smuceR of the R package stepR, available at http://www.stochastik.math.uni-goettingen.de/smuce.
© The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Similar articles
- Comparative testing of DNA segmentation algorithms using benchmark simulations. Mol Biol Evol. 2010 May;27(5):1015-24. doi: 10.1093/molbev/msp307. Epub 2009 Dec 16. PMID: 20018981
- High-level organization of isochores into gigantic superstructures in the human genome. Phys Rev E Stat Nonlin Soft Matter Phys. 2011 Mar;83(3 Pt 1):031908. doi: 10.1103/PhysRevE.83.031908. Epub 2011 Mar 15. PMID: 21517526
- Modified screening and ranking algorithm for copy number variation detection. Bioinformatics. 2015 May 1;31(9):1341-8. doi: 10.1093/bioinformatics/btu850. Epub 2014 Dec 25. PMID: 25542927 Free PMC article.
- UCSC genome browser tutorial. Genomics. 2008 Aug;92(2):75-84. doi: 10.1016/j.ygeno.2008.02.003. Epub 2008 Jun 2. PMID: 18514479 Review.
- Statistical challenges associated with detecting copy number variations with next-generation sequencing. Bioinformatics. 2012 Nov 1;28(21):2711-8. doi: 10.1093/bioinformatics/bts535. Epub 2012 Aug 31. PMID: 22942022 Review.
Cited by
- LDJump: Estimating variable recombination rates from population genetic data. Mol Ecol Resour. 2019 May;19(3):623-638. doi: 10.1111/1755-0998.12994. Epub 2019 Apr 4. PMID: 30666785 Free PMC article.
- Drosophila simulans: A Species with Improved Resolution in Evolve and Resequence Studies. G3 (Bethesda). 2017 Jul 5;7(7):2337-2343. doi: 10.1534/g3.117.043349. PMID: 28546383 Free PMC article.
- Whole exome sequencing of wild-derived inbred strains of mice improves power to link phenotype and genotype. Mamm Genome. 2017 Oct;28(9-10):416-425. doi: 10.1007/s00335-017-9704-9. Epub 2017 Aug 17. PMID: 28819774 Free PMC article.
- Estimating the Effective Population Size from Temporal Allele Frequency Changes in Experimental Evolution. Genetics. 2016 Oct;204(2):723-735. doi: 10.1534/genetics.116.191197. Epub 2016 Aug 19. PMID: 27542959 Free PMC article.
- On optimal multiple changepoint algorithms for large data. Stat Comput. 2017;27(2):519-533. doi: 10.1007/s11222-016-9636-3. Epub 2016 Feb 15. PMID: 32355427 Free PMC article.