Automated removal of noisy data in phylogenomic analyses
- PMID: 20976444
- DOI: 10.1007/s00239-010-9398-z
Abstract
Noisy data, especially in combination with misalignment and model misspecification, can have an adverse effect on phylogeny reconstruction; however, effective methods to identify such data are few. One particularly important class of noisy data is saturated positions. To avoid potential errors related to saturation in phylogenomic analyses, we present an automated procedure involving the step-wise removal of the most variable positions in a given data set coupled with a stopping criterion derived from correlation analyses of pairwise ML distances calculated from the deleted (saturated) and the remaining (conserved) subsets of the alignment. Through a comparison with existing methods, we demonstrate both the effectiveness of our proposed procedure for identifying noisy data and the effect of the removal of such data using a well-publicized case study involving placental mammals. At the least, our procedure will identify data sets requiring greater data exploration, and we recommend its use to investigate the effect on phylogenetic analyses of removing subsets of variable positions exhibiting weak or no correlation to the rest of the alignment. However, we would argue that this procedure, by identifying and removing noisy data, facilitates the construction of more accurate phylogenies by, for example, ameliorating potential long-branch attraction artefacts.
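The step-wise procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several simplifying assumptions: column variability is approximated by the number of distinct non-gap states, uncorrected p-distances stand in for the pairwise ML distances, and the function names, the `step` size, and the `r_stop` threshold are hypothetical. The sketch only records how the correlation between distances from the deleted and the remaining columns changes as more of the most variable columns are removed; the actual stopping criterion of the published method is not reproduced here.

```python
from itertools import combinations
from math import sqrt


def column_variability(alignment, col):
    """Count distinct non-gap states in a column (a crude variability proxy)."""
    return len({seq[col] for seq in alignment} - {"-"})


def p_distance(a, b, cols):
    """Proportion of differing sites over the given columns, skipping gaps.

    Assumption: p-distances replace the ML distances used in the paper.
    """
    compared = [(a[c], b[c]) for c in cols if a[c] != "-" and b[c] != "-"]
    if not compared:
        return 0.0
    return sum(x != y for x, y in compared) / len(compared)


def pairwise_distances(alignment, cols):
    """Distances for every pair of sequences, in a fixed pair order."""
    return [p_distance(a, b, cols) for a, b in combinations(alignment, 2)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def correlation_trace(alignment, step=1):
    """Remove the most variable columns step by step.

    Returns a list of (n_deleted, r), where r correlates the pairwise
    distances from the deleted columns with those from the remaining ones.
    """
    n_cols = len(alignment[0])
    # Most variable columns first; ties keep their original order.
    order = sorted(range(n_cols), key=lambda c: -column_variability(alignment, c))
    trace = []
    for k in range(step, n_cols, step):
        deleted, remaining = order[:k], order[k:]
        r = pearson(pairwise_distances(alignment, deleted),
                    pairwise_distances(alignment, remaining))
        trace.append((k, r))
    return trace


def first_correlated_step(trace, r_stop=0.8):
    """Hypothetical rule: first step at which the deleted subset correlates
    with the remaining subset at r >= r_stop; None if that never happens."""
    for n_deleted, r in trace:
        if r >= r_stop:
            return n_deleted
    return None


if __name__ == "__main__":
    toy = ["AAAA", "AAAC", "AACG", "ACGT"]  # toy alignment, not real data
    for n_deleted, r in correlation_trace(toy):
        print(n_deleted, round(r, 3))
```

In practice the distances would come from a maximum-likelihood program under a fitted substitution model, and the trace would be inspected alongside the resulting trees rather than reduced to a single threshold.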
Similar articles
- Impact of missing data on phylogenies inferred from empirical phylogenomic data sets. Mol Biol Evol. 2013 Jan;30(1):197-214. doi: 10.1093/molbev/mss208. Epub 2012 Aug 28. PMID: 22930702
- Removal of noisy characters from chloroplast genome-scale data suggests revision of phylogenetic placements of Amborella and Ceratophyllum. J Mol Evol. 2009 Mar;68(3):197-204. doi: 10.1007/s00239-009-9206-9. Epub 2009 Feb 27. PMID: 19247564
- Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000 Apr;17(4):540-52. doi: 10.1093/oxfordjournals.molbev.a026334. PMID: 10742046
- The phylogenetic position of Myxozoa: exploring conflicting signals in phylogenomic and ribosomal data sets. Mol Biol Evol. 2010 Dec;27(12):2733-46. doi: 10.1093/molbev/msq159. Epub 2010 Jun 24. PMID: 20576761
- Large-scale assignment of orthology: back to phylogenetics? Genome Biol. 2008 Oct 30;9(10):235. doi: 10.1186/gb-2008-9-10-235. PMID: 18983710. Free PMC article. Review.
Cited by
- Organelle Phylogenomics and Extensive Conflicting Phylogenetic Signals in the Monocot Order Poales. Front Plant Sci. 2022 Jan 31;12:824672. doi: 10.3389/fpls.2021.824672. eCollection 2021. PMID: 35173754. Free PMC article.
- Noise and biases in genomic data may underlie radically different hypotheses for the position of Iguania within Squamata. PLoS One. 2018 Aug 22;13(8):e0202729. doi: 10.1371/journal.pone.0202729. eCollection 2018. PMID: 30133514. Free PMC article.
- Two new fern chloroplasts and decelerated evolution linked to the long generation time in tree ferns. Genome Biol Evol. 2014 Apr 30;6(5):1166-73. doi: 10.1093/gbe/evu087. PMID: 24787621. Free PMC article.
- Water lily (Nymphaea thermarum) genome reveals variable genomic signatures of ancient vascular cambium losses. Proc Natl Acad Sci U S A. 2020 Apr 14;117(15):8649-8656. doi: 10.1073/pnas.1922873117. Epub 2020 Mar 31. PMID: 32234787. Free PMC article.
- A Guide to Phylogenomic Inference. Methods Mol Biol. 2024;2802:267-345. doi: 10.1007/978-1-0716-3838-5_11. PMID: 38819564