Using ESTs for phylogenomics: can one accurately infer a phylogenetic tree from a gappy alignment?
- PMID: 18366758
- PMCID: PMC2359737
- DOI: 10.1186/1471-2148-8-95
Abstract
Background: While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets.
Results: We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data is statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences.
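To make the "incorrect quartets" measure concrete, the following is a minimal sketch of one common way to score it: for every set of four taxa, compare the topology induced by the true tree with the one induced by the inferred tree. The nested-tuple tree representation and the function names (`clades`, `quartet_topology`, `incorrect_quartet_fraction`) are assumptions made for illustration, not the procedure used in the study.

```python
from itertools import combinations


def clades(tree):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):  # a leaf is a taxon name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def quartet_topology(clade_list, quartet):
    """Return the split (pair | pair) a tree induces on four taxa, or None."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = clade & q
        if len(inside) == 2:  # this clade separates two taxa from the other two
            return frozenset([inside, q - inside])
    return None  # the quartet is unresolved in this tree


def incorrect_quartet_fraction(true_tree, inferred_tree):
    """Fraction of four-taxon subsets whose topology differs between the trees."""
    true_clades = clades(true_tree)
    inferred_clades = clades(inferred_tree)
    taxa = sorted(max(true_clades, key=len))  # the largest clade is the full taxon set
    total = wrong = 0
    for quartet in combinations(taxa, 4):
        total += 1
        if quartet_topology(true_clades, quartet) != quartet_topology(inferred_clades, quartet):
            wrong += 1
    return wrong / total


# Hypothetical five-taxon example: B and C are swapped in the inferred tree.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))
print(incorrect_quartet_fraction(true_tree, inferred_tree))  # 0.4
```

In this toy example two of the five quartets (those containing A, B, and C together) are resolved differently, giving 40% incorrect quartets. The exhaustive loop is only practical for small trees; it is meant to illustrate the measure, not to be efficient.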
Conclusion: These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a major problem for phylogenetic analysis. The concern will be greatest for high-throughput phylogenomic analyses, in which Neighbor Joining is often the preferred method due to its computational efficiency. Both approaches can be used to increase the accuracy of phylogenetic inference from a gappy alignment. The choice between the two approaches will depend upon how robust the application is to the loss of sequences from the input set, with alignment masking generally giving a much greater improvement in accuracy but at the cost of discarding a larger number of the input sequences.
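The trade-off described for alignment masking, where accuracy improves at the cost of discarded sequences, can be illustrated with a simple threshold-based filter. This is only a sketch under stated assumptions: the function `mask_alignment`, its parameters `max_col_gap` and `min_seq_cov`, and the toy alignment are hypothetical, and the filtering rule is a generic heuristic, not the masking procedure evaluated in the study.

```python
def mask_alignment(alignment, max_col_gap=0.5, min_seq_cov=0.5):
    """Drop gap-rich columns, then drop sequences with low remaining coverage.

    alignment   : dict mapping sequence name -> aligned string ('-' marks a gap)
    max_col_gap : keep a column only if its gap fraction is <= this value (assumed rule)
    min_seq_cov : keep a sequence only if its non-gap fraction is >= this value (assumed rule)
    """
    names = list(alignment)
    n_cols = len(alignment[names[0]])

    # Step 1: select columns that are not dominated by gaps.
    keep_cols = [
        j for j in range(n_cols)
        if sum(alignment[name][j] == "-" for name in names) / len(names) <= max_col_gap
    ]

    # Step 2: rebuild each sequence over the kept columns and filter by coverage.
    masked = {}
    for name in names:
        seq = "".join(alignment[name][j] for j in keep_cols)
        coverage = sum(c != "-" for c in seq) / len(seq) if seq else 0.0
        if coverage >= min_seq_cov:
            masked[name] = seq
    return masked


# Hypothetical staggered alignment of partial sequences.
toy = {
    "s1": "ACGTAC--",
    "s2": "ACGTACGT",
    "s3": "----ACGT",
    "s4": "AC------",
}
masked = mask_alignment(toy)
print(sorted(masked))                          # ['s1', 's2', 's3']
print(len(masked) / len(toy))                  # fraction of input sequences retained
```

Tightening either threshold removes more gaps but retains fewer sequences, which is the cost the Conclusion attributes to masking.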
Similar articles
- SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol. 2012 Jan;61(1):90-106. doi: 10.1093/sysbio/syr095. Epub 2011 Dec 1. PMID: 22139466
- Ancestral sequence alignment under optimal conditions. BMC Bioinformatics. 2005 Nov 17;6:273. doi: 10.1186/1471-2105-6-273. PMID: 16293191 Free PMC article.
- A hierarchical model for incomplete alignments in phylogenetic inference. Bioinformatics. 2009 Mar 1;25(5):592-8. doi: 10.1093/bioinformatics/btp015. Epub 2009 Jan 15. PMID: 19147663 Free PMC article.
- Multiple sequence alignment: in pursuit of homologous DNA positions. Genome Res. 2007 Feb;17(2):127-35. doi: 10.1101/gr.5232407. PMID: 17272647 Review.
- Statistics and truth in phylogenomics. Mol Biol Evol. 2012 Feb;29(2):457-72. doi: 10.1093/molbev/msr202. Epub 2011 Aug 26. PMID: 21873298 Free PMC article. Review.
Cited by
- DendroBLAST: approximate phylogenetic trees in the absence of multiple sequence alignments. PLoS One. 2013;8(3):e58537. doi: 10.1371/journal.pone.0058537. Epub 2013 Mar 15. PMID: 23554899 Free PMC article.
- Insect phylogenomics: exploring the source of incongruence using new transcriptomic data. Genome Biol Evol. 2012;4(12):1295-309. doi: 10.1093/gbe/evs104. PMID: 23175716 Free PMC article.
- Accounting for alignment uncertainty in phylogenomics. PLoS One. 2012;7(1):e30288. doi: 10.1371/journal.pone.0030288. Epub 2012 Jan 17. PMID: 22272325 Free PMC article.
- On the phylogenetic position of Myzostomida: can 77 genes get it wrong? BMC Evol Biol. 2009 Jul 1;9:150. doi: 10.1186/1471-2148-9-150. PMID: 19570199 Free PMC article.
- Ecosystem-specific selection pressures revealed through comparative population genomics. Proc Natl Acad Sci U S A. 2010 Oct 26;107(43):18634-9. doi: 10.1073/pnas.1009480107. Epub 2010 Oct 11. PMID: 20937887 Free PMC article.