BMC Evol Biol. 2008 Mar 26:8:95.
doi: 10.1186/1471-2148-8-95.

Using ESTs for phylogenomics: can one accurately infer a phylogenetic tree from a gappy alignment?

Stefanie Hartmann et al. BMC Evol Biol. 2008.

Abstract

Background: While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets.

Results: We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments contained 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy by up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data are statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences.

Conclusion: These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a major problem for phylogenetic analysis. The concern will be greatest for high-throughput phylogenomic analyses, in which Neighbor Joining is often the preferred method due to its computational efficiency. Both approaches can be used to increase the accuracy of phylogenetic inference from a gappy alignment. The choice between the two approaches will depend upon how robust the application is to the loss of sequences from the input set, with alignment masking generally giving a much greater improvement in accuracy but at the cost of discarding a larger number of the input sequences.

Figures

Figure 1
Patterns of gappy alignments. Rows represent individual sequences, and black regions indicate missing data. A. A concatenated alignment of three genes, not all of which have been obtained from all species. B. Gap patterns used for the artificial alignments. Each gap pattern is based on a single gene family in the Phytome database. The total percentage of missing amino acids for each alignment is as follows. 1: 14%; 2: 21%; 3: 20%; 4: 29%; 5: 46%; 6: 54%; 7: 55%; 8: 60%; 9: 56%; 10: 58%. C. Example of column-deleted and random-deleted control alignments. The examples shown contain the same percentage of missing amino acids as gap pattern 4 in panel B.
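
The idea behind panels B and C can be illustrated with a minimal Python sketch: a gap pattern is overlaid on a complete alignment, the percentage of missing cells is measured, and a control with the same amount of missing data but no staggered structure is generated. The function names, the toy four-sequence alignment, and the uniform random sampling used for the control are all assumptions made for illustration; the abstract does not describe how the authors built their control alignments.

```python
import random

def missing_fraction(alignment):
    """Fraction of cells that are gaps ('-') in a list of equal-length rows."""
    total = sum(len(row) for row in alignment)
    gaps = sum(row.count("-") for row in alignment)
    return gaps / total

def apply_gap_pattern(alignment, pattern):
    """Blank out every cell where the pattern (same shape) has '-'."""
    return ["".join("-" if p == "-" else c for c, p in zip(row, prow))
            for row, prow in zip(alignment, pattern)]

def random_deleted_control(alignment, n_missing, seed=0):
    """Blank out n_missing cells chosen uniformly at random (hypothetical control)."""
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(alignment) for c in range(len(row))]
    chosen = set(rng.sample(cells, n_missing))
    return ["".join("-" if (r, c) in chosen else ch for c, ch in enumerate(row))
            for r, row in enumerate(alignment)]

# Toy complete alignment and a staggered gap pattern ('X' = keep, '-' = missing).
full = ["ACDEFGHI", "ACDEFGHK", "ACDQFGHI", "ACDEFGHI"]
pattern = ["XXXX----", "--XXXX--", "----XXXX", "XXXXXXXX"]

gappy = apply_gap_pattern(full, pattern)
n_missing = sum(row.count("-") for row in gappy)
control = random_deleted_control(full, n_missing)

print(gappy)
print(missing_fraction(gappy), missing_fraction(control))
```

Both alignments lose the same share of cells, so any difference in tree accuracy between them would reflect the arrangement of the gaps rather than their amount.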
Figure 2
Phylogenetic accuracy and the retention of sequences. A. Distribution of standardized quartet distances between estimated phylogenies and the corresponding true trees. Leftmost column: full alignments with no gap pattern applied. Green: gap pattern applied, phylogeny inferred directly (without the use of masking or SIA). Red: alignment masking. Blue: SIA. B. Proportion of sequences retained per family. Boxplots show the median (horizontal black bar) and interquartile range (colored boxes).
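
A quartet distance of the kind plotted in panel A can be sketched as follows, assuming it is the fraction of four-taxon subsets whose topology differs between two trees over the same taxa. The nested-tuple tree representation and all function names are assumptions for this example, and the brute-force enumeration is only practical for small trees; the authors' exact standardization is not given in the abstract.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set, list of leaf sets of all internal subtrees) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found

def quartet_topology(quartet, clade_list):
    """Return the split {pair, pair} the tree induces on four taxa, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = q & clade
        if len(inside) == 2:
            return frozenset([inside, q - inside])
    return None

def quartet_distance(tree_a, tree_b):
    """Fraction of quartets resolved differently by the two trees (needs >= 4 shared taxa)."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must contain the same taxa")
    quartets = list(combinations(sorted(leaves_a), 4))
    differing = sum(quartet_topology(q, clades_a) != quartet_topology(q, clades_b)
                    for q in quartets)
    return differing / len(quartets)

true_tree = ((("A", "B"), "C"), ("D", "E"))
estimated = ((("A", "C"), "B"), ("D", "E"))
print(quartet_distance(true_tree, estimated))  # 2 of 5 quartets differ -> 0.4
```

Under this definition, the "25% to 40% incorrect quartets" reported in the abstract would correspond to distances of 0.25 to 0.40.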
Figure 3
Relationship between phylogenetic accuracy and the proportion of sequences retained using REAP. The two REAP runs with the parameters determined to be optimal for the simulated data are indicated by black circles around the data points.
Figure 4
Overview of SIA method. 1. Initial gappy alignment. The example shows an alignment of six sequences (A-F); "X" represents any amino acid and "-" represents a gap or missing data. 2. The overlap-graph and two maximal cliques (green and purple). 3. Assignment of columns to cliques. The red column is placed in the smaller of the two cliques. 4. Two subalignments corresponding to the two cliques. 5. The resulting submatrices, and the combined matrix, of pairwise distances. Yellow cells are represented in both the green and purple submatrices. Orange cells must be imputed. 6. The phylogenetic tree estimated from the combined distance matrix. See text for details.
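
Steps 2 and 5 of this overview can be sketched in Python under stated assumptions. Here two sequences are joined in the overlap graph when they share at least `min_overlap` columns in which both have data, maximal cliques are enumerated with the basic Bron-Kerbosch algorithm, and a pair of sequences is flagged for imputation when no clique contains both. The threshold, the choice of clique algorithm, the function names, and the toy alignment are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def overlap_graph(alignment, min_overlap):
    """Adjacency sets: edge if two sequences share >= min_overlap non-gap columns."""
    adj = {name: set() for name in alignment}
    for a, b in combinations(alignment, 2):
        shared = sum(ca != "-" and cb != "-"
                     for ca, cb in zip(alignment[a], alignment[b]))
        if shared >= min_overlap:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion (no pivoting)."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    expand(set(), set(adj), set())
    return cliques

def pairs_to_impute(names, cliques):
    """Pairs that never occur together in a clique, so no subalignment yields their distance."""
    return [(a, b) for a, b in combinations(sorted(names), 2)
            if not any(a in c and b in c for c in cliques)]

# Toy staggered alignment of six sequences, in the spirit of the figure.
alignment = {
    "A": "XXXX----",
    "B": "XXXXX---",
    "C": "-XXXXXX-",
    "D": "--XXXXXX",
    "E": "----XXXX",
    "F": "-----XXX",
}
adj = overlap_graph(alignment, min_overlap=2)
cliques = maximal_cliques(adj)
missing_pairs = pairs_to_impute(alignment, cliques)
print(sorted(sorted(c) for c in cliques))  # [['A','B','C','D'], ['C','D','E','F']]
print(missing_pairs)                       # [('A','E'), ('A','F'), ('B','E'), ('B','F')]
```

In this toy case sequences C and D belong to both cliques, matching the yellow cells described in the caption, while the four flagged pairs correspond to the orange cells that must be imputed.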
