BMC Bioinformatics. 2004 Sep 6:5:123.
doi: 10.1186/1471-2105-5-123.

GASP: Gapped Ancestral Sequence Prediction for proteins

Richard J Edwards et al. BMC Bioinformatics. 2004.

Abstract

Background: The prediction of ancestral protein sequences from multiple sequence alignments is useful for many bioinformatics analyses. Predicting ancestral sequences is not a simple procedure and relies on accurate alignments and phylogenies. Several algorithms exist based on Maximum Parsimony or Maximum Likelihood methods but many current implementations are unable to process residues with gaps, which may represent insertion/deletion (indel) events or sequence fragments.

Results: Here we present a new algorithm, GASP (Gapped Ancestral Sequence Prediction), for predicting ancestral sequences from phylogenetic trees and the corresponding multiple sequence alignments. Alignments may be of any size and contain gaps. GASP first assigns the positions of gaps in the phylogeny before using a likelihood-based approach centred on amino acid substitution matrices to assign ancestral amino acids. Important outgroup information is used by first working down from the tips of the tree to the root, using descendant data only to assign probabilities, and then working back up from the root to the tips using descendant and outgroup data to make predictions. GASP was tested on a number of simulated datasets based on real phylogenies. Prediction accuracy for ungapped data was similar to three alternative algorithms tested, with GASP performing better in some cases and worse in others. Adding simple insertions and deletions to the simulated data did not have a detrimental effect on GASP accuracy.
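The order of operations described above (gaps first, then a pass from tips to root, then a pass from root to tips) can be illustrated with a minimal sketch for a single alignment column. This is not the GASP implementation. The `Node` class, the function names, the toy substitution probabilities `P_SAME` and `P_DIFF` (standing in for a real amino acid substitution matrix), and the rule that an ancestor is a gap only when all of its children are gaps are assumptions made for illustration. Branch lengths are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# Assumed toy model: a residue stays the same along a branch with probability
# P_SAME; a real run would use a substitution matrix such as PAM instead.
P_SAME = 0.9
P_DIFF = (1.0 - P_SAME) / (len(ALPHABET) - 1)


def sub_prob(a: str, b: str) -> float:
    return P_SAME if a == b else P_DIFF


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    residue: Optional[str] = None   # observed at tips, predicted at ancestors
    is_gap: bool = False
    down: Dict[str, float] = field(default_factory=dict)


def assign_gaps(node: Node) -> bool:
    """Step 1 (assumed rule): an ancestor is a gap only if every child is a gap."""
    if not node.children:
        node.is_gap = node.residue == "-"
    else:
        flags = [assign_gaps(c) for c in node.children]
        node.is_gap = all(flags)
    return node.is_gap


def down_pass(node: Node) -> None:
    """Step 2: tips to root, probabilities from descendant data only."""
    for c in node.children:
        down_pass(c)
    if node.is_gap:
        return
    if not node.children:
        node.down = {node.residue: 1.0}
        return
    scores = {}
    for a in ALPHABET:
        p = 1.0
        for c in node.children:
            if not c.is_gap:  # gapped children contribute no residue evidence
                p *= sum(sub_prob(a, b) * pb for b, pb in c.down.items())
        scores[a] = p
    total = sum(scores.values())
    node.down = {a: p / total for a, p in scores.items()}


def up_pass(node: Node, parent_residue: Optional[str] = None) -> None:
    """Step 3: root to tips, combining descendant data with the parent's prediction."""
    if node.is_gap:
        node.residue = "-"
    elif node.children:
        scores = dict(node.down)
        if parent_residue not in (None, "-"):
            scores = {a: p * sub_prob(parent_residue, a) for a, p in scores.items()}
        node.residue = max(ALPHABET, key=lambda a: scores[a])
    for c in node.children:
        up_pass(c, node.residue)


def predict_column(root: Node) -> None:
    assign_gaps(root)
    down_pass(root)
    up_pass(root)


# Usage: ancestor x sees tips L and M, its sister clade y sees M and M.
x = Node("x", [Node("t1", residue="L"), Node("t2", residue="M")])
y = Node("y", [Node("t3", residue="M"), Node("t4", residue="M")])
root = Node("root", [x, y])
predict_column(root)
print(root.residue, x.residue, y.residue)
```

In this example the descendants of `x` alone cannot separate L from M, and the information passed back from the rest of the tree during the second pass settles the prediction for `x`.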

Conclusions: GASP (Gapped Ancestral Sequence Prediction) will predict ancestral sequences from multiple protein alignments of any size. Although not as accurate in all cases as some of the more sophisticated maximum likelihood approaches, it can process a wide range of input phylogenies and will predict ancestral sequences for gapped and ungapped residues alike.

Figures

Figure 1
Sample output from GASP. (a) The first 100 columns of a typical GASP ancestral sequence prediction output. Sequence order matches the default vertical ordering in the tree files produced by GASP. (b) A rooted version of the input tree. Branch lengths are those defined in the input file. (c) A new version of the input tree with nodes labelled and branch lengths recalculated based on ancestral sequence prediction. Note: data are output in (a) FASTA format and (b, c) Newick format but, for visual clarity, the files are shown using (a) BioEdit [14] and (b, c) TreeExplorer [12].
Figure 2
Mean accuracies of methods using the 'PAM Variable Rates' model. Error bars are standard errors. Percentage accuracies are calculated for variable sites only (see text for details). The percentage accuracy for all sites is higher in all cases (data not shown). The 'PAM Equal Rates' and 'Random Equal Rates' models gave very similar results (data not shown). Data shown include only those phylogenies that did not crash CODEML (see text).
Figure 3
Difference in prediction accuracies between GASP and three alternative algorithms. (a) Yang et al. 1995 ML [6], (b) Pupko et al. 2000 [7] and (c) PAMP [9]. Percentage Accuracies are calculated for variable sites only and only those phylogenies that did not crash CODEML are shown (see text for details). Positive values indicate GASP is better than the other algorithm and negative values the reverse. Results for each tree depth are calculated separately. Values shown are for 'PAM Variable Rates' simulations only but the other evolutionary models give very similar results (Data not shown).
Figure 4
Difference in GASP prediction accuracies between gapped and ungapped 'PAM Equal Rates' simulations. Percentage accuracies are calculated for variable sites only (see text for details). Positive values indicate that accuracy for the gapped dataset is higher than for the corresponding ungapped dataset. Results for each tree depth are calculated separately.
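
The captions above report percentage accuracy over variable sites only, with standard errors as error bars. A minimal sketch of that kind of calculation follows. The exact definition of a variable site is given in the full text and is not reproduced here, so the rule used below (a column is variable when the tip sequences do not all share the same character) is an assumption, as are all function names.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def variable_columns(tip_seqs: Sequence[str]) -> List[int]:
    """Assumed rule: a column is variable if the tips do not all agree."""
    length = len(tip_seqs[0])
    return [i for i in range(length) if len({s[i] for s in tip_seqs}) > 1]


def percent_accuracy(predicted: str, true: str, columns: Sequence[int]) -> float:
    """Percentage of the chosen columns where the prediction matches the truth."""
    if not columns:
        raise ValueError("no columns to score")
    correct = sum(predicted[i] == true[i] for i in columns)
    return 100.0 * correct / len(columns)


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error (sample standard deviation / sqrt(n))."""
    return mean(values), stdev(values) / sqrt(len(values))


# Usage with made-up sequences: columns 1 and 3 vary among the tips.
tips = ["ACDE", "ACDF", "AGDE"]
cols = variable_columns(tips)
acc = percent_accuracy(predicted="ACDF", true="ACDE", columns=cols)
print(cols, acc)
```

Scoring variable columns only avoids inflating accuracy with invariant columns, which is consistent with the caption's remark that accuracy over all sites is higher in every case.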

References

    1. Zhang J, Nei M. Accuracies of ancestral amino acid sequences inferred by the parsimony, likelihood, and distance methods. J Mol Evol. 1997;44:S139–146.
    2. Messier W, Stewart CB. Episodic adaptive evolution of primate lysozymes. Nature. 1997;385:151–154. doi: 10.1038/385151a0.
    3. Caffrey DR, O'Neill LA, Shields DC. A method to predict residues conferring functional differences between related proteins: application to MAP kinase pathways. Protein Sci. 2000;9:655–670.
    4. Fitch WM. Toward defining course of evolution – minimum change for a specific tree topology. Systematic Zoology. 1971;20:406–416.
    5. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997;13:555–556.