Bioinformatics. 2009 Mar 1;25(5):592-8.
doi: 10.1093/bioinformatics/btp015. Epub 2009 Jan 15.

A hierarchical model for incomplete alignments in phylogenetic inference

Fuxia Cheng et al. Bioinformatics.

Abstract

Motivation: Full-length DNA and protein sequences that span the entire length of a gene are ideally used for multiple sequence alignments (MSAs) and the subsequent inference of their relationships. Frequently, however, MSAs contain a substantial amount of missing data. For example, expressed sequence tags (ESTs), which are partial sequences of expressed genes, are the predominant source of sequence data for many organisms. The patterns of missing data typical for EST-derived alignments greatly compromise the accuracy of estimated phylogenies.

Results: We present a statistical method for inferring phylogenetic trees from EST-based incomplete MSA data. We propose a class of hierarchical models for modeling pairwise distances between the sequences, and develop a fully Bayesian approach for estimation of the model parameters. Once the distance matrix is estimated, the phylogenetic tree may be constructed by applying neighbor-joining (or any other algorithm of choice). We also show that maximizing the marginal likelihood from the Bayesian approach yields similar results to a profile likelihood estimation. The proposed methods are illustrated using simulated protein families, for which the true phylogeny is known, and one real protein family.

Availability: R code for fitting these models is available from: http://people.bu.edu/gupta/software.htm.
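
The Results paragraph notes that, once the distance matrix has been estimated, the tree can be built with neighbor-joining. As an illustration of that final step only, the following is a minimal Python sketch of the textbook neighbor-joining procedure. It is not the authors' R code: the function names `neighbor_joining` and `symmetric`, the dict-of-dicts input format, the internal labels `node0`, `node1`, ... and the five-taxon example distances are all assumptions made for this sketch.

```python
from itertools import combinations


def symmetric(pairs):
    """Expand {(a, b): distance} into a symmetric dict-of-dicts."""
    d = {}
    for (a, b), v in pairs.items():
        d.setdefault(a, {})[b] = v
        d.setdefault(b, {})[a] = v
    return d


def neighbor_joining(names, dist):
    """Return the edges (node, node, length) of an unrooted tree.

    names: list of taxon labels; dist: symmetric dict-of-dicts of distances.
    Internal nodes are labelled node0, node1, ... (hypothetical labels).
    """
    d = {a: dict(dist[a]) for a in names}  # work on a copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each active node to all other active nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises the Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the new internal node to the joined pair.
        len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        edges += [(u, i, len_i), (u, j, len_j)]
        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d_uk = 0.5 * (d[i][k] + d[j][k] - d[i][j])
                d[u][k] = d_uk
                d[k][u] = d_uk
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # connect the last two nodes
    return edges


# Small additive example (made-up distances, not data from the paper).
example = symmetric({
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
})
edges = neighbor_joining(["a", "b", "c", "d", "e"], example)
```

For this example the leaf branches come out as a = 2, b = 3, c = 4, d = 2 and e = 1. In practice an established implementation (for instance in an R or Python phylogenetics package) would be preferable to a hand-written one.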

Figures

Fig. 1.
Overview of the method for an incomplete alignment example of six sequences (A,…,F). ‘X’ represents any nucleotide (or amino acid), and ‘-’ represents a gap (i.e. missing data): (1) input alignment; (2) overlap graph; (3) assignment of columns to cliques—columns 1–14 are assigned to the green clique, columns 16–25 to the blue clique. Column 15 (red) is tied between the two—it would be assigned to the blue clique; (4) concatenated columns (above) and masked subalignments (below); (5) combination of submatrices and imputation of missing values. Pairwise distances may have been estimated in only one or the other of the submatrices (green or blue), both (yellow) or neither (orange). The values of the yellow cells are estimated by the hierarchical model, while the values of the orange cells must be imputed; (6) the phylogeny inferred by Neighbor Joining.
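
Steps (1), (2) and (5) of this caption can be illustrated with a small Python sketch: it builds an overlap graph from a gapped alignment and computes a distance for every pair of sequences that shares columns, returning `None` for pairs with no overlap (the cells that would have to be imputed). This is only a sketch under stated assumptions. The plain proportion-of-differences distance, the `min_overlap` threshold, the function names and the toy sequences are all hypothetical; the clique assignment of step (3) and the hierarchical model itself are not reproduced here.

```python
from itertools import combinations

GAP = "-"  # gap / missing-data symbol, as in the caption


def shared_columns(s, t):
    """Indices of columns where both aligned sequences have data."""
    return [k for k, (a, b) in enumerate(zip(s, t)) if a != GAP and b != GAP]


def overlap_graph(alignment, min_overlap=1):
    """Edges between sequences sharing at least min_overlap columns.

    alignment: dict mapping sequence name -> aligned string.
    min_overlap is an assumed threshold, not a value from the paper.
    """
    return {
        frozenset((a, b))
        for a, b in combinations(alignment, 2)
        if len(shared_columns(alignment[a], alignment[b])) >= min_overlap
    }


def p_distance(s, t):
    """Proportion of differing sites over shared columns, or None if no overlap."""
    cols = shared_columns(s, t)
    if not cols:
        return None  # no shared data: this cell would need imputation
    return sum(s[k] != t[k] for k in cols) / len(cols)


# Toy EST-like alignment (made-up sequences for illustration).
toy = {
    "A": "ACGTAC----",
    "B": "ACGAAC----",
    "C": "----ACGGTT",
    "D": "------GGTA",
}
graph = overlap_graph(toy)
distances = {
    (a, b): p_distance(toy[a], toy[b]) for a, b in combinations(toy, 2)
}
```

In the toy data, A and D share no columns, so `distances[("A", "D")]` is `None`, while C and D differ at one of four shared columns.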
Fig. 2.
(A) Boxplots representing the posterior distribution of β. (B) The 95% highest posterior density intervals for β. The estimates get closer to the truth as the fraction of missing data decreases.
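
For readers unfamiliar with highest posterior density (HPD) intervals, the sketch below shows one common way to approximate one from posterior draws of a scalar parameter such as β: take the shortest window that contains the requested fraction of the sorted samples. The function name `hpd_interval` and the sample values are assumptions for illustration; the caption does not say how the authors computed their intervals.

```python
import math


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the samples.

    This empirical rule is appropriate for unimodal posteriors.
    """
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    k = max(1, math.ceil(mass * n))  # number of samples inside the window
    # Slide a window of k consecutive sorted samples and keep the narrowest.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


# Made-up draws: twenty values from 0 to 19 plus one outlier at 100.
draws = list(range(20)) + [100]
low, high = hpd_interval(draws, mass=0.95)
```

With these draws the window must hold 20 of the 21 values, and the shortest such window excludes the outlier, giving the interval (0, 19).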
Fig. 3.
Unrooted NJ phylogenies estimated for an alignment of a real protein family (serine hydroxymethyltransferase sequences). (A) The phylogeny computed from the complete alignment without missing data. (B) The tree computed from the incomplete alignment without pretreatment. (C) The phylogeny computed from the subdivided alignment using the Bayesian method. Sequences sharing recent common ancestry in (A) are color-coded identically in all trees for easy comparison of major differences in tree topology. For each of the trees shown in (A, B), 100 bootstrap datasets were analyzed. Nodes with support >95% are marked with a black circle. Tree bootstrapping cannot be done for the tree in (C), where the ‘EST-like’ alignment was pretreated with SIA.
