BMC Evol Biol. 2008 Jun 23;8:179.
doi: 10.1186/1471-2148-8-179.

Covariance of maximum likelihood evolutionary distances between sequences aligned pairwise

Christophe Dessimoz et al. BMC Evol Biol. 2008.

Abstract

Background: The estimation of a distance between two biological sequences is a fundamental process in molecular evolution. It is usually performed by maximum likelihood (ML) on characters aligned either pairwise or jointly in a multiple sequence alignment (MSA). Estimators for the covariance of distances between pairs taken from an MSA are known, but we are not aware of any solution for pairs aligned independently. In large-scale analyses, it may be too costly to compute MSAs every time distances must be compared, and therefore a covariance estimator for distances estimated from pairs aligned independently is desirable. Knowledge of covariances improves any process that compares or combines distances, such as generalized least-squares phylogenetic tree building, orthology inference, or lateral gene transfer detection.

Results: In this paper, we introduce an estimator for the covariance of distances from sequences aligned pairwise. Its performance is analyzed through extensive Monte Carlo simulations, and compared to the well-known variance estimator of ML distances. Our covariance estimator can be used together with the ML variance estimator to form covariance matrices.

Conclusion: The estimator performs similarly to the ML variance estimator. In particular, it shows no sign of bias when sequence divergence is below 150 PAM units (i.e. above ~29% expected sequence identity). Above that distance, the covariances tend to be underestimated, but then ML variances are also underestimated.
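
As a rough illustration of how a variance estimator and a covariance estimator can be combined into a covariance matrix over all pairwise distances, the following Python sketch may help. The callables `variance` and `covariance` are hypothetical placeholders for the ML variance estimator and the covariance estimator; their actual formulas are not given in this abstract and are not reproduced here.

```python
from itertools import combinations


def build_covariance_matrix(names, variance, covariance):
    """Assemble a symmetric variance-covariance matrix over all pairwise distances.

    `variance(pair)` and `covariance(pair_a, pair_b)` are hypothetical callables
    standing in for the ML variance estimator and the covariance estimator.
    """
    pairs = list(combinations(names, 2))  # one row/column per sequence pair
    size = len(pairs)
    matrix = [[0.0] * size for _ in range(size)]
    for i, pair in enumerate(pairs):
        matrix[i][i] = variance(pair)  # diagonal: variance of each distance
        for j in range(i + 1, size):
            value = covariance(pair, pairs[j])
            matrix[i][j] = value  # off-diagonal entries are mirrored
            matrix[j][i] = value
    return pairs, matrix
```

For n sequences the matrix has n(n-1)/2 rows, so four sequences give a 6 x 6 matrix of the kind a generalized least-squares method could take as input.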

Figures

Figure 1
Overview of approaches to estimate evolutionary distances and their covariances. A set of n sequences can be aligned jointly to obtain an MSA, or in a pairwise optimal manner, resulting in $\binom{n}{2}$ optimal pairwise alignments (OPAs). Given a hypothesis of character homology, ML distance estimation can essentially be done in two ways: jointly on a tree or pairwise. In the first case, a tree's branch lengths are estimated simultaneously, which requires an MSA. In the second case, pairwise distances are estimated either from MSA-induced pairwise alignments (IPAs) or from the OPAs. The distance estimates carry an error expressed by their variances and covariances. In all cases, the covariances can be modeled as a function of shared branch lengths, but this requires a phylogenetic tree. When distances are estimated based on an MSA, the variances and covariances can be obtained from ML theory or by bootstrapping over the MSA's columns. In the case of OPAs, these techniques cannot be directly applied (see Methods). We have previously presented a covariance estimator for the case where the two OPAs in question share a sequence (i.e. for triplets). In this paper, we introduce an estimator for the general case.
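
The MSA-based bootstrap mentioned in this caption can be sketched as follows. This is a minimal illustration, not the paper's method: it uses a Poisson-corrected distance as a simple stand-in for an ML distance, and the function names, the 0.95 saturation cap, and the replicate count are all assumptions made for the example.

```python
import math
import random


def poisson_distance(seq_a, seq_b, columns):
    """Poisson-corrected distance over the given columns (a stand-in for ML)."""
    usable = [c for c in columns if seq_a[c] != "-" and seq_b[c] != "-"]
    if not usable:
        raise ValueError("no ungapped columns in this sample")
    p = sum(seq_a[c] != seq_b[c] for c in usable) / len(usable)
    p = min(p, 0.95)  # assumed cap that keeps the logarithm finite
    return -math.log(1.0 - p)


def bootstrap_covariance(msa, pair_a, pair_b, replicates=500, seed=0):
    """Estimate the covariance of two pairwise distances by resampling MSA columns."""
    rng = random.Random(seed)
    length = len(next(iter(msa.values())))
    xs, ys = [], []
    for _ in range(replicates):
        # The same resampled columns are used for both distances in a replicate.
        cols = [rng.randrange(length) for _ in range(length)]
        xs.append(poisson_distance(msa[pair_a[0]], msa[pair_a[1]], cols))
        ys.append(poisson_distance(msa[pair_b[0]], msa[pair_b[1]], cols))
    mean_x = sum(xs) / replicates
    mean_y = sum(ys) / replicates
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (replicates - 1)
```

Sharing the resampled columns between the two distances only makes sense when both pairs come from the same MSA. With independently computed OPAs there is no common column index to resample, which is one way to see why the technique does not carry over directly.
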
Figure 2
Possible topological relations of sequences. For two pairwise distances, one can distinguish three possible underlying topological configurations relating them. If they are estimated between four sequences, there are two possible configurations. Either they share some common evolution (a) or they are independent (b). In the third configuration, the two distances are estimated from two OPAs that share a sequence (c).
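
The three configurations can be told apart mechanically once the quartet topology is known. The sketch below is an illustration under stated assumptions: the function name, the labels it returns, and the representation of the quartet as a two-sided split are all hypothetical choices, not taken from the paper.

```python
def classify_configuration(pair_a, pair_b, quartet_split=None):
    """Classify how two pairwise distances relate topologically.

    `quartet_split` is an assumed representation of the unrooted quartet tree,
    e.g. (("A", "B"), ("C", "D")) for the topology AB|CD.
    """
    a, b = frozenset(pair_a), frozenset(pair_b)
    if len(a) != 2 or len(b) != 2:
        raise ValueError("each pair must contain two distinct sequences")
    shared = a & b
    if len(shared) == 2:
        return "same distance"  # the covariance reduces to a variance
    if len(shared) == 1:
        return "triplet"  # configuration (c): the two OPAs share a sequence
    if quartet_split is None:
        raise ValueError("four distinct sequences: a quartet topology is required")
    side = frozenset(quartet_split[0])
    # A pair's path crosses the internal branch iff its ends lie on opposite sides.
    crosses_a = len(a & side) == 1
    crosses_b = len(b & side) == 1
    return "dependence" if crosses_a and crosses_b else "independence"
```

With the topology AB|CD, the distances A-C and B-D both traverse the internal branch and are classified as dependent (a), whereas A-B and C-D lie on disjoint parts of the tree and are classified as independent (b).
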
Figure 3
Comparison of the covariance estimator and the ML variance estimator with their Monte Carlo counterparts. Error bars indicate 95% confidence intervals. a) Monte Carlo covariance estimator vs. average of the covariance estimator for sequence lengths of {200, 500, 800} AA. In the dependence case, the estimator appears unbiased in most cases. In the independence case, the estimator shows a slight upward bias, but the absolute values are close to zero. In the triplet case, a downward bias with increasing covariance is visible. b) Monte Carlo variance estimator vs. average of the ML variance estimator. A downward bias with increasing variance is visible.
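
A comparison of this kind can be sketched in a few lines. The sketch assumes that the Monte Carlo reference is the sample covariance of distance pairs over simulated replicates and that the 95% interval is a normal approximation on the mean; the abstract does not state how the intervals were computed, so both are assumptions, as are the function names.

```python
import math
import statistics


def sample_covariance(xs, ys):
    """Monte Carlo reference: sample covariance of two replicated distances."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def mean_with_ci95(values):
    """Mean of the estimator outputs with a normal-approximation 95% interval."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, mean - half_width, mean + half_width


def estimator_bias(estimates, reference):
    """Bias: average estimator output minus the Monte Carlo reference."""
    return statistics.fmean(estimates) - reference
```

A negative value from `estimator_bias` corresponds to the downward bias described for the triplet case and for the ML variance.
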
Figure 4
Bias and standard deviation of the covariance and ML variance estimators. Average percentage of anchors vs. bias and standard deviation of the covariance estimator for a sequence length of 500 AA. Error bars indicate the 95% confidence intervals. The bias increases as the fraction of anchors decreases. The bias is smaller than the standard deviation when the percentage of anchors is greater than 65% (dependence), 80% (triplet) and 75% (ML variance).
Figure 5
Relation between distance and percentage of anchors. Horizontal axis: Average of the two distances for which the covariance has been estimated. Vertical axis: Average percentage of anchors. The Quartet labels refer to the dependence case. The fraction of anchors decreases with increasing distance.
Figure 6
Relative error of covariance matrix. Average relative error of variance matrices and variance/covariance matrices for a sequence length of 500 AA. Dependence and independence cases: Variance matrices and variance-covariance matrices have comparable error. Triplet case: Variance-covariance matrices have lower error.
Figure 7
Example of anchors. The six pairwise alignments of four example sequences (left) and the corresponding graph-representation (right). The consistent positions are in bold.
