BMC Evol Biol. 2008 Jun 23:8:179.

doi: 10.1186/1471-2148-8-179.

Covariance of maximum likelihood evolutionary distances between sequences aligned pairwise

Christophe Dessimoz¹, Manuel Gil

PMID: 18573206
PMCID: PMC2443136
DOI: 10.1186/1471-2148-8-179

Christophe Dessimoz et al. BMC Evol Biol. 2008.

Affiliation

¹ Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland. cdessimoz@inf.ethz.ch

Abstract

Background: The estimation of a distance between two biological sequences is a fundamental process in molecular evolution. It is usually performed by maximum likelihood (ML) on characters aligned either pairwise or jointly in a multiple sequence alignment (MSA). Estimators for the covariance of pairs from an MSA are known, but we are not aware of any solution for cases of pairs aligned independently. In large-scale analyses, it may be too costly to compute MSAs every time distances must be compared, and therefore a covariance estimator for distances estimated from pairs aligned independently is desirable. Knowledge of covariances improves any process that compares or combines distances, such as in generalized least-squares phylogenetic tree building, orthology inference, or lateral gene transfer detection.

Results: In this paper, we introduce an estimator for the covariance of distances from sequences aligned pairwise. Its performance is analyzed through extensive Monte Carlo simulations, and compared to the well-known variance estimator of ML distances. Our covariance estimator can be used together with the ML variance estimator to form covariance matrices.

Conclusion: The estimator performs similarly to the ML variance estimator. In particular, it shows no sign of bias when sequence divergence is below 150 PAM units (i.e. above ~29% expected sequence identity). Above that distance, the covariances tend to be underestimated, but then ML variances are also underestimated.
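As a rough illustration of why covariances matter when distances are compared, the following Python sketch computes the variance of the difference between two distance estimates from their 2×2 covariance matrix. The function names and the numbers in the usage note are hypothetical and do not come from the paper; in practice the variances would come from the ML variance estimator and the covariance from an estimator such as the one introduced here.

```python
import numpy as np


def difference_variance(var_a, var_b, cov_ab):
    """Variance of (d_a - d_b) given Var(d_a), Var(d_b) and Cov(d_a, d_b)."""
    # 2x2 variance-covariance matrix of the two distance estimates.
    cov = np.array([[var_a, cov_ab],
                    [cov_ab, var_b]], dtype=float)
    # Contrast vector that selects the difference d_a - d_b.
    contrast = np.array([1.0, -1.0])
    return float(contrast @ cov @ contrast)


def z_score(d_a, d_b, var_a, var_b, cov_ab):
    """Standardized difference between two distance estimates."""
    var_diff = difference_variance(var_a, var_b, cov_ab)
    if var_diff <= 0:
        raise ValueError("variance of the difference must be positive")
    return (d_a - d_b) / np.sqrt(var_diff)
```

With hypothetical values `var_a = 4`, `var_b = 9` and `cov_ab = 2`, the variance of the difference is 4 + 9 - 2·2 = 9, whereas ignoring the covariance would give 13. A positive covariance therefore makes the same observed difference more significant than a variance-only analysis would suggest.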

Figures

**Figure 1**
**Overview of approaches to estimate evolutionary distances and their covariances**. A set of n sequences can be aligned jointly to obtain an MSA, or in a pairwise optimal manner, resulting in $\binom{n}{2}$ optimal pairwise alignments (OPAs). Given a hypothesis of character homology, ML distance estimation can essentially be done in two ways: jointly on a tree or pairwise. In the first case, a tree's branch lengths are estimated simultaneously, which requires an MSA. In the second case, pairwise distances are estimated either from MSA-induced pairwise alignments (IPAs) or from the OPAs. The distance estimators are subject to error, expressed by their variances and covariances. In all cases, the covariances can be modeled as a function of shared branch lengths, but this requires a phylogenetic tree. When distances are estimated based on an MSA, the variances and covariances can be obtained from ML theory or by bootstrapping over the MSA's columns. In the case of OPAs, these techniques cannot be directly applied (see *Methods*). We have previously presented a covariance estimator for the case where the two OPAs in question share a sequence (i.e. for triplets). In this paper, we introduce an estimator for the general case.

**Figure 2**
**Possible topological relations of sequences**. For two pairwise distances, one can distinguish three possible underlying topological configurations relating them. If they are estimated between four sequences, there are two possible configurations. Either they share some common evolution (a) or they are independent (b). In the third configuration, the two distances are estimated from two OPAs that share a sequence (c).
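The counting in Figure 1 and the distinction drawn in Figure 2 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch separates the triplet case from the four-sequence case by counting shared sequences. Telling dependence (a) from independence (b) would additionally require the tree topology, which is not modeled here.

```python
from itertools import combinations
from math import comb


def pairwise_alignment_pairs(sequence_ids):
    """All unordered sequence pairs, one per optimal pairwise alignment."""
    return list(combinations(sequence_ids, 2))


def classify_distance_pair(pair_a, pair_b):
    """Classify two distances by how many sequences their pairs share."""
    shared = len(set(pair_a) & set(pair_b))
    if shared == 2:
        return "same pair"   # covariance reduces to a variance
    if shared == 1:
        return "triplet"     # configuration (c): one shared sequence
    return "quartet"         # configurations (a) or (b): four sequences


pairs = pairwise_alignment_pairs(["A", "B", "C", "D"])
# Four sequences give comb(4, 2) pairwise alignments.
assert len(pairs) == comb(4, 2)
```

For example, `classify_distance_pair(("A", "B"), ("A", "C"))` falls into the triplet case, while `classify_distance_pair(("A", "B"), ("C", "D"))` involves four distinct sequences.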

**Figure 3**
**Comparison of the covariance estimator and the ML variance estimator with their Monte Carlo counterparts**. Error bars indicate 95% confidence intervals. a) Monte Carlo covariance estimator vs. average of the covariance estimator for sequence lengths of {200, 500, 800} AA. In the dependence case, the estimator appears unbiased in most cases. In the independence case, the estimator shows a slight upward bias, but the absolute values are close to zero. In the triplet case, a downward bias with increasing covariance is visible. b) Monte Carlo variance estimator vs. average of the ML variance estimator. A downward bias with increasing variance is visible.

**Figure 4**
**Bias and standard deviation of the covariance and ML variance estimators**. Average percentage of anchors vs. bias and standard deviation of the covariance estimator for a sequence length of 500 AA. Error bars indicate the 95% confidence intervals. The bias increases as the fraction of anchors decreases. The bias is smaller than the standard deviation when the percentage of anchors is greater than 65% (dependence), 80% (triplet) and 75% (ML variance).

**Figure 5**
**Relation between distance and percentage of anchors**. Horizontal axis: Average of the two distances for which the covariance has been estimated. Vertical axis: Average percentage of anchors. The *Quartet* labels refer to the dependence case. The fraction of anchors decreases with increasing distance.

**Figure 6**
**Relative error of covariance matrix**. Average relative error of variance matrices and variance/covariance matrices for a sequence length of 500 AA. Dependence and independence cases: Variance matrices and variance-covariance matrices have comparable error. Triplet case: Variance-covariance matrices have lower error.

**Figure 7**
**Example of anchors**. The six pairwise alignments of four example sequences (left) and the corresponding graph-representation (right). The consistent positions are in bold.
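The comparison behind Figures 3 and 4, an estimator set against its Monte Carlo counterpart, can be sketched as follows. This is a minimal illustration under stated assumptions: the inputs are arrays of distance estimates and per-replicate covariance estimates from some simulation, the function names are hypothetical, and the abstract does not describe the paper's actual simulation procedure.

```python
import numpy as np


def monte_carlo_covariance(d1, d2):
    """Sample covariance of two distance estimates across replicates."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    # np.cov returns a 2x2 matrix; the off-diagonal entry is Cov(d1, d2).
    return float(np.cov(d1, d2, ddof=1)[0, 1])


def estimator_bias(estimated_covs, d1, d2):
    """Average covariance estimate minus the Monte Carlo covariance.

    A negative value indicates that the estimator underestimates
    the covariance on average.
    """
    return float(np.mean(estimated_covs)) - monte_carlo_covariance(d1, d2)
```

A negative return value from `estimator_bias` corresponds to the downward bias reported for large distances. Comparing its magnitude with the standard deviation of `estimated_covs` mirrors the comparison shown in Figure 4.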

Cited by

The evolutionary rate dynamically tracks changes in HIV-1 epidemics: application of a simple method for optimizing the evolutionary rate in phylogenetic trees with longitudinal data.
Maljkovic Berry I, Athreya G, Kothari M, Daniels M, Bruno WJ, Korber B, Kuiken C, Ribeiro RM, Leitner T. Epidemics. 2009 Dec;1(4):230-9. doi: 10.1016/j.epidem.2009.10.003. Epub 2009 Nov 12. PMID: 21352769. Free PMC article.

ALF--a simulation framework for genome evolution.
Dalquen DA, Anisimova M, Gonnet GH, Dessimoz C. Mol Biol Evol. 2012 Apr;29(4):1115-23. doi: 10.1093/molbev/msr268. Epub 2011 Dec 8. PMID: 22160766. Free PMC article.

Fast and accurate estimation of the covariance between pairwise maximum likelihood distances.
Gil M. PeerJ. 2014 Sep 25;2:e583. doi: 10.7717/peerj.583. eCollection 2014. PMID: 25279263. Free PMC article.
