BMC Bioinformatics. 2005 Apr 27;6:108.

doi: 10.1186/1471-2105-6-108.

Scoredist: a simple and robust protein sequence distance estimator

Erik L L Sonnhammer¹, Volker Hollich

PMID: 15857510
PMCID: PMC1131889
DOI: 10.1186/1471-2105-6-108

Erik L L Sonnhammer et al. BMC Bioinformatics. 2005.

Affiliation

¹ Center for Genomics and Bioinformatics, Karolinska Institutet, Berzelius väg 35, 171 77 Stockholm, Sweden. erik.sonnhammer@cgb.ki.se

Abstract

Background: Distance-based methods are popular for reconstructing evolutionary trees thanks to their speed and generality. A number of methods exist for estimating distances from sequence alignments, and these often involve some sort of correction for multiple substitutions. The problem is to accurately estimate the number of true substitutions given an observed alignment. So far, the most accurate protein distance estimators have looked for the optimal matrix in a series of transition probability matrices, e.g. the Dayhoff series. The evolutionary distance between two aligned sequences is then estimated as the evolutionary distance of the optimal matrix. The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. As a consequence, these methods are more complex to implement and computationally heavier than correction-based methods. Another problem is that the result may vary substantially depending on the evolutionary model used for the matrices. An ideal distance estimator should produce consistent and accurate distances independent of the evolutionary model used.
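
As an illustration of what a correction-based method looks like, the sketch below implements the two classical corrections named later in the abstract, in their standard textbook forms for proteins: the Jukes-Cantor correction for 20 equally likely amino acids and Kimura's empirical formula. The function names and the gap handling (columns with `-` are skipped) are assumptions made for this example and are not taken from the paper.

```python
import math


def observed_divergence(a: str, b: str) -> float:
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_protein(p: float) -> float:
    """Jukes-Cantor correction for 20 states: d = -(19/20) ln(1 - (20/19) p)."""
    b = 19.0 / 20.0
    if not 0.0 <= p < b:
        raise ValueError("p must lie in [0, 0.95) for the correction to be defined")
    return -b * math.log(1.0 - p / b)


def kimura_protein(p: float) -> float:
    """Kimura's empirical protein correction: d = -ln(1 - p - 0.2 p^2)."""
    arg = 1.0 - p - 0.2 * p * p
    if p < 0.0 or arg <= 0.0:
        raise ValueError("p is outside the range where the correction is defined")
    return -math.log(arg)


if __name__ == "__main__":
    p = observed_divergence("ACDEFGHIKL", "ACDEFGHIKV")
    # Distances are in substitutions per position; multiply by 100 for PAM units.
    print(p, jukes_cantor_protein(p), kimura_protein(p))
```

Both corrections return a distance larger than the observed divergence, because some positions are assumed to have changed more than once.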

Results: We propose a correction-based protein sequence distance estimator called Scoredist. It uses a logarithmic correction of observed divergence based on the alignment score according to the BLOSUM62 score matrix. We evaluated Scoredist and a number of optimal matrix methods using three evolutionary models for both training and testing (Dayhoff, Jones-Taylor-Thornton, and Müller-Vingron), as well as Whelan and Goldman solely for testing. Test alignments with known distances between 0.01 and 2 substitutions per position (1-200 PAM) were simulated using ROSE. Scoredist proved as accurate as the optimal matrix methods, yet substantially more robust. When trained on one model but tested on another, Scoredist was nearly always more accurate. The Jukes-Cantor and Kimura correction methods were also tested, but were substantially less accurate.
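
The abstract describes only the idea of a logarithmic correction applied to an alignment score, so the following is a rough sketch of how such an estimator could be set up, not the published formula. Three things here are assumptions made for illustration: the normalisation (subtracting the score expected for unrelated sequences and dividing by the average self-score), the toy match/mismatch scores used in place of BLOSUM62, and the default calibration factor `c = 1.0`.

```python
import math

# Hypothetical toy scores standing in for BLOSUM62 (assumption for this sketch).
MATCH, MISMATCH = 4.0, -1.0
# Expected per-position score of two unrelated residues under uniform frequencies.
EXPECTED_RANDOM = (MATCH + 19.0 * MISMATCH) / 20.0


def score(a: str, b: str) -> float:
    """Sum of per-column scores for two ungapped, equal-length sequences."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def scoredist_like(a: str, b: str, c: float = 1.0) -> float:
    """Score-based distance with a logarithmic correction (illustrative only)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, ungapped and non-empty")
    baseline = EXPECTED_RANDOM * len(a)
    # Score above the random expectation, and its upper bound from the self-scores.
    s_norm = score(a, b) - baseline
    s_upper = (score(a, a) + score(b, b)) / 2.0 - baseline
    if s_norm <= 0.0:
        raise ValueError("score at or below random expectation; distance undefined")
    return -c * math.log(s_norm / s_upper)


if __name__ == "__main__":
    print(scoredist_like("ACDEFGHIKL", "ACDEFGHIKV"))
```

Identical sequences give a distance of zero, and the distance grows as the alignment score falls towards the level expected for unrelated sequences. With a real score matrix, the factor `c` would be fitted against alignments of known distance.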

Conclusion: The Scoredist distance estimator is fast to implement and run, and combines robustness with accuracy. Scoredist has been incorporated into the Belvu alignment viewer, which is available at ftp://ftp.cgb.ki.se/pub/prog/belvu/.

Figures

**Figure 1**
**Stratified accuracy analysis of *Scoredist* and ML**. To illustrate how estimated distance depends on the model, the average deviation is plotted as a function of true distance for two evolutionary models, Dayhoff and Müller-Vingron. For each evolutionary distance between 1 and 200 PAM, 10 alignments were generated. For each alignment, the deviation was calculated as the difference between the estimated distance and the true distance used for data generation by ROSE [16]. The average of the 10 deviations was plotted using a running average with a window of 10 residues. Note that positive and negative deviations at the same true distance can cancel each other out: the curve shows only the average deviation, not the variability. The values in Table 1 measure accuracy more reliably because they use the RMSD over every data point. The test set data were created with the matrices given by Dayhoff (A) or Müller-Vingron (B). In both cases, the estimators using the same evolutionary model as the test set data perform well. However, when the model in the estimator is switched, *Scoredist* diverges less than ML, indicating that *Scoredist* is more robust. The curves show that ML-MV differs more from ML-Dayhoff than *Scoredist*-MV does from *Scoredist*-Dayhoff, particularly for the MV dataset in (B). The smaller the difference between estimates obtained with different models, the more robust the method.
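
The difference between the plotted average deviation and an RMSD, as described in the caption above, can be made concrete with a small sketch. The helper names and the example numbers are invented for illustration; only the definitions (deviation as estimate minus true distance, a running average over a fixed window, RMSD over all data points) follow the caption.

```python
import math


def deviations(estimated, true):
    """Signed deviation of each estimate from its true distance."""
    return [e - t for e, t in zip(estimated, true)]


def running_average(values, window=10):
    """Mean over each full sliding window of the given size."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def rmsd(estimated, true):
    """Root-mean-square deviation over every data point."""
    devs = deviations(estimated, true)
    return math.sqrt(sum(d * d for d in devs) / len(devs))


if __name__ == "__main__":
    true = [100.0, 100.0]
    estimated = [90.0, 110.0]   # one underestimate, one overestimate
    devs = deviations(estimated, true)
    print(sum(devs) / len(devs))   # signed deviations cancel in the mean
    print(rmsd(estimated, true))   # RMSD still reports the spread
```

In this toy case the mean deviation is zero although both estimates are off by 10 PAM, which is why a curve of averaged deviations can look better than the underlying estimates are.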

**Figure 2**
**The Belvu multiple sequence alignment viewer**. Belvu is a multiple sequence alignment viewer that implements the *Scoredist* distance estimator. The alignment window (A) shows a subset of the Pfam family DNA_pol_A (PF00476). Uniprot IDs are shown throughout. A sequence with known structure is included (DPO1_ECOLI) – the SA line showing surface accessibility and the SS line showing secondary structure. The neighbour-joining tree in (B) used uncorrected distances (observed differences), while the tree in (C) used *Scoredist* correction. Belvu assigns a colour to each species if provided with species markup information. The distance correction mainly affects the longer branches, and affects the tree topology in some cases, *e.g*. the placement of DPOQ_HUMAN. Structural markup and taxonomic information were embedded in the Stockholm format alignment provided by the Pfam database.

**Figure 3**
**Estimation of the calibration factor c in *Scoredist***. This factor rescales the raw distance d_r to optimally fit true evolutionary distances. The plot shows how c is estimated by least-squares fitting of raw distances d_r to true distances for 2000 artificially produced sequence alignments, using the Dayhoff matrix series. The linear relationship between the raw distance d_r and the true distance of the sequence samples justifies the introduction of the calibration factor c, which was here determined to be c_Dayhoff = 1.3370 (see Table 2).
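
A least-squares estimate of a single scaling factor has a closed form, sketched below. The caption does not say whether the fit includes an intercept; this example assumes a line through the origin, since c is described as a pure rescaling factor. The function name and the synthetic data are hypothetical, and the value 1.337 is reused from the caption only to generate noise-free example points.

```python
def fit_calibration_factor(raw, true):
    """Least-squares slope c minimising sum((c * raw_i - true_i)^2), no intercept."""
    if len(raw) != len(true) or not raw:
        raise ValueError("raw and true must be non-empty and of equal length")
    denominator = sum(r * r for r in raw)
    if denominator == 0.0:
        raise ValueError("all raw distances are zero; slope is undefined")
    return sum(r * t for r, t in zip(raw, true)) / denominator


if __name__ == "__main__":
    raw = [10.0, 20.0, 40.0, 80.0]        # synthetic raw distances
    true = [1.337 * r for r in raw]       # synthetic, noise-free true distances
    print(fit_calibration_factor(raw, true))
```

With real simulated alignments the points scatter around the line, and the same formula returns the slope that minimises the squared error.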

Cited by

Towards a practical O(nlogn) phylogeny algorithm.
Truszkowski J, Hao Y, Brown DG. Algorithms Mol Biol. 2012 Nov 26;7(1):32. doi: 10.1186/1748-7188-7-32. PMID: 23181935.
Adaptive evolution has targeted the C-terminal domain of the RXLR effectors of plant pathogenic oomycetes.
Win J, Morgan W, Bos J, Krasileva KV, Cano LM, Chaparro-Garcia A, Ammar R, Staskawicz BJ, Kamoun S. Plant Cell. 2007 Aug;19(8):2349-69. doi: 10.1105/tpc.107.051037. Epub 2007 Aug 3. PMID: 17675403.
Isopentenyltransferase-1 (IPT1) knockout in Physcomitrella together with phylogenetic analyses of IPTs provide insights into evolution of plant cytokinin biosynthesis.
Lindner AC, Lang D, Seifert M, Podlešáková K, Novák O, Strnad M, Reski R, von Schwartzenberg K. J Exp Bot. 2014 Jun;65(9):2533-43. doi: 10.1093/jxb/eru142. Epub 2014 Apr 1. PMID: 24692654.
The evolution of nuclear auxin signalling.
Paponov IA, Teale W, Lang D, Paponov M, Reski R, Rensing SA, Palme K. BMC Evol Biol. 2009 Jun 3;9:126. doi: 10.1186/1471-2148-9-126. PMID: 19493348.
Analysis of genome sequences from plant pathogenic Rhodococcus reveals genetic novelties in virulence loci.
Creason AL, Vandeputte OM, Savory EA, Davis EW 2nd, Putnam ML, Hu E, Swader-Hines D, Mol A, Baucher M, Prinsen E, Zdanowska M, Givan SA, El Jaziri M, Loper JE, Mahmud T, Chang JH. PLoS One. 2014 Jul 10;9(7):e101996. doi: 10.1371/journal.pone.0101996. eCollection 2014. PMID: 25010934.


References

1. Bruno WJ, Socci ND, Halpern AL. Weighted Neighbor Joining: A Likelihood-Based Approach to Distance-Based Phylogeny Reconstruction. Mol Biol Evol. 2000;17:189–197.
2. Gascuel O. BIONJ: An Improved Version of the NJ Algorithm Based on a Simple Model of Sequence Data. Mol Biol Evol. 1997;14:685–695.
3. Saitou N, Nei M. The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees. Mol Biol Evol. 1987;4:406–425.
4. Zmasek C, Eddy S. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics. 2002;3:14. doi: 10.1186/1471-2105-3-14.
5. Hollich V, Storm CE, Sonnhammer ELL. OrthoGUI: graphical presentation of Orthostrapper results. Bioinformatics. 2002;18:1272–1273. doi: 10.1093/bioinformatics/18.9.1272.
