BMC Bioinformatics. 2005 Apr 27:6:108.
doi: 10.1186/1471-2105-6-108.

Scoredist: a simple and robust protein sequence distance estimator

Erik L L Sonnhammer et al. BMC Bioinformatics. 2005.

Abstract

Background: Distance-based methods are popular for reconstructing evolutionary trees thanks to their speed and generality. A number of methods exist for estimating distances from sequence alignments, and they often involve some sort of correction for multiple substitutions. The problem is to accurately estimate the number of true substitutions given an observed alignment. So far, the most accurate protein distance estimators have looked for the optimal matrix in a series of transition probability matrices, e.g. the Dayhoff series. The evolutionary distance between two aligned sequences is then estimated as the evolutionary distance of the optimal matrix. The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. As a consequence, these methods are more complex to implement and computationally heavier than correction-based methods. Another problem is that the result may vary substantially depending on the evolutionary model used for the matrices. An ideal distance estimator should produce consistent and accurate distances independent of the evolutionary model used.

Results: We propose a correction-based protein sequence distance estimator called Scoredist. It uses a logarithmic correction of observed divergence based on the alignment score according to the BLOSUM62 score matrix. We evaluated Scoredist and a number of optimal matrix methods using three evolutionary models for both training and testing (Dayhoff, Jones-Taylor-Thornton, and Müller-Vingron), plus Whelan and Goldman solely for testing. Test alignments with known distances between 0.01 and 2 substitutions per position (1-200 PAM) were simulated using ROSE. Scoredist proved as accurate as the optimal matrix methods, yet substantially more robust. When trained on one model but tested on another, Scoredist was nearly always more accurate. The Jukes-Cantor and Kimura correction methods were also tested, but were substantially less accurate.

Conclusion: The Scoredist distance estimator is fast to implement and run, and combines robustness with accuracy. Scoredist has been incorporated into the Belvu alignment viewer, which is available at ftp://ftp.cgb.ki.se/pub/prog/belvu/.
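The abstract describes Scoredist only as a logarithmic correction of observed divergence based on the BLOSUM62 alignment score, followed by calibration. The Python sketch below shows one way such a correction could look. The exact form used here (normalising the pair score against the mean self-score after subtracting an expected random score per aligned column) is an assumption that this abstract does not spell out, and the names `scoredist_raw`, `scoredist`, `score`, `expected_random`, and `toy_score` are hypothetical. The scoring function is passed in so that a real BLOSUM62 lookup can be supplied; the toy scorer is for illustration only and is not BLOSUM62.

```python
import math
from typing import Callable


def scoredist_raw(seq1: str, seq2: str,
                  score: Callable[[str, str], float],
                  expected_random: float) -> float:
    """Raw log-corrected distance between two aligned sequences.

    ASSUMPTION: this form is a sketch, not taken from the abstract.
    Columns containing a gap ('-') in either sequence are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    n = len(pairs)
    pair_score = sum(score(a, b) for a, b in pairs)
    self1 = sum(score(a, a) for a, _ in pairs)
    self2 = sum(score(b, b) for _, b in pairs)
    # Shift both scores so that a random alignment scores about zero.
    normalised = pair_score - expected_random * n
    upper = (self1 + self2) / 2.0 - expected_random * n
    if normalised <= 0:
        return math.inf  # no better than random: distance is saturated
    return -math.log(normalised / upper) * 100.0


def scoredist(seq1: str, seq2: str,
              score: Callable[[str, str], float],
              expected_random: float, c: float) -> float:
    """Calibrated distance: the raw distance rescaled by the factor c."""
    return c * scoredist_raw(seq1, seq2, score, expected_random)


def toy_score(a: str, b: str) -> float:
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 4.0 if a == b else -1.0


if __name__ == "__main__":
    # expected_random=-0.5 and c=1.3 are made-up values for the toy scorer.
    print(scoredist("ACDE", "ACDF", toy_score, expected_random=-0.5, c=1.3))
```

Keeping the score matrix and the calibration factor as parameters mirrors the abstract's point that the estimator is trained (calibrated) on one evolutionary model and can then be tested on another.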

Figures

Figure 1
Stratified accuracy analysis of Scoredist and ML. To illustrate how estimated distance depends on the model, the average deviation is plotted as a function of true distance for two evolutionary models, Dayhoff and Müller-Vingron. For each evolutionary distance between 1 and 200 PAM, 10 alignments were generated. For each alignment, the deviation was calculated as the difference between the estimated distance and the true distance used for data generation by ROSE [16]. The average of the 10 deviations was plotted using a running average with a window of 10 residues. Note that positive and negative deviations at the same true distance can cancel each other out: the curve shows only the average deviation, not the variability. The values in Table 1 measure accuracy more faithfully by using the RMSD over every data point. The test set data was created with the matrices given by Dayhoff (A) or Müller-Vingron (B). In both cases, the estimators using the same evolutionary model as the test set data perform well. However, when the model in the estimator is switched, Scoredist diverges less than ML, indicating that Scoredist is more robust. The curves show that ML-MV differs more from ML-Dayhoff than Scoredist-MV differs from Scoredist-Dayhoff, particularly for the MV dataset in (B). The smaller the difference between estimates using different models, the more robust the method.
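The difference between the averaged deviation in the plot and the RMSD in Table 1 can be illustrated with a short Python sketch. The function names and the sample numbers are hypothetical and are not data from the paper.

```python
import math
from typing import List, Sequence


def mean_deviation(estimated: Sequence[float], true: Sequence[float]) -> float:
    """Signed average of (estimated - true); opposite errors cancel."""
    return sum(e - t for e, t in zip(estimated, true)) / len(true)


def rmsd(estimated: Sequence[float], true: Sequence[float]) -> float:
    """Root-mean-square deviation; errors of either sign add up."""
    squares = [(e - t) ** 2 for e, t in zip(estimated, true)]
    return math.sqrt(sum(squares) / len(squares))


def running_average(values: Sequence[float], window: int = 10) -> List[float]:
    """Mean over each full window of consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


if __name__ == "__main__":
    true = [50.0, 50.0]
    estimated = [45.0, 55.0]  # one underestimate, one overestimate
    print(mean_deviation(estimated, true))  # cancels out
    print(rmsd(estimated, true))            # does not cancel
```

An estimator that overshoots and undershoots equally would look perfect on the averaged curve while still having a large RMSD, which is why the caption defers to Table 1 for accuracy.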
Figure 2
The Belvu multiple sequence alignment viewer. Belvu is a multiple sequence alignment viewer that implements the Scoredist distance estimator. The alignment window (A) shows a subset of the Pfam family DNA_pol_A (PF00476). Uniprot IDs are shown throughout. A sequence with known structure is included (DPO1_ECOLI) – the SA line showing surface accessibility and the SS line showing secondary structure. The neighbour-joining tree in (B) used uncorrected distances (observed differences), while the tree in (C) used Scoredist correction. Belvu assigns a colour to each species if provided with species markup information. The distance correction mainly affects the longer branches, and affects the tree topology in some cases, e.g. the placement of DPOQ_HUMAN. Structural markup and taxonomic information were embedded in the Stockholm format alignment provided by the Pfam database.
Figure 3
Estimation of the calibration factor c in Scoredist. This factor rescales the raw distance d_r to optimally fit true evolutionary distances. The plot shows how c is estimated by least-squares fitting of raw distances d_r to true distances for 2000 artificially produced sequence alignments, using the Dayhoff matrix series. The linear relationship between the raw distance d_r and the true distance of the sequence samples justifies the introduction of the calibration factor c, which was here determined to be c_Dayhoff = 1.3370 (see Table 2).
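A least-squares estimate of a single scale factor has a closed form, sketched below in Python. The sketch assumes the fit is a line through the origin (true distance ≈ c · d_r, with no intercept), which the caption implies but does not state; the function name and the sample values are hypothetical.

```python
from typing import Sequence


def fit_calibration_factor(raw: Sequence[float], true: Sequence[float]) -> float:
    """Least-squares c minimising sum((true_i - c * raw_i)^2).

    ASSUMPTION: a fit through the origin, so c = sum(raw*true) / sum(raw^2).
    """
    if len(raw) != len(true):
        raise ValueError("raw and true distances must have equal length")
    denominator = sum(r * r for r in raw)
    if denominator == 0:
        raise ValueError("raw distances are all zero; c is undefined")
    return sum(r * t for r, t in zip(raw, true)) / denominator


if __name__ == "__main__":
    raw = [10.0, 20.0, 30.0]    # made-up raw distances d_r
    true = [13.0, 26.0, 39.0]   # made-up true distances
    print(fit_calibration_factor(raw, true))
```

With simulated alignments whose true distances are known, the same call would yield one calibration factor per evolutionary model.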

References

    1. Bruno WJ, Socci ND, Halpern AL. Weighted Neighbor Joining: A Likelihood-Based Approach to Distance-Based Phylogeny Reconstruction. Mol Biol Evol. 2000;17:189–197.
    2. Gascuel O. BIONJ: An Improved Version of the NJ Algorithm Based on a Simple Model of Sequence Data. Mol Biol Evol. 1997;14:685–695.
    3. Saitou N, Nei M. The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees. Mol Biol Evol. 1987;4:406–425.
    4. Zmasek C, Eddy S. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics. 2002;3:14. doi: 10.1186/1471-2105-3-14.
    5. Hollich V, Storm CE, Sonnhammer ELL. OrthoGUI: graphical presentation of Orthostrapper results. Bioinformatics. 2002;18:1272–1273. doi: 10.1093/bioinformatics/18.9.1272.