The limits of protein sequence comparison?

William R Pearson¹, Michael L Sierk

PMID: 15919194
PMCID: PMC2845305
DOI: 10.1016/j.sbi.2005.05.005

Review

William R Pearson et al. Curr Opin Struct Biol. 2005 Jun;15(3):254-60. doi: 10.1016/j.sbi.2005.05.005.

Affiliation

¹ Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA. wrp@virginia.edu

Abstract

Modern sequence alignment algorithms are used routinely to identify homologous proteins, proteins that share a common ancestor. Homologous proteins always share similar structures and often have similar functions. Over the past 20 years, sequence comparison has become both more sensitive, largely because of profile-based methods, and more reliable, because of more accurate statistical estimates. As sequence and structure databases become larger, and comparison methods become more powerful, reliable statistical estimates will become even more important for distinguishing similarities that are due to homology from those that are due to analogy (convergence). The newest sequence alignment methods are more sensitive than older methods, but more accurate statistical estimates are needed for their full power to be realized.

Figures

**Figure 1**
Homologs, analogs(?) and convergent evolution. Three-dimensional structures of five serine proteases: **(a)** bovine trypsin (PDB code 5PTP), **(b)** *Streptomyces griseus* trypsin (PDB code 1SGT), **(c)** *S. griseus* protease A (PDB code 2SGA), **(d)** viral serine protease (PDB code 1BEF) and **(e)** subtilisin (PDB code 1SBT). The CATH structure classification places 5PTP, 1SGT and 2SGA in the same homology category, whereas 1BEF has the same topology, but is classified as non-homologous to 5PTP. SCOP places 1BEF in the same superfamily as 5PTP. Subtilisin (1SBT) has a very different structure to the trypsin-like serine proteases and is clearly non-homologous. However, the active sites of subtilisin and trypsin are examples of convergent evolution.

**Figure 2**
Accuracy of statistical estimates. The expected Poisson probability of seeing the reported E()-value versus the observed probability of seeing a domain with a different fold according to CATH (i.e. the domains have different CATH topology classifications) for SSEARCH, PSI-BLAST, COMPASS, DALI and VAST. The E()-values for the highest scoring false-positive (different topology) for each of 86 queries from different CATH homologous superfamilies are shown. The Z-scores reported by DALI were converted into E()-values assuming an extreme value distribution (see [51••] for details). The numbers in parentheses show the number of non-homologs with reported E()<0.001.
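The "expected Poisson probability" in this caption presumably follows the standard relation between an expectation value and the chance of seeing at least one such score by chance, P = 1 - exp(-E). The following is a minimal sketch under that assumption; the function name is illustrative and does not come from the paper, and the DALI Z-score conversion described in [51••] is not reproduced here.

```python
import math


def poisson_p_from_evalue(evalue: float) -> float:
    """Probability of at least one chance hit, given an expectation value E().

    Assumes the number of chance hits is Poisson-distributed with mean E(),
    so P(at least one) = 1 - exp(-E).
    """
    # -expm1(-E) equals 1 - exp(-E) and stays accurate for very small E.
    return -math.expm1(-evalue)


if __name__ == "__main__":
    for e in (0.001, 0.01, 1.0, 10.0):
        print(f"E() = {e:g} -> P = {poisson_p_from_evalue(e):.4f}")
```

For small E() the two quantities are nearly equal, which is why an E()-value of 0.001 is often read directly as a probability; they diverge as E() approaches and exceeds 1.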

**Figure 3**
Homologs found by different search methods. Box plot of the CATH homolog coverage achieved by 86 query domains from different CATH homologous superfamilies under different error criteria for SSEARCH [54], PSI-BLAST [4], COMPASS [19•], DALI [8] and VAST [9]. The upper and lower edges of the boxes are at the 75th and 25th percentiles, respectively, with the upper and lower whiskers at the 90th and 10th percentiles. The middle line is the median coverage and the circles are the outliers. The fractions of CATH homologs identified at four thresholds are shown: reported E()<0.01 (gray boxes); E()<1 (blue); the first non-homolog according to CATH (red); the first non-topolog (different fold) according to CATH (green).
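The coverage measures in this caption can be illustrated with a small sketch. It assumes that a homolog counts as identified when its reported E()-value falls below the threshold, and that coverage up to the first false positive counts the homologs ranked ahead of the best-scoring non-homolog. The function name, input layout and example values are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def coverage(
    hits: List[Tuple[float, bool]],
    total_homologs: int,
    e_threshold: float,
) -> Tuple[float, float]:
    """Return (coverage below an E() threshold, coverage before first non-homolog).

    hits: one (evalue, is_homolog) pair per database hit for a single query.
    total_homologs: number of true homologs of the query in the database.
    """
    # Rank hits from most to least significant (smallest E() first).
    ranked = sorted(hits, key=lambda hit: hit[0])

    # Homologs reported with E() below the chosen threshold.
    below_threshold = sum(1 for e, is_hom in ranked if is_hom and e < e_threshold)

    # Homologs ranked ahead of the highest-scoring non-homolog.
    before_first_error = 0
    for _, is_hom in ranked:
        if not is_hom:
            break
        before_first_error += 1

    return below_threshold / total_homologs, before_first_error / total_homologs


if __name__ == "__main__":
    # Made-up hits for one query, purely for illustration.
    example = [(1e-30, True), (1e-8, True), (0.005, True), (0.5, False), (0.8, True)]
    print(coverage(example, total_homologs=5, e_threshold=0.01))
```

Running such a function once per query and collecting the results per method would give the distributions that a box plot summarizes.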

References

1. Wilbur WJ, Lipman DJ. Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci USA. 1983;80:726–730.
2. Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science. 1985;227:1435–1441.
3. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.
4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
5. Apweiler R, Bairoch A, Wu CH. Protein sequence databases. Curr Opin Chem Biol. 2004;8:76–80.
