A comparison of position-specific score matrices based on sequence and structure alignments

Anna R Panchenko¹, Stephen H Bryant

Affiliation

¹ Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.

PMID: 11790846
PMCID: PMC2373449
DOI: 10.1110/ps.19902

Comparative Study

Protein Sci. 2002 Feb;11(2):361-70.

Abstract

Sequence comparison methods based on position-specific score matrices (PSSMs) have proven a useful tool for recognition of the divergent members of a protein family and for annotation of functional sites. Here we investigate one of the factors that affects overall performance of PSSMs in a PSI-BLAST search, the algorithm used to construct the seed alignment upon which the PSSM is based. We compare PSSMs based on alignments constructed by global sequence similarity (ClustalW and ClustalW-pairwise), local sequence similarity (BLAST), and local structure similarity (VAST). To assess performance with respect to identification of conserved functional or structural sites, we examine the accuracy of the three-dimensional molecular models predicted by PSSM-sequence alignments. Using the known structures of those sequences as the standard of truth, we find that model accuracy varies with the algorithm used for seed alignment construction in the pattern local-structure (VAST) > local-sequence (BLAST) > global-sequence (ClustalW). Using structural similarity of query and database proteins as the standard of truth, we find that PSSM recognition sensitivity depends primarily on the diversity of the sequences included in the alignment, with an optimum around 30-50% average pairwise identity. We discuss these observations, and suggest a strategy for constructing seed alignments that optimize PSSM-sequence alignment accuracy and recognition sensitivity.
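The diversity measure mentioned above, average pairwise identity among the sequences of a seed alignment, can be illustrated with a minimal sketch. It assumes the seed is given as equal-length aligned strings with `-` for gaps and that identity is counted only over columns where both sequences have a residue. The study derives identities from VAST alignments to the test-set domain, so this is an approximation of the idea rather than the authors' procedure; the function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Percent identity over columns where both sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: columns with a gap in either sequence are ignored.
    aligned = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not aligned:
        return 0.0
    matches = sum(1 for x, y in aligned if x == y)
    return 100.0 * matches / len(aligned)


def average_pairwise_identity(aligned_seqs):
    """Mean percent identity over all unordered pairs of aligned sequences."""
    pairs = list(combinations(aligned_seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def in_reported_optimum(avg_identity, low=30.0, high=50.0):
    """True if diversity falls in the 30-50% range reported in the abstract."""
    return low <= avg_identity <= high


if __name__ == "__main__":
    seed = ["ACDE", "ACDF", "AC-E"]  # toy alignment, not real data
    avg = average_pairwise_identity(seed)
    print(f"average pairwise identity: {avg:.1f}%")
    print("within 30-50% range:", in_reported_optimum(avg))
```

A seed whose average falls well above the 30-50% range would, by the abstract's finding, be a candidate for adding more divergent family members before building the PSSM.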

Figures

**Fig. 1.**
Average contact specificity for molecular models predicted by structure–structure alignments of test-set sequences with structurally similar neighbors. The neighbors included in the averages are (as in Fig. 2) those detected by PSSMs from all four seed alignment methods we consider, with test-set domains grouped according to ranges of seed-alignment diversity. Results are shown as a boxplot (Chambers 1998), displaying the range of contact specificity values observed for each seed-alignment diversity range. The central line in each box shows the median contact specificity, the upper and lower boundaries of the box show the upper and lower quartiles, and the vertical lines extend to a value 1.5 times the interquartile range. Outlier values beyond these ranges are shown as individual points.
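The boxplot elements described in this caption can be computed as in the following minimal sketch. Two points are assumptions, not taken from the paper: quartiles use the "inclusive" interpolation of Python's `statistics.quantiles` (the S implementation cited may differ slightly), and whiskers end at the most extreme data point within 1.5 times the interquartile range of the box, which is the common Tukey convention. The function name and sample values are hypothetical.

```python
import statistics


def boxplot_summary(values):
    """Median, quartiles, whisker ends, and outliers for one boxplot."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    # Assumption: "inclusive" quartile interpolation; other conventions exist.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 20]  # toy values, not study data
    print(boxplot_summary(sample))
```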

**Fig. 2.**
Average contact specificity for molecular models predicted by PSSM-sequence alignments of test-set domains with structurally similar neighbors. PSSMs are calculated from seed alignments by VAST (a), BLAST (b), ClustalW (c), and ClustalW-pairwise (d). Test-set domains are grouped into ranges of seed-alignment diversity, based on average pairwise identity among all sequences in the seed, calculated via the VAST alignment of each sequence to the test-set domain. For purposes of comparison between different methods, contact specificity is averaged only over those neighbor sequences identified with PSI-BLAST E-value < 0.01 by all four types of PSSM.
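The restriction to neighbors found by all four PSSM types amounts to a set intersection over the significant hits of each method. The sketch below illustrates that selection step under stated assumptions: the dictionary layout, function names, neighbor labels, and numeric values are all hypothetical, and only the E-value cutoff of 0.01 comes from the caption.

```python
E_VALUE_CUTOFF = 0.01  # threshold stated in the caption


def neighbors_found_by_all(hits_by_method, cutoff=E_VALUE_CUTOFF):
    """Neighbors with E-value below the cutoff for every seed-alignment method.

    hits_by_method maps a method name to {neighbor_id: e_value}
    (a hypothetical layout chosen for this illustration).
    """
    significant = [
        {neighbor for neighbor, e_value in hits.items() if e_value < cutoff}
        for hits in hits_by_method.values()
    ]
    if not significant:
        return set()
    return set.intersection(*significant)


def mean_over(values_by_neighbor, neighbors):
    """Average a per-neighbor quantity over the selected neighbors only."""
    selected = [values_by_neighbor[n] for n in sorted(neighbors)]
    return sum(selected) / len(selected) if selected else None


if __name__ == "__main__":
    # Toy E-values, not results from the study.
    hits = {
        "VAST": {"nbrA": 1e-5, "nbrB": 0.004, "nbrC": 0.5},
        "BLAST": {"nbrA": 1e-4, "nbrB": 0.02},
        "ClustalW": {"nbrA": 0.001, "nbrB": 0.003},
        "ClustalW-pairwise": {"nbrA": 0.009, "nbrB": 0.001},
    }
    common = neighbors_found_by_all(hits)
    print(common)
```

Averaging over the common set keeps the comparison between methods on identical neighbors, so differences reflect alignment quality and not which neighbors each PSSM happened to detect.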

**Fig. 3.**
Average contact specificity for molecular models predicted by PSSM-sequence alignments of test-set domains with structurally similar neighbors. Contact specificity is averaged separately over all models predicted by sequence-PSSM alignments from BLAST, ClustalW, and ClustalW-pairwise seed alignments. These values are plotted against the average contact specificity for models of the same test-set domain predicted by sequence-PSSM alignments from VAST seed alignments.

**Fig. 4.**
PSSM recognition sensitivity for ranges of average percent identity (a) and average number of independent observations (b) of sequences in seed alignments. Each bar represents the mean recognition rate for PSSMs based on seed alignments by the methods indicated, for the indicated range of seed-alignment diversity. Domains in the test set are listed by their PDB code (lower case), chain identifier (if applicable, upper case) and domain identifiers (numeric, starting with 1 for each chain): 1a2zA, 1a66A, 1a6cA1, 1a8z, 1a96C, 1aac, 1aazA, 1abe 2, 1abrB2, 1adoA, 1afj, 1ah1, 1aizA, 1aj0, 1ajsA2, 1ak5, 1aozA1, 1aq0A1, 1ash, 1atzB, 1auyB, 1auz, 1av6A2, 1avc 1, 1aw5 1, 1aym3, 1be1, 1bebA, 1bf5A2, 1ble, 1bmdA1, 1bmtA2, 1bmv1, 1bmv21, 1bmv22, 1bovA, 1boy 1, 1boy 2, 1bp3B1, 1bp3B2, 1bquB1, 1bquB2, 1bslB, 1c25, 1cdh, 1cen, 1cfb 1, 1cfb 2, 1cpcB1, 1ctn 1, 1cto, 1cwpB, 1dcpC, 1dhr, 1din, 1dpgA1, 1dpmA, 1e2b, 1eayD, 1ebpA1, 1ebpA2, 1eca, 1eceA, 1edg, 1edhA2, 1eft 1, 1efvA1, 1efvB1, 1epaB, 1epnE2, 1f13A1, 1f13A4, 1fem, 1fivA, 1fmtA1, 1fnf 2, 1fnf 3, 1fod1, 1fts 2, 1grx, 1hbg, 1hcd, 1hjrA, 1hnf 1, 1hnf 2, 1hoe, 1hstA, 1idaA, 1itbB1, 1itbB3, 1ithA, 1jdbK5, 1jdbK8, 1jer, 1jli, 1jlxA1, 1jlxA2, 1jrhI, 1kb5B, 1ksr, 1kte, 1lea, 1lki, 1nal11, 1neu, 1nfkA2, 1occB1, 1ofgA1, 1opc 2, 1ordA1, 1pamA3, 1pdo, 1pii 1, 1pii 2, 1pnt, 1pysB7, 1qapA2, 1rcb, 1rhoA, 1ris, 1rvv1, 1scuA1, 1scuB3, 1sfe 2, 1sftA2, 1soxA3, 1sro, 1stmA, 1svb 3, 1tbgA2, 1tde 2, 1tdj 2, 1ten, 1uag 1, 1uag 3, 1udiI, 1vcaA1, 1vcaA2, 1wab 1, 1who, 1xan 3, 1xbrA, 1yub 1, 1yveI1, 1zxq 1, 1zxq 2, 2awo, 2dldA2, 2dri 2, 2fmr, 2gdm, 2gmfA, 2i1b, 2ila, 2mnr 2, 2ncm, 2pgd 1, 2pii 1, 2rspA, 2sas 2, 2stv 1, 2tmdA3, 2trxA, 2u1a, 2wbc, 3btoA2, 3chy, 3inkC, 3ullA, 5p21, 5ptp, 1tde 1.

**Fig. 5.**
Relative recognition rates for PSSMs derived from different seed alignments for ranges of average percent identity (a) and average number of independent observations (b). Each bar represents a ratio of the number of structure neighbors recognized and the number of structure neighbors one might expect to recognize for each seed-alignment diversity range (see Results section).

References

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
2. Aravind, L. and Koonin, E.V. 1999. Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J. Mol. Biol. 287: 1023–1040.
3. Berman, H.M., Bhat, T.N., Bourne, P.E., Feng, Z., Gilliland, G., Weissig, H., and Westbrook, J. 2000. The Protein Data Bank and the challenge of structural genomics. Nat. Struct. Biol. (Suppl.) 7: 957–959.
4. Brenner, S.E., Chothia, C., and Hubbard, T.J. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. 95: 6073–6078.
5. Chambers, J.M. 1998. Programming with data. A guide to the S language. Springer-Verlag, New York.
