Comparative Study
Protein Sci. 2002 Feb;11(2):361-70.
doi: 10.1110/ps.19902.

A comparison of position-specific score matrices based on sequence and structure alignments

Anna R Panchenko et al. Protein Sci. 2002 Feb.

Abstract

Sequence comparison methods based on position-specific score matrices (PSSMs) have proven a useful tool for recognition of the divergent members of a protein family and for annotation of functional sites. Here we investigate one of the factors that affects overall performance of PSSMs in a PSI-BLAST search, the algorithm used to construct the seed alignment upon which the PSSM is based. We compare PSSMs based on alignments constructed by global sequence similarity (ClustalW and ClustalW-pairwise), local sequence similarity (BLAST), and local structure similarity (VAST). To assess performance with respect to identification of conserved functional or structural sites, we examine the accuracy of the three-dimensional molecular models predicted by PSSM-sequence alignments. Using the known structures of those sequences as the standard of truth, we find that model accuracy varies with the algorithm used for seed alignment construction in the pattern local-structure (VAST) > local-sequence (BLAST) > global-sequence (ClustalW). Using structural similarity of query and database proteins as the standard of truth, we find that PSSM recognition sensitivity depends primarily on the diversity of the sequences included in the alignment, with an optimum around 30-50% average pairwise identity. We discuss these observations, and suggest a strategy for constructing seed alignments that optimize PSSM-sequence alignment accuracy and recognition sensitivity.
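
The diversity measure mentioned above, average pairwise identity among the sequences of a seed alignment, can be illustrated with a minimal sketch. It assumes the alignment is given as equal-length strings with "-" for gaps, and that identity is counted only over columns where both sequences have a residue; the paper derives identities from VAST alignments to the test-set domain, so this definition and the names `percent_identity` and `average_pairwise_identity` are illustrative assumptions, not the authors' procedure.

```python
from itertools import combinations

GAP = "-"  # assumed gap character


def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def average_pairwise_identity(alignment):
    """Mean percent identity over all unordered pairs of sequences."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment, invented for illustration only.
seed = ["ACDEFG", "ACDQFG", "AC-EYG"]
print(round(average_pairwise_identity(seed), 1))  # 74.4
```

Under this definition, a seed alignment would fall in the 30-50% range reported as optimal when its sequences agree at roughly a third to a half of their mutually aligned positions.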

Figures

Fig. 1.
Average contact specificity for molecular models predicted by structure–structure alignments of test-set sequences with structurally similar neighbors. The neighbors included in the averages are (as in Fig. 2) those detected by PSSMs from all four seed alignment methods we consider, with test-set domains grouped according to ranges of seed-alignment diversity. Results are shown as a boxplot (Chambers 1998), displaying the range of contact specificity values observed for each seed-alignment diversity range. The central line in each box shows the median contact specificity, the upper and lower boundaries of the box show the upper and lower quartiles, and the vertical lines extend to a value 1.5 times the interquartile range. Outlier values beyond these ranges are shown as individual points.
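
The boxplot elements described in this caption can be computed as in the following sketch. It assumes the common convention that each whisker ends at the most extreme data point lying within 1.5 times the interquartile range of the box, and it uses the "inclusive" quartile method from Python's `statistics` module; the S implementation cited in the caption may compute quartiles slightly differently. The function name `boxplot_summary` and the sample values are hypothetical.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Return median, quartiles, whisker ends and outliers for one box."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),    # lowest point within the fence
        "whisker_high": max(inside),   # highest point within the fence
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up contact specificity values for one diversity range.
sample = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.10]
summary = boxplot_summary(sample)
print(summary["outliers"])  # [0.1]
```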
Fig. 2.
Average contact specificity for molecular models predicted by PSSM-sequence alignments of test-set domains with structurally similar neighbors. PSSMs are calculated from seed alignments by VAST (a), BLAST (b), ClustalW (c), and ClustalW-pairwise (d). Test-set domains are grouped into ranges of seed-alignment diversity, based on average pairwise identity among all sequences in the seed, calculated via the VAST alignment of each sequence to the test-set domain. For purposes of comparison between different methods, contact specificity is averaged only over those neighbor sequences identified with PSI-BLAST E-value < 0.01 by all four types of PSSM.
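
The restriction to neighbors found by all four PSSM types amounts to a set intersection followed by a per-method mean. The sketch below shows that logic only; the data layout (dictionaries keyed by method and neighbor identifier), the function names, and every number are invented for illustration and are not results from the paper.

```python
E_VALUE_CUTOFF = 0.01  # threshold stated in the caption


def common_neighbors(hits_by_method, cutoff=E_VALUE_CUTOFF):
    """Neighbors whose E-value is below the cutoff for every method."""
    significant = [
        {name for name, e_value in hits.items() if e_value < cutoff}
        for hits in hits_by_method.values()
    ]
    return set.intersection(*significant) if significant else set()


def mean_over_common(scores_by_method, hits_by_method, cutoff=E_VALUE_CUTOFF):
    """Average each method's scores over the shared neighbors only."""
    shared = common_neighbors(hits_by_method, cutoff)
    if not shared:
        return {}
    return {
        method: sum(scores[name] for name in shared) / len(shared)
        for method, scores in scores_by_method.items()
    }


# Hypothetical E-values and contact specificities for three neighbors.
hits = {
    "VAST": {"n1": 1e-5, "n2": 0.003, "n3": 0.2},
    "BLAST": {"n1": 1e-4, "n2": 0.009, "n3": 0.001},
    "ClustalW": {"n1": 0.002, "n2": 0.005},
    "ClustalW-pairwise": {"n1": 0.001, "n2": 0.0099, "n3": 0.5},
}
scores = {
    "VAST": {"n1": 0.8, "n2": 0.6, "n3": 0.4},
    "BLAST": {"n1": 0.7, "n2": 0.5, "n3": 0.9},
    "ClustalW": {"n1": 0.6, "n2": 0.4},
    "ClustalW-pairwise": {"n1": 0.5, "n2": 0.5, "n3": 0.3},
}
print(sorted(common_neighbors(hits)))  # ['n1', 'n2']
print(mean_over_common(scores, hits))
```

Averaging over the shared subset keeps the comparison between seed-alignment methods from being skewed by neighbors that only some PSSMs detect.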
Fig. 3.
Average contact specificity for molecular models predicted by PSSM-sequence alignments of test-set domains with structurally similar neighbors. Contact specificity is averaged separately over all models predicted by sequence-PSSM alignments from BLAST, ClustalW, and ClustalW-pairwise seed alignments. These values are plotted against the average contact specificity for models of the same test-set domain predicted by sequence-PSSM alignments from VAST seed alignments.
Fig. 4.
PSSM recognition sensitivity for ranges of average percent identity (a) and average number of independent observations (b) of sequences in seed alignments. Each bar represents the mean recognition rate for PSSMs based on seed alignments by the methods indicated, for the indicated range of seed-alignment diversity.
Domains in the test set are listed by their PDB code (lower case), chain identifier (if applicable, upper case) and domain identifiers (numeric, starting with 1 for each chain): 1a2zA, 1a66A, 1a6cA1, 1a8z, 1a96C, 1aac, 1aazA, 1abe 2, 1abrB2, 1adoA, 1afj, 1ah1, 1aizA, 1aj0, 1ajsA2, 1ak5, 1aozA1, 1aq0A1, 1ash, 1atzB, 1auyB, 1auz, 1av6A2, 1avc 1, 1aw5 1, 1aym3, 1be1, 1bebA, 1bf5A2, 1ble, 1bmdA1, 1bmtA2, 1bmv1, 1bmv21, 1bmv22, 1bovA, 1boy 1, 1boy 2, 1bp3B1, 1bp3B2, 1bquB1, 1bquB2, 1bslB, 1c25, 1cdh, 1cen, 1cfb 1, 1cfb 2, 1cpcB1, 1ctn 1, 1cto, 1cwpB, 1dcpC, 1dhr, 1din, 1dpgA1, 1dpmA, 1e2b, 1eayD, 1ebpA1, 1ebpA2, 1eca, 1eceA, 1edg, 1edhA2, 1eft 1, 1efvA1, 1efvB1, 1epaB, 1epnE2, 1f13A1, 1f13A4, 1fem, 1fivA, 1fmtA1, 1fnf 2, 1fnf 3, 1fod1, 1fts 2, 1grx, 1hbg, 1hcd, 1hjrA, 1hnf 1, 1hnf 2, 1hoe, 1hstA, 1IdaA, 1itbB1, 1itbB3, 1ithA, 1jdbK5, 1jdbK8, 1jer, 1jli, 1jlxA1, 1jlxA2, 1jrhI, 1kb5B, 1ksr, 1kte, 1lea, 1lki, 1nal11, 1neu, 1nfkA2, 1occB1, 1ofgA1, 1opc 2, 1ordA1, 1pamA3, 1pdo, 1pii 1, 1pii 2, 1pnt, 1pysB7, 1qapA2, 1rcb, 1rhoA, 1ris, 1rvv1, 1scuA1, 1scuB3, 1sfe 2, 1sftA2, 1soxA3, 1sro, 1stmA, 1svb 3, 1tbgA2, 1tde 2, 1tdj 2, 1ten, 1uag 1, 1uag 3, 1udiI, 1vcaA1, 1vcaA2, 1wab 1, 1who, 1xan 3, 1xbrA, 1yub 1, 1yveI1, 1zxq 1, 1zxq 2, 2awo, 2dldA2, 2dri 2, 2fmr, 2gdm, 2gmfA, 2I1b, 2ila, 2mnr 2, 2ncm, 2pgd 1, 2pii 1, 2rspA, 2sas 2, 2stv 1, 2tmdA3, 2trxA, 2u1a, 2wbc, 3btoA2, 3chy, 3inkC, 3ullA, 5p21, 5ptp, 1tde 1.
Fig. 5.
Relative recognition rates for PSSMs derived from different seed alignments for ranges of average percent identity (a) and average number of independent observations (b). Each bar represents the ratio of the number of structure neighbors recognized to the number of structure neighbors one might expect to recognize for each seed-alignment diversity range (see Results section).

References

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
Aravind, L. and Koonin, E.V. 1999. Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J. Mol. Biol. 287: 1023–1040.
Berman, H.M., Bhat, T.N., Bourne, P.E., Feng, Z., Gilliland, G., Weissig, H., and Westbrook, J. 2000. The Protein Data Bank and the challenge of structural genomics. Nat. Struct. Biol. (Suppl.) 7: 957–959.
Brenner, S.E., Chothia, C., and Hubbard, T.J. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. 95: 6073–6078.
Chambers, J.M. 1998. Programming with data: A guide to the S language. Springer-Verlag, New York.

Publication types