Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2006 May 5:7:246.
doi: 10.1186/1471-2105-7-246.

Optimizing amino acid substitution matrices with a local alignment kernel

Affiliations

Optimizing amino acid substitution matrices with a local alignment kernel

Hiroto Saigo et al. BMC Bioinformatics. .

Abstract

Background: Detecting remote homologies by direct comparison of protein sequences remains a challenging task. We had previously developed a similarity score between sequences, called a local alignment kernel, that exhibits good performance for this task in combination with a support vector machine. The local alignment kernel depends on an amino acid substitution matrix. Since commonly used BLOSUM or PAM matrices for scoring amino acid matches have been optimized to be used in combination with the Smith-Waterman algorithm, the matrices optimal for the local alignment kernel can be different.

Results: Contrary to the local alignment score computed by the Smith-Waterman algorithm, the local alignment kernel is differentiable with respect to the amino acid substitution and its derivative can be computed efficiently by dynamic programming. We optimized the substitution matrix by classical gradient descent by setting an objective function that measures how well the local alignment kernel discriminates homologs from non-homologs in the COG database. The local alignment kernel exhibits better performance when it uses the matrices and gap parameters optimized by this procedure than when it uses the matrices optimized for the Smith-Waterman algorithm. Furthermore, the matrices and gap parameters optimized for the local alignment kernel can also be used successfully by the Smith-Waterman algorithm.

Conclusion: This optimization procedure leads to useful substitution matrices, both for the local alignment kernel and the Smith-Waterman algorithm. The best performance for homology detection is obtained by the local alignment kernel.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Learning curve. ⟨C⟩ values averaged over training, validation and testing data during the optimization process.
Figure 2
Figure 2
Comparison of C values for the LA kernel and the the SW algorithm. In the graph, LA indicates C values of the LA kernel with BLOSUM62LAOPT, and SW indicates C values of the SW algorithm with BLOSUM62SWOPT. If the two methods perform equally, every point would be aligned along the diagonal. More points distributed in the upper-left triangle shows the better performance of the LA kernel with BLOSUM62LAOPT.
Figure 3
Figure 3
Transition of score matrix through optimization. Principal components projection along the two principal eigenvectors for the distribution of BLOSUM62, BLOSUM62SWOPT, BLOSUM62LAOPT and BLOSUM62LAOPT(FINAL). The overall placement of each amino acid residue was roughly similar through the optimized matrices.
Figure 4
Figure 4
Coverage vs Error Per Query plots for the COG distant (identity less than 20%) test set. In the graph, BLOSUM62SWOPT and BLOSUM62LAOPT mean the amino acid substitution matrices optimized starting from BLOSUM62 with the SW algorithm and the LA kernel, respectively. SW scores and LA scores mean the E-values of the SW algorithm and the LA kernel algorithm, respectively. All of the proteins in the COG test set were compared with each other using the SW algorithm or the LA kernel with BLOSUM62LAOPT or BLOSUM62SWOPT. Curves located further to the right side indicate better performance.
Figure 5
Figure 5
Coverage vs Error Per Query plots for the PFAM distant (identity less than 20%) test set. In the graph, BLOSUM62SWOPT and BLOSUM62LAOPT mean the amino acid substitution matrices optimized starting from BLOSUM62 with the SW algorithm and the LA kernel, respectively. SW scores and LA scores mean the E-values of the SW algorithm and the LA kernel algorithm, respectively. All of the proteins in the PFAM distant test set were compared with each other using the SW algorithm or the LA kernel with BLOSUM62LAOPT or BLOSUM62SWOPT. Curves located further to the right side indicate better performance.
Figure 6
Figure 6
Coverage vs Error Per Query plots for the COG close (identity 25–40%) test set. In the graph, BLOSUM62SWOPT and BLOSUM62LAOPT mean the amino acid substitution matrices optimized starting from BLOSUM62 with the SW algorithm and the LA kernel, respectively. SW scores and LA scores mean the E-values of the SW algorithm and the LA kernel algorithm, respectively. All of the proteins in the COG close test set were compared with each other using the SW algorithm or the LA kernel with BLOSUM62LAOPT or BLOSUM62SWOPT. Curves located further to the right side indicate better performance.
Figure 7
Figure 7
Coverage vs Error Per Query plots for the PFAM close (identity 25–40%) test set. In the graph, BLOSUM62SWOPT and BLOSUM62LAOPT mean the amino acid substitution matrices optimized starting from BLOSUM62 with the SW algorithm and the LA kernel, respectively. SW scores and LA scores mean the E-values of the SW algorithm and the LA kernel algorithm, respectively. All of the proteins in the PFAM close test set were compared with each other using the SW algorithm or the LA kernel with BLOSUM62LAOPT or BLOSUM62SWOPT. Curves located further to the right side indicate better performance.
Figure 8
Figure 8
Distribution of Z score of the LA kernel. The curve shows that the Z scores of the LA kernel of non-homologous sequences of the same length randomly shuffled 100000 times follow the Extreme Value Distribution (e((ax)/be((ax)/b))/b MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqGGOaakcqWGLbqzdaahaaWcbeqaaiabcIcaOiabcIcaOiabdggaHjabgkHiTiabdIha4jabcMcaPiabc+caViabdkgaIjabgkHiTiabdwgaLnaaCaaameqabaGaeiikaGIaeiikaGIaemyyaeMaeyOeI0IaemiEaGNaeiykaKIaei4la8IaemOyaiMaeiykaKcaaSGaeiykaKcaaOGaei4la8IaemOyaigaaa@464A@ with a = -5.1, b = 6.7).

Similar articles

Cited by

References

    1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. A basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. doi: 10.1006/jmbi.1990.9999. - DOI - PubMed
    1. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped blast and psi-blast: A new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. - DOI - PMC - PubMed
    1. Fischer D, Eisenberg D. Finding families for genomic ORFans. Bioinformatics. 1999;15:759–762. doi: 10.1093/bioinformatics/15.9.759. - DOI - PubMed
    1. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159. - DOI - PubMed
    1. Bateman A, Birney E, Durbin R, Eddy SR, Finn RD, Sonnhammer ELL. Pfam 3.1: 1313 multiple alignments match the majority of proteins. Nucleic Acids Res. 1999;27:260–262. doi: 10.1093/nar/27.1.260. - DOI - PMC - PubMed

Publication types

LinkOut - more resources