BMC Bioinformatics. 2010 Oct 26;11 Suppl 8(Suppl 8):S7.
doi: 10.1186/1471-2105-11-S8-S7.

Exploiting physico-chemical properties in string kernels

Nora C Toussaint et al. BMC Bioinformatics. 2010.

Abstract

Background: String kernels are commonly used to classify biological sequences, both nucleotide and amino acid sequences. Although string kernels are already very powerful, they have a major shortcoming when applied to amino acids: they ignore the physico-chemical properties of the residues being compared, such as size, hydrophobicity, or charge. This information is very valuable, especially when training data are scarce. So far, only very few approaches have aimed to combine physico-chemical descriptors with string kernels.

Results: We propose new string kernels that combine the benefits of physico-chemical descriptors for amino acids with those of string kernels. We assess the proposed kernels on two problems: MHC-peptide binding classification using position-specific kernels, and protein classification based on the substring spectrum of the sequences. Our experiments demonstrate that incorporating amino acid properties into string kernels improves performance compared with standard string kernels and with previously proposed non-substring kernels.

Conclusions: The proposed modifications, in particular the combination with the RBF substring kernel, consistently yield improvements without affecting the computational complexity. The proposed kernels therefore appear to be the kernels of choice for any protein sequence-based inference.

Availability: Data sets, code and additional information are available from http://www.fml.tuebingen.mpg.de/raetsch/suppl/aask. Implementations of the developed kernels are available as part of the Shogun toolbox.
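
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that is distributed with the Shogun toolbox); it assumes the standard weighted degree (WD) kernel with weights 2(d - k + 1) / (d(d + 1)) and replaces the exact substring match by a Gaussian RBF on concatenated amino acid descriptors. The descriptor table `TOY_DESCRIPTORS`, the function names, and the parameters `degree` and `sigma` are hypothetical, and the descriptor values are illustrative placeholders rather than the features used in the paper.

```python
import math

# Toy two-dimensional descriptors: (hydropathy-like score, net charge).
# ASSUMPTION: illustrative placeholder values for a few residues only.
TOY_DESCRIPTORS = {
    "A": (1.8, 0.0),
    "D": (-3.5, -1.0),
    "E": (-3.5, -1.0),
    "G": (-0.4, 0.0),
    "I": (4.5, 0.0),
    "K": (-3.9, 1.0),
    "L": (3.8, 0.0),
    "V": (4.2, 0.0),
}


def wd_weight(k, degree):
    """Standard weighted degree weight for substrings of length k."""
    return 2.0 * (degree - k + 1) / (degree * (degree + 1))


def exact_match(u, v):
    """Substring similarity of the plain WD kernel: 1 if identical, else 0."""
    return 1.0 if u == v else 0.0


def substring_rbf(u, v, sigma):
    """Gaussian RBF between the concatenated descriptors of two substrings."""
    sq_dist = 0.0
    for a, b in zip(u, v):
        for fa, fb in zip(TOY_DESCRIPTORS[a], TOY_DESCRIPTORS[b]):
            sq_dist += (fa - fb) ** 2
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def wd_kernel(x, y, degree, substring_sim):
    """Position-specific kernel: compare substrings at identical positions."""
    if len(x) != len(y):
        raise ValueError("position-specific kernels need equal-length sequences")
    total = 0.0
    for k in range(1, degree + 1):
        for i in range(len(x) - k + 1):
            total += wd_weight(k, degree) * substring_sim(x[i:i + k], y[i:i + k])
    return total


def wd_rbf_kernel(x, y, degree=3, sigma=2.0):
    """WD kernel with the exact match replaced by an RBF on descriptors."""
    return wd_kernel(x, y, degree, lambda u, v: substring_rbf(u, v, sigma))
```

With these toy descriptors, the peptides `"ALK"` and `"AIK"` differ only by a leucine-to-isoleucine exchange. The plain WD kernel treats that position as a complete mismatch, whereas the RBF variant still assigns a high similarity because the two residues have nearly identical descriptors. Both variants loop over the same positions and substring lengths.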


Figures

Figure 1
Learning curve analysis on MHC allele A*0201. Shown are areas under the ROC curve, averaged over 100 different test splits (30% of the data) for increasing numbers of training examples (up to 70% of the data). The training part was used for training and for model selection by 5-fold cross-validation.
Figure 2
Performance of WD and WD-RBF (blosum50) kernels on human MHC alleles from the IEDB benchmark data set. The pie chart displays the number of alleles for which the WD (green) and the WD-RBF (red) performed best, respectively, and the number of alleles for which they performed equally (blue).
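
The evaluation protocol described in the Figure 1 caption (repeated 70/30 splits, area under the ROC curve averaged over the splits, a growing number of training examples) can be sketched as follows. This is a sketch under stated assumptions, not the authors' evaluation code: the splits are stratified by class here, which the caption does not specify, and `fit_and_score` is a hypothetical callback that trains a model on the training part (including any model selection, such as the 5-fold cross-validation mentioned in the caption) and returns one score per test example.

```python
import random


def roc_auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_split(labels, test_fraction, rng):
    """Split indices per class (ASSUMPTION: stratification is our choice)."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    rng.shuffle(train_idx)  # mix classes so truncation below is a random subset
    return train_idx, test_idx


def mean_auc_over_splits(examples, labels, fit_and_score, n_splits=100,
                         test_fraction=0.3, n_train=None, seed=0):
    """Average test AUC over repeated random splits.

    n_train optionally limits the number of training examples, which gives
    one point of a learning curve per value of n_train.
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_splits):
        train_idx, test_idx = stratified_split(labels, test_fraction, rng)
        if n_train is not None:
            train_idx = train_idx[:n_train]
        train_x = [examples[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [examples[i] for i in test_idx]
        test_y = [labels[i] for i in test_idx]
        scores = fit_and_score(train_x, train_y, test_x)
        aucs.append(roc_auc(scores, test_y))
    return sum(aucs) / len(aucs)
```

Calling `mean_auc_over_splits` once per training-set size and plotting the results against `n_train` would produce a learning curve of the kind the caption describes.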
