Maximum margin classifier working in a set of strings

Hitoshi Koyano¹, Morihiro Hayashida², Tatsuya Akutsu²

Affiliations

¹ Laboratory of Biostatistics and Bioinformatics , Graduate School of Medicine, Kyoto University , 54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507, Japan.
² Laboratory of Mathematical Bioinformatics , Institute for Chemical Research, Kyoto University , Gokasho, Uji, Kyoto 611-0011, Japan.

PMID: 27118908
PMCID: PMC4841474
DOI: 10.1098/rspa.2015.0551

Maximum margin classifier working in a set of strings

Hitoshi Koyano et al. Proc Math Phys Eng Sci. 2016 Mar.

. 2016 Mar;472(2187):20150551.

doi: 10.1098/rspa.2015.0551.

Authors

Hitoshi Koyano¹, Morihiro Hayashida², Tatsuya Akutsu²

Affiliations

¹ Laboratory of Biostatistics and Bioinformatics , Graduate School of Medicine, Kyoto University , 54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507, Japan.
² Laboratory of Mathematical Bioinformatics , Institute for Chemical Research, Kyoto University , Gokasho, Uji, Kyoto 611-0011, Japan.

PMID: 27118908
PMCID: PMC4841474
DOI: 10.1098/rspa.2015.0551

Abstract

Numbers and numerical vectors account for a large portion of data. However, recently, the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem for a consensus sequence of strings demonstrated by one of the authors and co-workers in a previous study. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein-protein interactions using amino acid sequences and classifying RNAs by the secondary structure using nucleotide sequences.

Keywords: bioinformatics; machine learning; probability theory; statistical asymptotics; strings.

PubMed Disclaimer

References

1. Gusfield D. 1997. Algorithms on strings, trees, and sequences. Cambridge, UK: Cambridge University Press.
1. Crochemore M, Rytter W. 2002. Jewels of stringology. Singapore: World Scientific.
1. Koyano H, Kishino H. 2010. Quantifying biodiversity and asymptotics for a sequence of random strings. Phys. Rev. E 81, 061912 (doi:10.1103/PhysRevE.81.061912) - DOI - PubMed
1. Koyano H, Tsubouchi T, Kishino H, Akutsu T. 2014. Archaeal β diversity patterns under the seafloor along geochemical gradients. J. Geophys. Res. G (Biogeosciences) 119, 1770–1788. (doi:10.1002/2014JG002676) - DOI
1. Aizerman MA, Braverman EM, Rozoner LI. 1964. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control 25, 821–837.

LinkOut - more resources

Full Text Sources
- Europe PubMed Central
- PubMed Central
Other Literature Sources
- scite Smart Citations

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Maximum margin classifier working in a set of strings

Affiliations

Maximum margin classifier working in a set of strings

Authors

Affiliations

Abstract

References

LinkOut - more resources

Full Text Sources

Other Literature Sources