Proc Math Phys Eng Sci. 2016 Mar;472(2187):20150551.
doi: 10.1098/rspa.2015.0551

Maximum margin classifier working in a set of strings

Hitoshi Koyano et al. Proc Math Phys Eng Sci. 2016 Mar.

Abstract

Numbers and numerical vectors account for a large portion of data. Recently, however, the amount of string data generated has increased dramatically, and classifying string data has become a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and then apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion loses information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine under the assumption that the data used to train and test the machine are strings generated according to probability laws. In this study, we approach the classification problem by constructing a classifier that works directly in a set of strings. Evaluating the generalization error of such a classifier theoretically requires probability theory for strings. We therefore first extend a limit theorem for a consensus sequence of strings demonstrated by one of the authors and co-workers in a previous study. Using this result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Finally, we demonstrate the practical usefulness of our machine by applying it to predicting protein-protein interactions from amino acid sequences and classifying RNAs by their secondary structure from nucleotide sequences.

Keywords: bioinformatics; machine learning; probability theory; statistical asymptotics; strings.
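
Below is a minimal sketch, for illustration only, of the baseline approach the abstract contrasts with: mapping strings to numerical vectors with a k-mer spectrum kernel and classifying them with a support vector machine on a precomputed Gram matrix. The spectrum kernel, the choice k = 3, the toy RNA sequences and labels, and the use of scikit-learn are all assumptions made for this sketch; it is not the authors' proposed classifier, which works directly in the set of strings.

from collections import Counter
from itertools import product

import numpy as np
from sklearn.svm import SVC

def spectrum_features(seq, k=3, alphabet="ACGU"):
    # Count occurrences of every length-k substring over the alphabet.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return np.array([counts[m] for m in kmers], dtype=float)

def spectrum_kernel(X, Y, k=3):
    # Gram matrix of inner products between k-mer count vectors.
    FX = np.stack([spectrum_features(s, k) for s in X])
    FY = np.stack([spectrum_features(s, k) for s in Y])
    return FX @ FY.T

# Toy RNA sequences with made-up labels, purely for illustration.
train = ["AUGCUAGCUAGG", "GGCAUUCGAUCC", "AUAUAUAUGCAA", "GCGCGCAUUAGG"]
labels = [0, 1, 0, 1]
test = ["AUGCAUGCUAGG", "GCGCAUUCGACC"]

clf = SVC(kernel="precomputed")
clf.fit(spectrum_kernel(train, train), labels)
print(clf.predict(spectrum_kernel(test, train)))

Because the kernel collapses distinct strings with the same k-mer counts onto the same vector, the mapping is not one-to-one, which is the information loss the abstract refers to.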

References

    1. Gusfield D. 1997. Algorithms on strings, trees, and sequences. Cambridge, UK: Cambridge University Press.
    2. Crochemore M, Rytter W. 2002. Jewels of stringology. Singapore: World Scientific.
    3. Koyano H, Kishino H. 2010. Quantifying biodiversity and asymptotics for a sequence of random strings. Phys. Rev. E 81, 061912. (doi:10.1103/PhysRevE.81.061912)
    4. Koyano H, Tsubouchi T, Kishino H, Akutsu T. 2014. Archaeal β diversity patterns under the seafloor along geochemical gradients. J. Geophys. Res. G (Biogeosciences) 119, 1770–1788. (doi:10.1002/2014JG002676)
    5. Aizerman MA, Braverman EM, Rozoner LI. 1964. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control 25, 821–837.