Maximum margin classifier working in a set of strings
- PMID: 27118908
- PMCID: PMC4841474
- DOI: 10.1098/rspa.2015.0551
Maximum margin classifier working in a set of strings
Abstract
Numbers and numerical vectors account for a large portion of data. However, recently, the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves a loss of information and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. In this study, we approach this classification problem by constructing a classifier that works in a set of strings. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. Therefore, we first extend a limit theorem for a consensus sequence of strings demonstrated by one of the authors and co-workers in a previous study. Using the obtained result, we then demonstrate that our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein-protein interactions using amino acid sequences and classifying RNAs by the secondary structure using nucleotide sequences.
Keywords: bioinformatics; machine learning; probability theory; statistical asymptotics; strings.
Similar articles
-
Fast exact algorithms for the closest string and substring problems with application to the planted (L, d)-motif model.IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21. IEEE/ACM Trans Comput Biol Bioinform. 2011. PMID: 21282867
-
Fast motif recognition via application of statistical thresholds.BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S11. doi: 10.1186/1471-2105-11-S1-S11. BMC Bioinformatics. 2010. PMID: 20122182 Free PMC article.
-
Implementation of a Hamming distance-like genomic quantum classifier using inner products on ibmqx2 and ibmq_16_melbourne.Quantum Mach Intell. 2020;2(1):1-26. doi: 10.1007/s42484-020-00017-7. Epub 2020 Jul 17. Quantum Mach Intell. 2020. PMID: 32879908 Free PMC article.
-
Closest string with outliers.BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S55. doi: 10.1186/1471-2105-12-S1-S55. BMC Bioinformatics. 2011. PMID: 21342588 Free PMC article.
-
On Supervised Classification of Feature Vectors with Independent and Non-Identically Distributed Elements.Entropy (Basel). 2021 Aug 13;23(8):1045. doi: 10.3390/e23081045. Entropy (Basel). 2021. PMID: 34441185 Free PMC article.
References
-
- Gusfield D. 1997. Algorithms on strings, trees, and sequences. Cambridge, UK: Cambridge University Press.
-
- Crochemore M, Rytter W. 2002. Jewels of stringology. Singapore: World Scientific.
-
- Koyano H, Kishino H. 2010. Quantifying biodiversity and asymptotics for a sequence of random strings. Phys. Rev. E 81, 061912 (doi:10.1103/PhysRevE.81.061912) - DOI - PubMed
-
- Koyano H, Tsubouchi T, Kishino H, Akutsu T. 2014. Archaeal β diversity patterns under the seafloor along geochemical gradients. J. Geophys. Res. G (Biogeosciences) 119, 1770–1788. (doi:10.1002/2014JG002676) - DOI
-
- Aizerman MA, Braverman EM, Rozoner LI. 1964. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control 25, 821–837.
LinkOut - more resources
Full Text Sources
Other Literature Sources