Brief Bioinform. 2019 Nov 27;20(6):2185-2199.

doi: 10.1093/bib/bby079.

Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework

Yanju Zhang¹, Ruopeng Xie¹, Jiawei Wang², André Leier³,⁴, Tatiana T Marquez-Lago³,⁴, Tatsuya Akutsu⁵, Geoffrey I Webb⁶, Kuo-Chen Chou⁷,⁸, Jiangning Song⁶,⁹,¹⁰

Affiliations

¹ School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China.
² Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, VIC 3800, Australia.
³ Department of Genetics, School of Medicine, University of Alabama at Birmingham, AL, USA.
⁴ Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA.
⁵ Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan.
⁶ Monash Centre for Data Science, Faculty of Information Technology, Monash University, VIC 3800, Australia.
⁷ Gordon Life Science Institute, Boston, MA 02478, USA.
⁸ Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
⁹ Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, VIC 3800, Australia.
¹⁰ ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, VIC 3800, Australia.

PMID: 30351377
PMCID: PMC6954445
DOI: 10.1093/bib/bby079

Yanju Zhang et al. Brief Bioinform. 2019.


Abstract

As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of different features and machine learning (ML) methods that are required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed [Light Gradient Boosting Machine (LightGBM)] are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integration of the single method-based models through ensemble learning further improves the prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/. 
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.

Keywords: Light Gradient Boosting Machine; computational prediction; ensemble learning; feature encoding methods; lysine malonylation; machine learning.

Figures

**Figure 1**
The overall framework of kmal-sp: (A) An outline of the overall flowchart of the kmal-sp methodology. (B) An illustration of the detailed procedures for constructing the prediction models for each species. First, the collected protein sequences are split into segments with a length (window size) of 25 residues each. Based on these segments, 11 different types of features are extracted that characterize Kmal sites in different aspects (these features are categorized into three main groups). Second, the optimal feature set is selected by applying the GainRatio method to the combined feature set. Based on the optimal feature set, we train the prediction models using several different ML algorithms and also exploit the integration of individual algorithm-based models into ensemble models. Finally, the optimal ensemble model is generated and applied to predict potential Kmal sites with improved accuracy.
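
The segmentation step in panel (B) can be sketched as follows. This is a minimal illustration, not the authors' code: the 25-residue window comes from the caption, while the function name `lysine_windows` and the use of `X` to pad positions beyond the sequence termini are assumptions made for the example.

```python
from typing import List, Tuple

WINDOW = 25           # segment length stated in the Figure 1 caption
FLANK = WINDOW // 2   # 12 residues on each side of the central lysine
PAD = "X"             # assumption: placeholder for positions beyond the termini


def lysine_windows(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based position, 25-residue segment) for every lysine (K)."""
    seq = sequence.upper()
    padded = PAD * FLANK + seq + PAD * FLANK
    segments = []
    for i, residue in enumerate(seq):
        if residue == "K":
            # In the padded string the residue sits at index i + FLANK,
            # so the slice starting at i is centred on it.
            segments.append((i + 1, padded[i:i + WINDOW]))
    return segments


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to show the call.
    for position, segment in lysine_windows("MKTAYIAKQR"):
        print(position, segment)
```

Each segment would then be labelled as a malonylation or non-malonylation site and passed to the feature encoders; that labelling depends on the experimental annotations and is not shown here.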

**Figure 2**
Sequence characteristics of Kmal sites across the three species. Panels (A), (B) and (C) illustrate the over-represented and under-represented amino acid occurrences in the segments flanking the central Kmal sites of *E. coli*, *M. musculus* and *H. sapiens*, respectively. Sequence logo representations were generated by Two Sample Logo with t-test (P < .05). Panel (D) represents distributions of the sequential distances between malonylation and non-malonylation segments within the same protein sequences.

**Figure 3**
Performance comparison of the RF models trained using 11 different feature types based on 10-time 10-fold cross-validation tests for *E. coli*, *M. musculus* and *H. sapiens*. Randomized 10-fold cross-validation tests were conducted 10 times. The final performance of the RF models was averaged over the 10 times, with the standard error calculated and shown in bars.
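
The evaluation protocol in this caption (10 repeats of randomized 10-fold cross-validation, averaged over the repeats with a standard error) can be sketched in plain Python. The sketch assumes a caller-supplied `score_fold(train_idx, test_idx)` callback that trains a model and returns a score such as AUC; that callback, the function names and the seed handling are hypothetical and are not taken from the paper.

```python
import random
import statistics
from typing import Callable, List, Tuple


def kfold_indices(n: int, k: int, rng: random.Random) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]


def repeated_cv(
    n_samples: int,
    score_fold: Callable[[List[int], List[int]], float],
    k: int = 10,
    repeats: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean, standard error) of the per-repeat average score."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(repeats):
        fold_scores = []
        for held_out in kfold_indices(n_samples, k, rng):
            held = set(held_out)
            train = [i for i in range(n_samples) if i not in held]
            fold_scores.append(score_fold(train, held_out))
        # One number per repeat: the average over its k folds.
        run_scores.append(statistics.mean(fold_scores))
    mean = statistics.mean(run_scores)
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, sem
```

The folds here are not stratified by class; a stratified split may be preferable when positive and negative sites are imbalanced.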

**Figure 4**
Performance comparison of RF models trained using different feature sets across the three species. Each feature set was assessed by applying GainRatio (‘gr’) to the original feature sets. Ten-fold cross-validation tests were randomly performed 10 times, and the performance was averaged with calculated standard deviations. Red stars denote the feature set with the overall best performance for the corresponding species, while blue circles represent the original feature set, prior to feature selection.
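
For readers unfamiliar with the GainRatio criterion, the sketch below computes it for a single discrete feature as information gain divided by the entropy of the feature itself (the split information). It is a textbook formulation under the assumption that feature values are already discretized; the source does not state which implementation or discretization the authors used, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def gain_ratio(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of `feature` about `labels`, divided by H(feature)."""
    if len(feature) != len(labels):
        raise ValueError("feature and labels must have the same length")
    split_info = entropy(feature)
    if split_info == 0.0:
        return 0.0  # a constant feature carries no information
    n = len(labels)
    conditional = 0.0
    for value, count in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (count / n) * entropy(subset)
    return (entropy(labels) - conditional) / split_info
```

Features can then be ranked by this score and the top-scoring subset retained, which is the kind of selection the caption describes.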

**Figure 5**
Distribution analysis of generated optimal feature sets across the three species. Panels (A), (B) and (C) illustrate distributions of feature types included in the optimal feature sets for *E. coli*, *M. musculus* and *H. sapiens*, respectively. In each panel (A, B and C), (1) and (2) show the percentage and the number, respectively, of each feature type selected in the optimal feature set, (3) depicts the proportion of the types of features selected in the optimal feature set while (4) provides the GainRatio score for the top 100 selected features in the optimal feature set.

**Figure 6**
Performance comparison between our proposed method kmal-sp and the state-of-the-art method MaloPred for predicting malonylation sites. (A), (B) and (C) ROC curves of both methods on the independent test for predicting malonylation sites of *E. coli*, *M. musculus*, and *H. sapiens*, respectively. (D) histograms showing the performance of kmal-sp and MaloPred in terms of MCC on the independent test.
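
The two metrics named in this caption, the area under the ROC curve (AUC) and the Matthews correlation coefficient (MCC), can be computed with the standard definitions below. This is a minimal sketch for binary labels coded as 1 (malonylated) and 0 (not malonylated); the label coding and function names are assumptions, and the paper's own evaluation code is not shown in the source.

```python
import math
from typing import Sequence


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for illustration; a rank-based computation would scale better on large test sets.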

**Figure 7**
Screenshot of the online web server kmal-sp: (A) the user submission interface and (B) the predicted result for a case study protein sequence as input.

Grants and funding

R01 AI111965/AI/NIAID NIH HHS/United States
