Enhancing the Biological Relevance of Machine Learning Classifiers for Reverse Vaccinology

Affiliations

¹ Faculty of Medicine, University of Southampton, Southampton SO17 1BJ, UK. a.heinson@soton.ac.uk.
² Faculty of Medicine, University of Southampton, Southampton SO17 1BJ, UK. y.p.gunawardana@soton.ac.uk.
³ Faculty of Medicine, University of Southampton, Southampton SO17 1BJ, UK. bastiaanmoesker@gmail.com.
⁴ London School of Hygiene and Tropical Medicine (LSHTM), Department of Pathogen Molecular Biology, London WC1E 7HT, UK. carmen.denman@gmail.com.
⁵ Solutions, University of Southampton, Southampton SO17 1BJ, UK. e.vataga@soton.ac.uk.
⁶ Public Health England, National Infection Service, Porton Down Salisbury, SP4 0JG, UK. yper.hall@phe.gov.uk.
⁷ The Jenner Institute, University of Oxford, Oxford OX3 7DQ, UK. elena.stylianou@ndm.ox.ac.uk.
⁸ The Jenner Institute, University of Oxford, Oxford OX3 7DQ, UK. helen.mcshane@ndm.ox.ac.uk.
⁹ Public Health England, National Infection Service, Porton Down Salisbury, SP4 0JG, UK. ann.rawkins@phe.gov.uk.
¹⁰ Department of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK. mn@ecs.soton.ac.uk.
¹¹ Faculty of Medicine, University of Southampton, Southampton SO17 1BJ, UK. c.h.woelk@soton.ac.uk.

PMID: 28157153
PMCID: PMC5343848
DOI: 10.3390/ijms18020312

Ashley I Heinson et al. Int J Mol Sci. 2017 Feb 1;18(2):312.

Abstract

Reverse vaccinology (RV) is a bioinformatics approach that can predict antigens with protective potential from the protein coding genomes of bacterial pathogens for subunit vaccine design. RV has become firmly established following the development of the BEXSERO® vaccine against Neisseria meningitidis serogroup B. RV studies have begun to incorporate machine learning (ML) techniques to distinguish bacterial protective antigens (BPAs) from non-BPAs. This research contributes significantly to the RV field by using permutation analysis to demonstrate that a signal for protective antigens can be curated from published data. Furthermore, the effects of the following on an ML approach to RV were also assessed: nested cross-validation, balancing selection of non-BPAs for subcellular localization, increasing the training data, and incorporating greater numbers of protein annotation tools for feature generation. These enhancements yielded a support vector machine (SVM) classifier that could discriminate BPAs (n = 200) from non-BPAs (n = 200) with an area under the curve (AUC) of 0.787. In addition, hierarchical clustering of BPAs revealed that intracellular BPAs clustered separately from extracellular BPAs. However, no immediate benefit was derived when training SVM classifiers on data sets exclusively containing intra- or extracellular BPAs. In conclusion, this work demonstrates that ML classifiers have great utility in RV approaches and will lead to new subunit vaccines in the future.
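The nested cross-validation named in the abstract can be illustrated with a small index-bookkeeping sketch. It only shows how inner folds are drawn from each outer training set, so that feature selection and tuning never touch the held-out outer fold. The function names `k_folds` and `nested_splits`, the inner fold count, and the seed are assumptions made for illustration and are not taken from the paper.

```python
import random


def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_splits(n_samples, outer_k=10, inner_k=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    inner_splits holds (inner_train, inner_val) pairs drawn only from
    outer_train, so model selection never sees outer_test.
    """
    rng = random.Random(seed)
    outer = k_folds(range(n_samples), outer_k, rng)
    for i, outer_test in enumerate(outer):
        outer_train = [s for j, fold in enumerate(outer) if j != i for s in fold]
        inner = k_folds(outer_train, inner_k, rng)
        inner_splits = []
        for m, inner_val in enumerate(inner):
            inner_train = [s for j, fold in enumerate(inner) if j != m for s in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits


# 400 samples, matching the 200 BPAs + 200 non-BPAs in the abstract.
splits = list(nested_splits(400))
```

In a full pipeline, feature selection would run on the inner splits and the chosen model would be scored once on `outer_test`.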

Keywords: bacterial pathogen; bacterial protective antigen; machine learning; reverse vaccinology; support vector machine.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**Figure 1**
(A) Difference in area under the curve (AUC) between the support vector machine (SVM) classifier trained on BPAD200+N+B+AF and classifiers trained on randomly permuted data, plotted against increasing numbers of features. SVM classifiers were trained to discriminate bacterial protective antigens (BPAs) from non-BPAs in BPAD200+N+B+AF, and receiver operating characteristic (ROC) curves were generated from a nested leave-tenth-out cross-validation (LTOCV) approach for different numbers of features selected by greedy backward feature elimination. Five iterations were performed to assess the random breakage of ties during greedy backward feature elimination, and AUC was averaged across iterations for each feature set. This analysis was then repeated for five data sets in which the BPA and non-BPA labels were randomly permuted, and the average AUC was calculated across the permuted data sets for each feature set; (B) ROC curves for the average of the five iterations of the 10-feature SVM classifier derived from BPAD200+N+B+AF (black solid line) and from each of the five randomly permuted data sets (dotted grey lines).
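The comparison against permuted labels described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `evaluate` is a hypothetical callback that should train and cross-validate a classifier on the labels it receives and return an AUC. Here it is replaced by a toy stand-in that scores a single made-up feature, and the default of five permutations simply mirrors the five permuted data sets mentioned above.

```python
import random


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_gap(evaluate, labels, n_permutations=5, seed=0):
    """Return (true AUC, mean AUC over label permutations, their difference)."""
    rng = random.Random(seed)
    true_auc = evaluate(labels)
    permuted_aucs = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # break any real link between features and labels
        permuted_aucs.append(evaluate(shuffled))
    mean_permuted = sum(permuted_aucs) / len(permuted_aucs)
    return true_auc, mean_permuted, true_auc - mean_permuted


# Toy stand-in for the full SVM pipeline: one made-up feature used directly as the score.
toy_feature = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]


def evaluate(labels):
    return auc(toy_feature, labels)


true_auc, mean_permuted, gap = permutation_gap(evaluate, toy_labels)
```

A gap well above zero suggests the classifier is picking up a signal that disappears when the labels are shuffled.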

**Figure 2**
ROC curves were generated from SVM classifiers utilizing 10 features selected by greedy backward feature elimination in a leave-tenth-out cross-validation (LTOCV) approach. Averages were plotted across five iterations of SVM classifiers implemented to randomly break ties resulting from the greedy backward feature elimination procedure. The benchmark for assessing the modifications was a non-nested, non-balanced training data set of 136 BPAs and 136 non-BPAs annotated with 122 features from 19 protein annotation tools (BPAD136) [20]. Subsequent modifications were added in a stepwise fashion and included: a nested cross-validation approach (BPAD136+N), balanced selection of non-BPAs for predicted subcellular localization (BPAD136+N+B), increased size of training data (BPAD200+N+B), and additional features (525 total) derived from an increased number of protein annotation tools (BPAD200+N+B+AF).
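Greedy backward feature elimination with random tie-breaking, as mentioned in this caption, can be sketched generically. The caption does not say how candidate feature sets were scored, so `score_fn` is a hypothetical callback (in practice it might return a cross-validated AUC), and the toy weights below are invented purely to make the example runnable.

```python
import random


def greedy_backward_elimination(features, score_fn, n_keep, seed=0):
    """Drop one feature at a time until n_keep remain.

    At each step the feature whose removal leaves the highest score is
    dropped; when several removals tie, one is picked at random.
    """
    rng = random.Random(seed)
    selected = list(features)
    while len(selected) > n_keep:
        candidates = []
        for feature in selected:
            remaining = [f for f in selected if f != feature]
            candidates.append((score_fn(remaining), feature))
        best_score = max(score for score, _ in candidates)
        tied = [feature for score, feature in candidates if score == best_score]
        selected.remove(rng.choice(tied))  # random tie-breaking
    return selected


# Hypothetical scoring: the sum of made-up per-feature weights.
toy_weights = {"a": 3, "b": 2, "c": 0, "d": 0, "e": 1}


def toy_score(subset):
    return sum(toy_weights[f] for f in subset)


kept = greedy_backward_elimination(list(toy_weights), toy_score, n_keep=3)
```

Because ties are broken at random, different seeds can yield different feature sets, which is why results are averaged over several iterations in the caption above.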

**Figure 3**
Pie charts showing subcellular localization as predicted by PSORTb [3] for the numbers of BPAs and non-BPAs in the following subsets of the BPAD136 dataset. (A) positive training data (i.e., 136 BPAs); (B) negative training data (i.e., 136 non-BPAs); and (C) negative training data balanced for subcellular localization (i.e., 136 non-BPAs).
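Balancing the negative training data for subcellular localization, as in panel (C), amounts to stratified sampling: for each localization, draw as many non-BPAs as there are BPAs with that localization. The sketch below assumes localization labels are already available (for example from PSORTb); the protein identifiers, the function name `balanced_negatives`, and the toy data are hypothetical.

```python
import random
from collections import Counter


def balanced_negatives(positive_locs, negative_pool, seed=0):
    """Sample negatives so their localization counts match the positives.

    positive_locs: localization label of each positive protein.
    negative_pool: (protein_id, localization) pairs to sample from.
    """
    rng = random.Random(seed)
    target = Counter(positive_locs)
    by_loc = {}
    for protein_id, loc in negative_pool:
        by_loc.setdefault(loc, []).append(protein_id)
    chosen = []
    for loc, count in target.items():
        available = by_loc.get(loc, [])
        if len(available) < count:
            raise ValueError(f"not enough negatives with localization {loc!r}")
        chosen.extend((pid, loc) for pid in rng.sample(available, count))
    return chosen


# Toy data with made-up identifiers.
positive_locs = ["outer membrane", "outer membrane", "cytoplasmic", "extracellular"]
negative_pool = [
    ("neg1", "cytoplasmic"), ("neg2", "cytoplasmic"), ("neg3", "cytoplasmic"),
    ("neg4", "outer membrane"), ("neg5", "outer membrane"), ("neg6", "outer membrane"),
    ("neg7", "extracellular"), ("neg8", "periplasmic"),
]
negatives = balanced_negatives(positive_locs, negative_pool)
```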

**Figure 4**
Hierarchical clustering of 142 BPAs from BPAD200+N+B+AF using all 525 annotation features. Distances between BPAs were calculated using the Euclidean metric, and the BPAs were then clustered using the Ward algorithm. White labels at the branch tips refer to BPAs with subcellular localization predicted by PSORTb [3] as intracellular (i.e., cytoplasm or cytoplasmic membrane) and black labels to extracellular BPAs (i.e., extracellular, periplasmic, outer membrane, cell wall).
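Ward clustering on Euclidean distances can be illustrated with a naive pure-Python sketch. At each step it merges the two clusters whose union gives the smallest increase in within-cluster sum of squares, which for clusters A and B with centroids c_A and c_B equals |A||B| / (|A| + |B|) times the squared distance between c_A and c_B. The four 2-D points are invented stand-ins for the 525-feature annotation vectors, and the function name `ward_merges` is an assumption; a real analysis would normally use an established clustering library.

```python
def ward_merges(points):
    """Agglomerative clustering with Ward's criterion.

    Returns the merges in order as (members_a, members_b, cost), where cost
    is the increase in within-cluster sum of squares caused by the merge.
    """
    # cluster id -> (member indices, centroid)
    clusters = {i: ([i], list(p)) for i, p in enumerate(points)}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a_members, a_centroid = clusters[ids[x]]
                b_members, b_centroid = clusters[ids[y]]
                na, nb = len(a_members), len(b_members)
                sq_dist = sum((u - v) ** 2 for u, v in zip(a_centroid, b_centroid))
                cost = na * nb / (na + nb) * sq_dist
                if best is None or cost < best[0]:
                    best = (cost, ids[x], ids[y])
        cost, a, b = best
        a_members, a_centroid = clusters.pop(a)
        b_members, b_centroid = clusters.pop(b)
        na, nb = len(a_members), len(b_members)
        # The merged centroid is the size-weighted mean of the two centroids.
        centroid = [(na * u + nb * v) / (na + nb) for u, v in zip(a_centroid, b_centroid)]
        clusters[next_id] = (a_members + b_members, centroid)
        merges.append((sorted(a_members), sorted(b_members), cost))
        next_id += 1
    return merges


# Two tight, well-separated toy groups.
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
merges = ward_merges(toy_points)
```

In this toy case the two tight pairs merge first and the two groups are joined only at the final, most costly step, which is the kind of separation the dendrogram in this figure displays for intracellular and extracellular BPAs.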

**Figure 5**
(A) ROC curves obtained from SVM classifiers trained to distinguish BPAs from non-BPAs in the following data sets: iBPAD51 (dotted line), eBPAD91 (solid grey line) and BPAD200+N+B+AF (black line). Curves were drawn by averaging results from five iterations of SVM classifiers consisting of 10 features selected by greedy backward feature elimination assessed in a LTOCV approach; (B) Plot showing the average percentage accuracy (five iterations) of SVM classifiers of 10 features trained on different sized subsets of BPAD200+N+B+AF for comparison to SVM classifiers derived from iBPAD51 and eBPAD91.

References

1. Pizza M., Scarlato V., Masignani V., Giuliani M.M., Arico B., Comanducci M., Jennings G.T., Baldi L., Bartolini E., Capecchi B., et al. Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science. 2000;287:1816–1820. doi: 10.1126/science.287.5459.1816.
2. Crum-Cianflone N., Sullivan E. Meningococcal Vaccinations. Infect. Dis. Ther. 2016;5:89–112. doi: 10.1007/s40121-016-0107-0.
3. Yu N.Y., Wagner J.R., Laird M.R., Melli G., Rey S., Lo R., Dao P., Sahinalp S.C., Ester M., Foster L.J. PSORTb 3.0: Improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics. 2010;26:1608–1615.
4. Corpet F., Servant F., Gouzy J., Kahn D. ProDom and ProDom-CG: Tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 2000;28:267–269. doi: 10.1093/nar/28.1.267.
5. Henikoff S., Henikoff J.G., Pietrokovski S. Blocks+: A non-redundant database of protein alignment blocks derived from multiple compilations. Bioinformatics. 1999;15:471–479. doi: 10.1093/bioinformatics/15.6.471.
