Bioinformatics. 2019 Jun 1;35(12):2017-2028.
doi: 10.1093/bioinformatics/bty914.

Bastion3: a two-layer ensemble predictor of type III secreted effectors

Jiawei Wang et al. Bioinformatics.

Abstract

Motivation: Type III secreted effectors (T3SEs) can be injected into host cell cytoplasm via type III secretion systems (T3SSs) to modulate interactions between Gram-negative bacterial pathogens and their hosts. Because of their relevance to pathogen-host interactions, significant computational effort has been devoted to identifying T3SEs, and this in turn has stimulated new T3SE discoveries. However, as T3SEs with new characteristics are discovered, existing computational tools reveal important limitations: (i) most of the trained machine learning models are based on the N-terminus (in some cases also the C-terminus) rather than the proteins' complete sequences, and (ii) the underlying models (trained with classic algorithms) employ only a few features, most of which are extracted from sequence information alone. To achieve better T3SE prediction, we must identify more powerful, informative features and investigate how to integrate them effectively into a comprehensive model.

Results: In this work, we present Bastion3, a two-layer ensemble predictor developed to accurately identify type III secreted effectors from protein sequence data. In contrast with existing methods that employ single models with few features, Bastion3 explores a wide range of features of various types, trains single models on these features and finally integrates these models through ensemble learning. We trained the models using a new gradient boosting machine, LightGBM, and further boosted their performance through a novel genetic algorithm (GA)-based two-step parameter optimization strategy. Our benchmark test demonstrates that Bastion3 performs substantially better than commonly used methods, with an ACC value of 0.959, F-value of 0.958, MCC value of 0.917 and AUC value of 0.956, outperforming all other toolkits by more than 5.6% in ACC value, 5.7% in F-value, 12.4% in MCC value and 5.8% in AUC value. Based on our proposed two-layer ensemble model, we further developed a user-friendly online toolkit to make T3SE prediction convenient for experimental scientists. With its improved performance and its design to ease future discoveries of novel T3SEs, Bastion3 is poised to become a widely used, state-of-the-art toolkit for T3SE prediction.

Availability and implementation: http://bastion3.erc.monash.edu/.

Contact: selkrig@embl.de or wyztli@163.com or trevor.lithgow@monash.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.
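
For readers unfamiliar with the four metrics quoted in the Results (ACC, F-value, MCC and AUC), the following is a minimal sketch of how they are commonly computed for a binary classifier. It assumes the standard textbook definitions, and in particular that "F-value" means the F1 score; the abstract does not state the exact formulas used. The function names `binary_metrics` and `auc` and the toy labels are illustrative and are not taken from Bastion3.

```python
import math


def binary_metrics(y_true, y_pred):
    """Return ACC, F (assumed to be F1) and MCC for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    acc = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_value = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "F": f_value, "MCC": mcc}


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data only; these are not results from the paper.
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 1]
    scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
    print(binary_metrics(y_true, y_pred))
    print(auc(y_true, scores))
```

MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside ACC when the positive and negative classes differ in size.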


Figures

Fig. 1.
Overall framework of Bastion3. (A) The flowchart of Bastion3 development; (B) Detailed procedures for constructing the prediction models within Bastion3’s two-layer architecture and (C) Tackling the data imbalance problem by assigning a weight to each sample
Fig. 2.
The effect and performance comparison of two-step parameter optimization of different feature encoding methods, compared with one-step parameter optimization and initial parameter settings. The red star indicates the best performance amongst the three different parameter settings for each feature encoding method
Fig. 3.
Performance comparison of different types of feature encoding methods based on 100 repetitions of 5-fold cross-validation. (A) Embedding of different types of features using t-SNE (van der Maaten and Hinton, 2008). The red and grey dots represent T3SEs and non-T3SEs, respectively. A black-edged dot indicates that the sample was incorrectly predicted during the 100 repetitions of 5-fold cross-validation. (B) ROC curves and metrics for evaluating the performance of different types of feature encoding methods. The legends of the two panels are merged, with each feature encoding method denoted by the same color in both panels. The red star on top of the bar chart marks the best performance across feature encoding methods for each metric
Fig. 4.
Performance comparison between Bastion3 (using the final two-layer ensemble model) and six other existing methods for T3SE prediction on the independent test
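
The caption of Fig. 1(C) says only that data imbalance is handled by assigning a weight to each sample; the weighting formula itself is not given here. As a hedged illustration, the sketch below uses the common inverse-class-frequency heuristic, in which every class ends up with the same total weight. This is an assumption for illustration and not necessarily the scheme used in Bastion3, and the function name `balanced_sample_weights` is hypothetical.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class).

    Assumed heuristic (inverse class frequency); not confirmed as the
    weighting used in the paper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


if __name__ == "__main__":
    # Toy labels: one positive (T3SE) and three negatives (non-T3SE).
    labels = [1, 0, 0, 0]
    weights = balanced_sample_weights(labels)
    print(weights)
    # Total weight per class is equal after reweighting.
    for cls in sorted(set(labels)):
        print(cls, sum(w for w, y in zip(weights, labels) if y == cls))
```

Weights of this kind are typically handed to the learner as per-sample weights during training, for example through the `sample_weight` argument that LightGBM's scikit-learn-style `fit` method accepts.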

