Bioinformatics. 2018 Dec 1;34(23):4007-4016. doi: 10.1093/bioinformatics/bty451.

ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides


Leyi Wei et al. Bioinformatics.

Abstract

Motivation: Anti-cancer peptides (ACPs) have recently emerged as promising therapeutic agents for cancer treatment. Due to the avalanche of protein sequence data in the post-genomic era, there is an urgent need to develop automated computational methods to enable fast and accurate identification of novel ACPs within the vast number of candidate proteins and peptides.

Results: To address this, we propose a novel predictor named Anti-Cancer peptide Predictor with Feature representation Learning (ACPred-FL) for accurate prediction of ACPs based on sequence information. More specifically, we develop an effective feature representation learning model with which we extract and learn a set of informative features from a pool of support vector machine-based models trained on sequence-based feature descriptors. In this way, the class label information of the data samples is fully utilized. To refine the feature representation, we further employ a two-step feature selection technique, yielding a highly informative five-dimensional feature vector for the final peptide representation. Experimental results show that these five features provide greater discriminative power for identifying ACPs than currently available feature descriptors, highlighting the effectiveness of the proposed feature representation learning approach. The developed ACPred-FL method significantly outperforms state-of-the-art methods.
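The descriptor pool that feeds the feature representation learning model is built from standard sequence-based encodings. As a hedged illustration (the paper uses seven descriptor types with varied parameters; this single descriptor is only a representative example), amino acid composition maps a peptide to a 20-dimensional frequency vector:

```python
# Sketch of one common sequence-based feature descriptor: amino acid
# composition (AAC). This is an illustrative stand-in, not the paper's
# full seven-descriptor pool.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(peptide: str) -> list[float]:
    """Return the 20-dimensional residue-frequency vector of a peptide."""
    n = len(peptide)
    return [peptide.count(aa) / n for aa in AMINO_ACIDS]

features = amino_acid_composition("FAKKFAKKF")  # hypothetical peptide
```

Each descriptor of this kind produces one feature group; varying descriptor parameters multiplies the groups that enter the initial feature pool.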

Availability and implementation: The web-server of ACPred-FL is available at http://server.malab.cn/ACPred-FL.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
Flowchart of ACPred-FL. The workflow comprises three major steps. First, given protein primary sequences as input, each sequence is scanned residue by residue with a peptide window of m residues to generate numerous peptides; peptides identical to others are filtered out. Second, the remaining peptides are subjected to the feature representation learning scheme, and each is encoded as a five-dimensional feature vector. Third, the resulting feature vectors are fed into a predictive model trained with an SVM classifier on the ACP500 dataset. Ultimately, the SVM model generates a prediction score between 0 and 1 for each peptide; the predictor labels peptides with scores above 0.5 as potential ACPs and the rest as non-ACPs
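The windowing and thresholding steps in the flowchart can be sketched as follows. The scoring function here is a hypothetical placeholder for the trained SVM's prediction score; only the window scan, duplicate filtering, and 0.5 cutoff follow the caption:

```python
# Sketch of Fig. 1's first and last steps: sliding-window peptide
# generation with duplicate filtering, then thresholding a prediction
# score at 0.5 to call potential ACPs.
def generate_peptides(sequences, m):
    """Scan each sequence with a window of m residues; drop duplicates."""
    seen, peptides = set(), []
    for seq in sequences:
        for i in range(len(seq) - m + 1):
            pep = seq[i:i + m]
            if pep not in seen:  # peptides identical to others are filtered out
                seen.add(pep)
                peptides.append(pep)
    return peptides

def classify(peptides, score_fn, threshold=0.5):
    """Label each peptide by its prediction score (stand-in for the SVM)."""
    return {p: ("ACP" if score_fn(p) > threshold else "non-ACP")
            for p in peptides}

# Hypothetical toy sequences and a toy lysine-fraction score.
peps = generate_peptides(["MKKLLKK", "KKLLKKA"], m=5)
labels = classify(peps, lambda p: p.count("K") / len(p))
```

The overlapping windows of the second sequence are deduplicated against those of the first, so only genuinely new peptides enter the prediction stage.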
Fig. 2.
The proposed feature representation learning scheme. First, peptide sequences are subjected to feature representation using seven feature descriptors; to incorporate sufficient information, we vary the parameters of the feature descriptors and generate 40 feature groups to form the initial feature pool. Second, the resulting feature groups are fed into well-trained SVM models to predict class labels. Finally, the predicted labels (0/1) from the SVM models are concatenated to generate a new feature vector representing each peptide sequence
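The label-concatenation idea can be sketched minimally. Each "model" below is a hypothetical stand-in (a nearest-centroid rule per feature group) for the paper's trained SVMs; the point illustrated is only that each group model emits a 0/1 prediction and the predictions are concatenated into a new, compact feature vector:

```python
# Minimal sketch of feature representation learning by label concatenation.
# centroid_model is an assumed stand-in classifier, not the paper's SVM.
def centroid_model(pos, neg):
    """Return a 0/1 predictor built from class centroids of one feature group."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    cp, cn = mean(pos), mean(neg)
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return lambda x: 1 if dist(x, cp) < dist(x, cn) else 0

def learned_features(x_groups, models):
    """Concatenate each group model's predicted label into one vector."""
    return [m(x) for m, x in zip(models, x_groups)]

# Two hypothetical feature groups with toy training points.
m1 = centroid_model([[1, 1], [2, 2]], [[-1, -1], [-2, -2]])
m2 = centroid_model([[1], [2]], [[-1], [-2]])
new_vec = learned_features([[1.5, 1.5], [-3]], [m1, m2])
```

With 40 trained group models, the same construction yields the 40-dimensional learned representation that the subsequent feature selection step then prunes.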
Fig. 3.
Predictive performance of different feature descriptors in 10-fold cross-validation and independent tests. (A) ROC curves illustrating the 10-fold cross-validation performance of three types of feature descriptors. (B) ROC curves illustrating the independent test performance of the same three types of feature descriptors
Fig. 4.
Predictive performance of models based on different classifiers. (A) ROC curves illustrating the 10-fold cross-validation performance of the proposed features with three different classifiers (NB, RF and SVM). (B) ROC curves illustrating the independent test performance of the proposed features with the same three classifiers
Fig. 5.
mRMR feature selection of the proposed features. (A) The classification importance scores for the 40 generated features. Note that ‘fea1’ denotes the 1st feature among all the generated features. (B) SFS curve for the predictive model with respect to the ACC and MCC. The x- and y-axis represent the feature number t (ranging from 1 to 40) and the predictive performance, respectively. The blue and orange plots represent the SFS curves of ACC and MCC, respectively (Color version of this figure is available at Bioinformatics online.)
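The SFS curve in panel (B) comes from growing the feature subset one ranked feature at a time and scoring each subset size. A hedged sketch of that loop, with a toy scoring function standing in for the paper's cross-validated ACC/MCC evaluation:

```python
# Sketch of sequential forward selection (SFS) over mRMR-ranked features.
# score_subset is a hypothetical placeholder for cross-validated ACC/MCC.
def sequential_forward_selection(ranked_features, score_subset):
    """Add ranked features one by one; keep the best-scoring subset."""
    best_subset, best_score = [], float("-inf")
    subset = []
    for f in ranked_features:        # features in mRMR importance order
        subset.append(f)
        s = score_subset(subset)
        if s > best_score:
            best_subset, best_score = list(subset), s
    return best_subset, best_score

# Toy scorer that peaks at a subset of five features, mimicking the
# five-dimensional optimum reported in the paper.
ranked = list(range(40))
subset, score = sequential_forward_selection(
    ranked, lambda s: -(len(s) - 5) ** 2)
```

Plotting `score_subset` against subset size t (1 to 40) reproduces the shape of the SFS curves in the figure.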
Fig. 6.
Distribution of the positive and negative samples with respect to different feature descriptors. (A)-(F) show the distributions of BPF (k=2), GDC (g=3), OPF (k=1), BPF (k=7), CTD, and the proposed features, respectively. At each sampling step, 90% of the positive samples (ACPs) and negative samples (non-ACPs) were randomly selected; this procedure was repeated 20 times to obtain sub-sample average feature vectors. On each feature dimension, we calculated the mean and SD of the feature vectors
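The subsampling analysis described in the caption can be sketched directly: draw 90% of the samples 20 times, average the feature vectors within each draw, then compute the per-dimension mean and SD across draws. Function and parameter names are illustrative:

```python
# Sketch of the repeated 90% subsampling used for Fig. 6: per draw,
# average the feature vectors; across draws, report mean and SD per
# feature dimension.
import random
import statistics

def subsample_stats(samples, frac=0.9, repeats=20, seed=0):
    rng = random.Random(seed)            # fixed seed for reproducibility
    k = max(1, int(len(samples) * frac))
    draws = []
    for _ in range(repeats):
        draw = rng.sample(samples, k)
        draws.append([sum(col) / k for col in zip(*draw)])  # average vector
    dims = list(zip(*draws))
    return ([statistics.mean(d) for d in dims],
            [statistics.stdev(d) for d in dims])

# Degenerate toy input: identical samples give zero spread across draws.
means, sds = subsample_stats([[1.0, 0.0]] * 10)
```

A small SD on a dimension indicates that the descriptor's value there is stable across subsamples, which is what makes the class distributions in the figure comparable.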
Fig. 7.
Performance comparison of the proposed ACPred-FL and four state-of-the-art predictors. (A) Ten-fold cross-validation results of ACPred-FL and the four existing predictive models on the ACP500 dataset. (B) ROC curves of ACPred-FL and the four existing predictive models on the ACP500 dataset. (C) Independent test results of ACPred-FL and the four existing predictive models on the ACP164 dataset. (D) ROC curves of ACPred-FL and the four existing predictive models on the ACP164 dataset
Fig. 8.
Performance comparison of the proposed ACPred-FL and four state-of-the-art predictors on Tyagi's dataset. (A) Ten-fold cross-validation results of ACPred-FL and the four existing predictive models on the training set of Tyagi's dataset. (B) ROC curves of ACPred-FL and the four existing predictive models on the training set of Tyagi's dataset. (C) Independent test results of ACPred-FL and the four existing predictive models on the testing set of Tyagi's dataset. (D) ROC curves of ACPred-FL and the four existing predictive models on the testing set of Tyagi's dataset
