Front Bioeng Biotechnol. 2020 May 5;8:285.
doi: 10.3389/fbioe.2020.00285. eCollection 2020.

A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features

Changli Feng et al. Front Bioeng Biotechnol.

Abstract

Protein thermostability is a key consideration in enzyme engineering, and a method that can distinguish thermophilic from non-thermophilic proteins would aid enzyme design. In this study, we established a novel method that combines mixed features with machine learning to perform this classification. First, an amino acid reduction scheme was adopted to recode each amino acid sequence. Then, physicochemical characteristics, auto-cross covariance (ACC), and reduced dipeptides were calculated and integrated into a mixed feature set, which was processed using correlation analysis, feature selection, and principal component analysis (PCA) to remove redundant information. Finally, four machine learning methods were trained and evaluated on a dataset of 500 proteins randomly sampled from 915 thermophilic proteins and 500 proteins randomly sampled from 793 non-thermophilic proteins. Under 10-fold cross-validation, 98.2% of the thermophilic and non-thermophilic proteins were correctly identified. Moreover, our analysis of the retained and removed features identified the crucial, unimportant, and insensitive elements, providing essential information for enzyme design.

Keywords: machine learning methods; mixed features; non-thermophilic protein; reduced amino acids; thermophilic protein.
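
The recoding and reduced-dipeptide steps described in the abstract can be illustrated with a minimal Python sketch. The six-group reduction scheme below is a hypothetical grouping by broad physicochemical class, chosen only for illustration; the abstract does not state which scheme the authors used, and the names `GROUPS`, `reduce_sequence`, and `reduced_dipeptide_frequencies` are likewise assumptions rather than the authors' code.

```python
from collections import Counter
from itertools import product

# Hypothetical 6-group reduction scheme (an assumption for illustration,
# not the scheme used in the paper). Each key is the group's representative letter.
GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # conformationally special
}
# Map each of the 20 standard amino acids to its group representative.
REDUCE = {aa: rep for rep, members in GROUPS.items() for aa in members}
ALPHABET = sorted(GROUPS)


def reduce_sequence(seq):
    """Recode a protein sequence with the reduced alphabet.

    Raises KeyError for non-standard residues such as 'X'.
    """
    return "".join(REDUCE[aa] for aa in seq.upper())


def reduced_dipeptide_frequencies(seq):
    """Return the frequency of every reduced dipeptide (6 x 6 = 36 features)."""
    reduced = reduce_sequence(seq)
    pairs = [reduced[i:i + 2] for i in range(len(reduced) - 1)]
    counts = Counter(pairs)
    total = len(pairs)
    return {
        a + b: (counts[a + b] / total if total else 0.0)
        for a, b in product(ALPHABET, repeat=2)
    }


features = reduced_dipeptide_frequencies("MKVLDE")
```

With a reduced alphabet of six letters the dipeptide feature vector has 36 entries instead of the 400 needed for the full 20-letter alphabet, which is the kind of dimensionality saving that motivates recoding before feature extraction.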

Figures

FIGURE 1
The overall framework of the proposed method.
FIGURE 2
Model performance. (A) The first two dimensions of the features after compression with the t-SNE method; (B) the separating hypersurface of the SVM method; (C) the accuracy of the four models; (D) comparison with other methods.
FIGURE 3
Comparison of experimental results. (A) Receiver operating characteristic (ROC) curves of three methods; (B) results on the database of Fan et al. (2016).
FIGURE 4
The critical and removed features in the proposed method: (A) the most important features; (B) the deleted amino acid frequency features; (C) the deleted reduced dipeptides (I); (D) the deleted reduced dipeptides (II). The symbol “*” stands for any one of the 20 amino acids, such as “A”, “C”, or “P”. Likewise, “**” stands for any two-letter combination of the 20 amino acids, such as “AA”, “DC”, or “VP”.
