Computational analysis and prediction of PE_PGRS proteins using machine learning

Fuyi Li¹, Xudong Guo², Dongxu Xiang³, Miranda E Pitt¹, Arnold Bainomugisa⁴, Lachlan J M Coin¹

Affiliations

¹ Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC 3000, Australia.
² School of Information Engineering, Ningxia University, Yinchuan, Ningxia 750021, China.
³ Faculty of Engineering and Information Technology, The University of Melbourne, VIC 3000, Australia.
⁴ Queensland Mycobacterium Reference Laboratory, Brisbane, Australia.

PMID: 35140886
PMCID: PMC8804200
DOI: 10.1016/j.csbj.2022.01.019

Comput Struct Biotechnol J. 2022 Jan 22;20:662-674. doi: 10.1016/j.csbj.2022.01.019. eCollection 2022.

Abstract

Approximately 10% of the Mycobacterium tuberculosis genome comprises two families of genes that remain poorly characterised because of their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Because of their high genetic variability and the complexity of their analysis, these genes are typically excluded from further study in genomic research. Few online resources and homology-based computational tools can currently identify and analyse PE_PGRS proteins, and those that exist are computationally intensive, time-consuming, and lack sensitivity. Computational methods that can rapidly and accurately identify PE_PGRS proteins would therefore help to elucidate the functions of this family. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, for rapid and accurate identification of PE_PGRS proteins. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than the alignment-based approaches BLASTP and PHMMER in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.

Keywords: Bioinformatics; Machine learning; Mycobacterial; PE_PGRS; Sequence analysis.
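
As a rough illustration of the kind of sequence feature a predictor of this type could use, the sketch below computes amino acid composition, that is, the fraction of each of the 20 standard residues in a protein sequence. The abstract does not list PEPPER's actual features, so this is an assumed example only; the helper name `amino_acid_composition` and the toy sequence are hypothetical.

```python
from collections import Counter
from typing import Dict

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard characters (e.g. X or gap symbols) are ignored.
    Hypothetical helper for illustration; not taken from PEPPER.
    """
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = len(residues)
    # One entry per amino acid, so every sequence maps to a 20-dimensional vector.
    return {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}


# Toy, made-up fragment (not a real PE_PGRS sequence).
toy_sequence = "GGAGGNGGAGGAGGDGGS"
composition = amino_acid_composition(toy_sequence)
print(f"G fraction: {composition['G']:.3f}")
```

A fixed-length vector like this can be passed to any standard classifier, which is one reason composition-style features are a common starting point for alignment-free sequence classification.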

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Fig. 1**
The overall framework of PEPPER.

**Fig. 2**
Sequence analysis of known PE_PGRS proteins. (A) Distribution of all collected PE_PGRS proteins according to their protein sequence lengths. (B) Frequency distributions of 20 amino acids in all accumulated PE_PGRS proteins. (C) Sequence-logo of the N-terminal sequence of PE_PGRS proteins. (D) Sequence-logo of the C-terminal sequence of PE_PGRS proteins.

**Fig. 3**
Distribution and clustering of PE_PGRS and non-PE_PGRS proteins based on three groups of features and on all features combined. For each feature group, samples were clustered into two groups using the K-means algorithm; different clusters are shown in different colours. PE_PGRS and non-PE_PGRS proteins are shown with different markers: dots denote PE_PGRS proteins and multiplication signs denote non-PE_PGRS proteins. The inset bar chart in each sub-figure shows the sample distribution (PE_PGRS vs. non-PE_PGRS) in each cluster.

**Fig. 4**
(A) Performance evaluation and comparison of top five machine learning-based predictors with BLASTP and PHMMER. (B) Performance comparison results of two feature selection strategies on the training dataset. (C) Performance comparison results of two feature selection strategies on the independent test dataset. (D) Heatmap plot of the SHAP values for the top 20 important features on the independent test dataset.

**Fig. 5**
(A) Protein 3D structure of PE_PGRS26 (UniProt Accession: Q79FP3) predicted by AlphaFold2. (B) Protein 3D structure of PE_PGRS34 (UniProt Accession: P9WIF3) predicted by AlphaFold2. (C) Domain and disorder regions of two case study proteins. (D) Visualisation of the enriched Gene Ontology terms for the predicted PE_PGRS proteins.
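
The analysis described in the Fig. 3 caption, K-means clustering into two groups followed by a per-cluster count of PE_PGRS vs. non-PE_PGRS samples, can be sketched roughly as below. This is a minimal plain-Python version of Lloyd's algorithm with k = 2, not the authors' implementation. The function names, the deterministic initialisation from the first and last points, and the two-dimensional toy feature vectors are all assumptions made for illustration.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_two_clusters(points, init=(0, -1), max_iter=100):
    """Lloyd's k-means with k=2.

    `init` gives the indices of the two starting centroids (an assumption
    made here so that the example is deterministic).
    """
    centroids = [points[init[0]], points[init[1]]]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min((0, 1), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in (0, 1):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments


# Made-up 2-D feature vectors and labels, for illustration only.
points = [(0.60, 0.20), (0.55, 0.25), (0.62, 0.18),
          (0.08, 0.09), (0.07, 0.10), (0.10, 0.08)]
labels = ["PE_PGRS"] * 3 + ["non-PE_PGRS"] * 3

assignments = kmeans_two_clusters(points)
# Cross-tabulate cluster membership against the known labels,
# analogous to the inset bar charts described in the caption.
table = Counter(zip(assignments, labels))
for (cluster, label), count in sorted(table.items()):
    print(f"cluster {cluster}: {label} = {count}")
```

If a feature group separates the two classes well, each cluster is dominated by one label; a mixed table indicates that the features alone do not separate them.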
