2022 Jan 22;20:662-674.
doi: 10.1016/j.csbj.2022.01.019. eCollection 2022.

Computational analysis and prediction of PE_PGRS proteins using machine learning

Fuyi Li et al. Comput Struct Biotechnol J.

Abstract

Approximately 10% of the Mycobacterium tuberculosis genome consists of two gene families that remain poorly characterised because of their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Because of their high genetic variability and the complexity of analysing them, these genes are typically excluded from further research in genomic studies. Few online resources and homology-based computational tools can currently identify and analyse PE_PGRS proteins, and those that exist are computationally intensive, time-consuming, and lacking in sensitivity. Computational methods that can rapidly and accurately identify PE_PGRS proteins are therefore valuable for the functional elucidation of the PE_PGRS family. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, to allow users to identify PE_PGRS proteins rapidly and accurately. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than the alignment-based approaches BLASTP and PHMMER in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.

Keywords: Bioinformatics; Machine learning; Mycobacterial; PE_PGRS; Sequence analysis.
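
The abstract mentions sequence-derived features without listing them here. As a minimal sketch of what one such feature could look like, the Python below computes amino acid composition, the fraction of each of the 20 standard residues in a sequence. This is an illustrative assumption, not PEPPER's documented feature set; the function name `amino_acid_composition` and the toy sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`.

    Illustrative only: a common sequence-derived feature, not
    necessarily the encoding used by PEPPER.
    """
    seq = sequence.upper()
    # Ignore non-standard symbols such as X, B, Z or gap characters.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical toy sequence, not a real PE_PGRS protein.
toy = "MSGGAGGAGG"
comp = amino_acid_composition(toy)
print(comp["G"], comp["A"])
```

The resulting 20-value vector could be passed to any standard classifier. Fixed-length encodings of this kind are convenient because proteins of very different lengths map to the same number of features.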

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Graphical abstract
Fig. 1. The overall framework of PEPPER.
Fig. 2. Sequence analysis of known PE_PGRS proteins. (A) Distribution of all collected PE_PGRS proteins according to their protein sequence lengths. (B) Frequency distributions of the 20 amino acids in all collected PE_PGRS proteins. (C) Sequence logo of the N-terminal sequence of PE_PGRS proteins. (D) Sequence logo of the C-terminal sequence of PE_PGRS proteins.
Fig. 3. Distribution and clustering of PE_PGRS and non-PE_PGRS proteins based on three groups of features and all features. For each feature group, samples were clustered into two groups using the K-means algorithm; different clusters are represented by different colours. PE_PGRS and non-PE_PGRS proteins are shown with different markers: dots represent PE_PGRS proteins and multiplication signs represent non-PE_PGRS proteins. The inset bar chart in each sub-figure shows the sample distribution (PE_PGRS vs. non-PE_PGRS) in each cluster.
Fig. 4. (A) Performance evaluation and comparison of the top five machine learning-based predictors with BLASTP and PHMMER. (B) Performance comparison of two feature selection strategies on the training dataset. (C) Performance comparison of two feature selection strategies on the independent test dataset. (D) Heatmap of the SHAP values for the top 20 important features on the independent test dataset.
Fig. 5. (A) Protein 3D structure of PE_PGRS26 (UniProt Accession: Q79FP3) predicted by AlphaFold2. (B) Protein 3D structure of PE_PGRS34 (UniProt Accession: P9WIF3) predicted by AlphaFold2. (C) Domain and disorder regions of the two case study proteins. (D) Visualisation of the enriched Gene Ontology terms for the predicted PE_PGRS proteins.
