PeerJ. 2025 Jan 30:13:e18863.
doi: 10.7717/peerj.18863. eCollection 2025.

Prediction of influenza A virus-human protein-protein interactions using XGBoost with continuous and discontinuous amino acids information

Binghua Li et al. PeerJ.

Abstract

Influenza A virus (IAV) is highly infectious and highly pathogenic, which makes IAV infection a serious public health threat. Identifying protein-protein interactions (PPIs) between IAV and human proteins helps in understanding the mechanism of viral infection and in designing antiviral drugs. In this article, we developed a sequence-based machine learning method for predicting PPIs. First, we applied a new negative sample construction method to establish a high-quality IAV-human PPI dataset. Then we used conjoint triad (CT) and Moran autocorrelation (Moran) descriptors to encode biologically relevant features. Combining the two exploits the complementary information carried by contiguous and discontinuous amino acids and provides a more comprehensive description of PPI information. After comparing different machine learning models, we selected the eXtreme Gradient Boosting (XGBoost) model as the final prediction model. The model achieved an accuracy of 96.89%, a precision of 98.79%, a recall of 94.85%, and an F1-score of 96.78%. Finally, we identified 3,269 potential target proteins. Gene ontology (GO) and pathway analysis showed that these genes were highly associated with IAV infection. Analysis of the PPI network further revealed that the predicted proteins were classified as core proteins within the human protein interaction network. This study may encourage the identification of potential targets for the discovery of more effective anti-influenza drugs. The source code and datasets are available at https://github.com/HVPPIlab/IVA-Human-PPI/.
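
As a minimal sketch of the conjoint triad (CT) encoding named in the abstract, the Python below maps each residue to one of seven classes, counts every window of three consecutive classes, and scales the 343 counts. The seven-class grouping and the (count - min) / max scaling follow the commonly used formulation of this descriptor and are assumptions here, since the abstract does not spell them out. `conjoint_triad` and `encode_pair` are hypothetical helper names, and concatenating the viral and human vectors is one plausible way to represent a protein pair, not necessarily the authors' choice.

```python
from itertools import product

# Seven residue classes commonly used for the conjoint triad descriptor
# (an assumption; the abstract does not list the grouping).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
RESIDUE_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}

# Every ordered triple of classes gets a fixed position: 7 * 7 * 7 = 343.
TRIAD_INDEX = {triad: i for i, triad in enumerate(product(range(7), repeat=3))}


def conjoint_triad(sequence):
    """Return a 343-dimensional conjoint triad vector for one protein sequence."""
    classes = [RESIDUE_TO_CLASS.get(aa) for aa in sequence.upper()]
    counts = [0] * len(TRIAD_INDEX)
    # Slide a window of three consecutive residues along the sequence.
    for window in zip(classes, classes[1:], classes[2:]):
        if None in window:
            continue  # skip windows containing non-standard residues
        counts[TRIAD_INDEX[window]] += 1
    lowest, highest = min(counts), max(counts)
    if highest == 0:
        return [0.0] * len(counts)
    # Scale so that sequence length does not dominate the feature values.
    return [(c - lowest) / highest for c in counts]


def encode_pair(virus_sequence, human_sequence):
    """Hypothetical pair encoding: concatenate the two per-protein vectors."""
    return conjoint_triad(virus_sequence) + conjoint_triad(human_sequence)
```

Under these assumptions, a call such as `encode_pair(virus_sequence, human_sequence)` yields a 686-dimensional vector per candidate pair, which could then be concatenated with other descriptors before being passed to a classifier.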

Keywords: GO and KEGG; Influenza A virus; Machine learning; Pathogen-host interaction (PHI); Protein-protein interaction (PPI); XGBoost.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1. Schematic framework of our research.
The specific steps are as follows. Step 1: Data preparation. A training dataset was gathered for model construction and training, and an independent dataset was constructed to predict potential PPIs. Step 2: Feature extraction and feature fusion. Conjoint triad and Moran autocorrelation descriptors were used to convert the protein sequences into feature vectors and extract PPI information from the sequences. Step 3: Model construction and training. Five-fold cross-validation was used to train the XGBoost model with optimal parameters. Step 4: Prediction and results analysis. The trained model was used to predict potential PPIs, and systems biology analysis was performed on the predicted results.
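
The Moran autocorrelation descriptor used in Step 2 can be sketched as below. For a property value P_i at each sequence position, the lag-d feature is the mean product of centered values d positions apart, divided by the variance of the property over the whole sequence, which is how the descriptor captures relationships between residues that are not adjacent. The Kyte-Doolittle hydropathy scale, the default `max_lag`, and the function name `moran_autocorrelation` are illustrative assumptions; the caption does not state which physicochemical properties or lags the authors used.

```python
# Kyte-Doolittle hydropathy values, used here only as an example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moran_autocorrelation(sequence, max_lag=3, scale=KYTE_DOOLITTLE):
    """Return Moran autocorrelation values for lags 1..max_lag."""
    # Residues missing from the scale are dropped before computing lags.
    values = [scale[aa] for aa in sequence.upper() if aa in scale]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        return [0.0] * max_lag  # constant property: no autocorrelation signal
    features = []
    for lag in range(1, max_lag + 1):
        # Average product of centered values that are `lag` positions apart.
        covariance = sum(
            (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
        ) / (n - lag)
        features.append(covariance / variance)
    return features
```

In a fused representation, these lag features would typically be computed for several properties and appended to the conjoint triad vector of the same protein, though the exact fusion used in the study is not described in this caption.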
Figure 2. Flow chart of positive sample construction.
Figure 3. The workflow of degree and dissimilarity-based negative sampling.
Figure 4. ACC, MCC, and F1-score of different features from different categories on independent dataset.
Figure 5. The comparison of accuracy, recall, F1-score and MCC of best-performing features in each category with CT+Moran.
Figure 6. Accuracy of each fold in the five-fold cross-validation of different models.
Figure 7. Accuracy, Precision, Recall, F1-score and MCC for different models on independent dataset.
Figure 8. (A–D) Results of GO and KEGG pathway analysis.
