BMC Infect Dis. 2022 Jul 28;22(1):655.
doi: 10.1186/s12879-022-07625-7.

Helicobacter pylori (H. pylori) risk factor analysis and prevalence prediction: a machine learning-based approach

Van Tran et al. BMC Infect Dis. 2022.

Abstract

Background: Although previous epidemiological studies have examined the potential risk factors that increase the likelihood of acquiring Helicobacter pylori infections, most of these analyses have utilized conventional statistical models, including logistic regression, and have not benefited from advanced machine learning techniques.

Objective: We examined H. pylori infection risk factors among school children using machine learning algorithms to identify important risk factors as well as to determine whether machine learning can be used to predict H. pylori infection status.

Methods: We applied feature selection and classification algorithms to data from a school-based cross-sectional survey in Ethiopia. The data set included 954 school children with 27 sociodemographic and lifestyle variables. We conducted five runs of tenfold cross-validation on the data. We combined the results of these runs for each combination of feature selection (e.g., Information Gain) and classification (e.g., Support Vector Machines) algorithms.
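The resampling scheme described in the Methods can be sketched as follows. This is a minimal illustration of five runs of stratified tenfold cross-validation in plain Python, not the authors' code: the function name `repeated_stratified_kfold`, the seed, the stratification by class, and the synthetic labels are all assumptions made for the example. In a full analysis, feature selection would be fitted on each training split only, so that the held-out fold does not influence which risk factors are chosen.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=5, seed=0):
    """Yield (run, fold, train_idx, test_idx) for repeated stratified k-fold CV.

    Hypothetical helper: each run reshuffles the samples within each class
    and deals them round-robin into n_splits folds, so class proportions
    are roughly preserved in every fold.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    for run in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0  # continue dealing where the previous class stopped
        for y in sorted(by_class):
            idx = by_class[y][:]
            rng.shuffle(idx)
            for j, i in enumerate(idx):
                folds[(j + offset) % n_splits].append(i)
            offset += len(idx)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for f, fold in enumerate(folds) if f != k for i in fold)
            yield run, k, train, test


# Synthetic labels (NOT the study data): 64 negative, 36 positive samples.
labels = [0] * 64 + [1] * 36
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # 5 runs x 10 folds
```

Each of the resulting train/test splits would be passed to every feature selection and classifier pair, and the per-split scores then combined across runs.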

Results: The XGBoost classifier predicted H. pylori infection status with the highest accuracy, 77%, a 13-percentage-point improvement over the baseline accuracy of guessing the most frequent class (64% of the samples were H. pylori negative). K-Nearest Neighbors showed the worst performance among all classifiers. Similar performance was observed using the F1-score and area under the receiver operating characteristic curve (AUROC) evaluation metrics. Among all features, place of residence (with urban residence increasing risk) was the most commonly selected risk factor for H. pylori infection, regardless of the feature selection method. Additionally, our machine learning algorithms identified other important risk factors for H. pylori infection, such as electricity usage in the home, toilet type, and waste disposal location. Using a 75% cutoff for robustness, machine learning identified five of the eight significant features found by traditional multivariate logistic regression. However, when a lower robustness threshold was used, machine learning approaches identified more H. pylori risk factors than multivariate logistic regression and suggested risk factors not detected by logistic regression.
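The evaluation metrics named in the Results can be illustrated with a short sketch. The labels and predictions below are synthetic and were constructed only so that the majority-class baseline is 64% and the accuracy is 77%, matching the reported figures; they are not the study data, and the helper names are hypothetical. The AUROC helper uses the pairwise definition (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, counting ties as one half).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always guessing the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auroc(y_true, scores, positive=1):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic example (NOT the study data): 64 negatives, 36 positives.
y_true = [0] * 64 + [1] * 36
# 57 true negatives, 7 false positives, 20 true positives, 16 false negatives.
y_pred = [0] * 57 + [1] * 7 + [1] * 20 + [0] * 16

base = majority_baseline(y_true)
acc = accuracy(y_true, y_pred)
print(f"baseline={base:.2f} accuracy={acc:.2f} gain={acc - base:.2f}")
print(f"F1={f1_score(y_true, y_pred):.3f}")
```

The gain is a difference between two percentages, so it is expressed in percentage points rather than as a relative improvement.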

Conclusion: This study provides evidence that machine learning approaches can uncover H. pylori infection risk factors and predict H. pylori infection status. These approaches identify risk factors similar to those found by logistic regression and predict infection with comparable accuracy, so they could be used as an alternative method.

Keywords: Classification; Ethiopia; Feature selection; H. pylori infection; Logistic regression; Machine learning; School children.

Conflict of interest statement

We declare that we do not have any conflicts of interest.

Figures

Fig. 1
Map of the study sites: Sululta and Ziway (Batu), located to the north and south of Addis Ababa, Ethiopia, respectively
Fig. 2
Average H. pylori prevalence prediction accuracy and F1-scores of machine learning classifiers using various feature selection methods. Maroon and blue colors represent high and low values, respectively, of accuracy (A) and F1-score (B). The numbers within each cell indicate the accuracy or F1-score of each classifier-feature selection method pair. KNN indicates K-Nearest Neighbors; SVM, Support Vector Machines; XGB, XGBoost; LR, Logistic Regression; NB, Naive Bayes; and RF, Random Forests. FULL indicates that all risk factors are used. IG indicates Information Gain; ReF, ReliefF; MRMR, Minimum Redundancy Maximum Relevance; CFS, Correlation-based Feature Selection; FCBF, Fast Correlation Based Filter; and SFFS, Sequential Floating Forward Selection. The suffixes -10 and -20 indicate the number of risk factors selected by the ranking-based feature selection methods. (C) Receiver Operating Characteristic (ROC) curves of the six classifiers (each using its best hyperparameter combination) when predicting H. pylori infection from the subset of risk factors selected by the IG-20 feature selection method. The area under the ROC curve (AUROC) was 0.76 for KNN, 0.79 for NB, and 0.78 for the other classifiers. The X-axis represents the False Positive Rate (1-Specificity), whereas the Y-axis represents the True Positive Rate (Sensitivity)
Fig. 3
The relative importance of H. pylori risk factors based on all feature selection methods. The X-axis indicates the H. pylori risk factors, summarized in Table 1. The Y-axis indicates the average probability of being selected across all feature selection methods. The error bars indicate one standard error across all cross-validation folds
Fig. 4
Two-dimensional hierarchical clustering heatmap of H. pylori risk factors and feature selection methods. Maroon and blue colors indicate more and less frequently selected features in five tenfold cross-validation runs, respectively. X-axis shows the H. pylori risk factors, summarized in Table 1. Y-axis indicates all feature selection methods. The risk factors found more frequently by feature selection methods appear on the heatmap's left columns. The feature selection methods that select the greatest number of risk factors appear on the heatmap's bottom rows. The risk factors grouped together suggest that they have been chosen similarly under varying feature selection methods. The feature selection methods grouped together indicate that these methods choose a similar set of risk factors
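The selection probabilities behind Figs. 3 and 4 and the robustness cutoff mentioned in the Results can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: "robustness" is assumed here to mean the fraction of cross-validation folds in which a feature is selected, the helper names are hypothetical, the feature names are shorthand for risk factors named in the abstract, and the per-fold selections are invented for the example.

```python
from math import sqrt
from statistics import stdev


def selection_frequency(selections, features):
    """Fraction of folds in which each feature was selected."""
    n = len(selections)
    return {f: sum(f in s for s in selections) / n for f in features}


def selection_standard_error(selections, feature):
    """Standard error of the 0/1 'was selected' indicator across folds."""
    indicators = [1.0 if feature in s else 0.0 for s in selections]
    return stdev(indicators) / sqrt(len(indicators))


def robust_features(freq, cutoff=0.75):
    """Features selected in at least `cutoff` of the folds, most frequent first."""
    kept = [f for f, p in freq.items() if p >= cutoff]
    return sorted(kept, key=lambda f: (-freq[f], f))


# Invented per-fold selections (NOT the study results), four folds only.
features = ["residence", "toilet_type", "electricity", "waste_disposal"]
selections = [
    {"residence", "toilet_type"},
    {"residence", "electricity"},
    {"residence", "toilet_type", "waste_disposal"},
    {"residence", "toilet_type"},
]

freq = selection_frequency(selections, features)
print(robust_features(freq, cutoff=0.75))
print(robust_features(freq, cutoff=0.25))
```

Lowering the cutoff admits features that are selected less consistently, which is how a lower robustness threshold can yield more candidate risk factors. Sorting features by frequency also gives the left-to-right column ordering described for the heatmap in Fig. 4.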
