Epidemiol Infect. 2015 Oct;143(13):2786-94.
doi: 10.1017/S095026881500014X. Epub 2015 Feb 12.

Use of random forest to estimate population attributable fractions from a case-control study of Salmonella enterica serotype Enteritidis infections

W Gu et al. Epidemiol Infect. 2015 Oct.

Abstract

To design effective food safety programmes we need to estimate, from case-control studies, how many sporadic foodborne illnesses are caused by specific food sources. Logistic regression has substantive limitations for analysing structured questionnaire data with numerous exposures and missing values. We adapted random forest to analyse data from a case-control study of Salmonella enterica serotype Enteritidis illness for source attribution. To estimate summary population attributable fractions (PAFs) of exposures grouped into transmission routes, we devised a counterfactual estimator that predicts the reduction in illness associated with removing grouped exposures. For comparison, we fitted the data using logistic regression models with stepwise forward and backward variable selection. Our results show that forward and backward variable selection in the logistic regression models did not yield consistent parameter estimates and identified different significant exposures. By contrast, the random forest model produced estimated PAFs of grouped exposures consistent in rank order with results obtained from outbreak data, with egg-related exposures having the highest estimated PAF (22·1%, 95% confidence interval 8·5-31·8). Random forest might be structurally more coherent and efficient than logistic regression models for attributing Salmonella illnesses to sources involving many causal pathways.
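
The counterfactual idea in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: the abstract does not give the formula, so the function below assumes one simple form, the relative drop in summed predicted risk when a group of exposures is switched off. The names `counterfactual_reduction`, `toy_risk`, `removal_prob` and the exposure labels are hypothetical, and `toy_risk` is a hand-written stand-in for a fitted classifier such as a random forest. In a real case-control sample the predicted probabilities also reflect the case-to-control sampling ratio, which this sketch ignores.

```python
import random
from typing import Callable, Dict, Sequence

Record = Dict[str, int]  # exposure name -> 0/1 indicator (hypothetical layout)


def counterfactual_reduction(
    records: Sequence[Record],
    predict_risk: Callable[[Record], float],
    group: Sequence[str],
    removal_prob: float = 1.0,
    seed: int = 0,
) -> float:
    """Relative reduction in summed predicted risk when each exposure in
    `group` is set to 0 with probability `removal_prob`.

    With removal_prob=1.0 this is one simple PAF-style estimate for the group.
    """
    rng = random.Random(seed)
    observed = sum(predict_risk(r) for r in records)
    if observed == 0:
        return 0.0
    counterfactual = 0.0
    for r in records:
        altered = dict(r)  # copy so the observed data are left untouched
        for name in group:
            if altered.get(name, 0) == 1 and rng.random() < removal_prob:
                altered[name] = 0
        counterfactual += predict_risk(altered)
    return (observed - counterfactual) / observed


def toy_risk(record: Record) -> float:
    """Hypothetical additive risk model standing in for a fitted classifier."""
    risk = 0.1
    if record.get("eggs"):
        risk += 0.3
    if record.get("poultry"):
        risk += 0.1
    return risk


# Made-up illustrative records, not study data.
records = [
    {"eggs": 1, "poultry": 0},
    {"eggs": 1, "poultry": 1},
    {"eggs": 0, "poultry": 1},
    {"eggs": 0, "poultry": 0},
]

paf_eggs = counterfactual_reduction(records, toy_risk, ["eggs"])
paf_both = counterfactual_reduction(records, toy_risk, ["eggs", "poultry"])
half_eggs = counterfactual_reduction(records, toy_risk, ["eggs"], removal_prob=0.5)
```

Passing several exposure names in `group` mirrors the grouping of exposures into transmission routes, and varying `removal_prob` between 0 and 1 gives a curve of predicted reduction against the degree of exposure removal, in the spirit of the hypothetical interventions the paper models.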

Keywords: Causality; counterfactual; foodborne diseases; logistic regression; machine learning.


Figures

Fig. 1.
Permutation importance (blue circles) by mean decrease in classification accuracy of the random forest model [normalized by the standard deviation of the differences in classification accuracy of pre- and post-permutation out-of-bag (unused) data] and exposure frequency in cases (red-grey circles) of individual exposures measured.
Fig. 2.
Predicted percentage reduction of illness as a function of probabilistic reduction in grouped exposures based on counterfactual modelling of hypothetical interventions.
