BMC Public Health. 2017 Nov 28;17(1):907.
doi: 10.1186/s12889-017-4914-3.

A systematic review of data mining and machine learning for air pollution epidemiology

Colin Bellinger et al. BMC Public Health.

Abstract

Background: Data measuring airborne pollutants, public health and environmental factors are increasingly being stored and merged. These big datasets offer great potential, but also challenge traditional epidemiological methods. This has motivated the exploration of alternative methods to make predictions, find patterns and extract information. To this end, data mining and machine learning algorithms are increasingly being applied to air pollution epidemiology.

Methods: We conducted a systematic literature review on the application of data mining and machine learning methods in air pollution epidemiology. We carried out our search process in PubMed, the MEDLINE database and Google Scholar. Research articles applying data mining and machine learning methods to air pollution epidemiology were queried and reviewed.

Results: Our search queries returned 400 research articles. Applying our inclusion/exclusion criteria in a fine-grained analysis reduced the results to 47 articles, which we separate into three primary areas of interest: 1) source apportionment; 2) forecasting/prediction of air pollution/quality or exposure; and 3) generating hypotheses. Early applications favored artificial neural networks. In more recent work, decision trees, support vector machines, k-means clustering and the APRIORI algorithm have been widely applied. Our survey shows that the majority of the research has been conducted in Europe, China and the USA, and that data mining is becoming an increasingly common tool in environmental health. As potential new directions, we identified deep learning and geospatial pattern mining as two burgeoning areas of data mining with good potential for future applications in air pollution epidemiology.
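
To make the association-mining approach named in the results concrete, the following is a minimal sketch of APRIORI-style frequent itemset mining in Python. It is a simplified teaching version, not the implementation used in any reviewed article. The records, the labels (`high_pm25`, `high_no2`, `asthma_visit`, `cold_day`) and the support threshold are hypothetical toy values, not data from the review.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support
    (fraction of transactions containing it) is >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    # Level 1 candidates: every distinct single item.
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]

    frequent = {}
    k = 1
    while candidates:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)

        # Join frequent k-itemsets into (k+1)-itemset candidates, then
        # prune any candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical toy records: each set lists conditions observed on one day.
records = [
    {"high_pm25", "high_no2", "asthma_visit"},
    {"high_pm25", "asthma_visit"},
    {"high_no2"},
    {"high_pm25", "high_no2", "asthma_visit"},
    {"cold_day"},
]

frequent_sets = apriori(records, min_support=0.5)
for itemset, support in sorted(frequent_sets.items(), key=lambda kv: -len(kv[0])):
    print(sorted(itemset), support)
```

In this toy example, the only pair that survives the threshold is `high_pm25` together with `asthma_visit`. In a hypothesis-generating workflow, a co-occurrence of this kind would be a candidate for follow-up with conventional epidemiological methods, not evidence of an effect in itself.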

Conclusions: We carried out a systematic review identifying the current trends, challenges and new directions to explore in the application of data mining methods to air pollution epidemiology. This work shows that data mining is increasingly being applied in air pollution epidemiology. The potential to support air pollution epidemiology continues to grow with advancements in data mining related to temporal and geospatial mining, and deep learning. This growth is further supported by new sensors and storage media that enable larger, higher-quality datasets. This suggests that many more fruitful applications can be expected in the future.

Keywords: Air pollution; Association mining; Big data; Data mining; Epidemiology; Exposure; Machine learning.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
PRISMA flow diagram. Overview of the PRISMA results from our search process.
Fig. 2
Publications per country. The number of publications per country shows a predominance in the field of European countries, the USA and China.
Fig. 3
Publications per year. Number of articles per year between January 2000 and October 2017. We identified an apparent trend towards more publications on data mining and epidemiology in recent years.
