Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Meta-Analysis
. 2024 Feb 2;24(1):33.
doi: 10.1186/s12911-024-02416-3.

The validity of electronic health data for measuring smoking status: a systematic review and meta-analysis

Affiliations
Meta-Analysis

The validity of electronic health data for measuring smoking status: a systematic review and meta-analysis

Md Ashiqul Haque et al. BMC Med Inform Decis Mak. .

Abstract

Background: Smoking is a risk factor for many chronic diseases. Multiple smoking status ascertainment algorithms have been developed for population-based electronic health databases such as administrative databases and electronic medical records (EMRs). Evidence syntheses of algorithm validation studies have often focused on chronic diseases rather than risk factors. We conducted a systematic review and meta-analysis of smoking status ascertainment algorithms to describe the characteristics and validity of these algorithms.

Methods: The Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines were followed. We searched articles published from 1990 to 2022 in EMBASE, MEDLINE, Scopus, and Web of Science with key terms such as validity, administrative data, electronic health records, smoking, and tobacco use. The extracted information, including article characteristics, algorithm characteristics, and validity measures, was descriptively analyzed. Sources of heterogeneity in validity measures were estimated using a meta-regression model. Risk of bias (ROB) in the reviewed articles was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 tool.

Results: The initial search yielded 2086 articles; 57 were selected for review and 116 algorithms were identified. Almost three-quarters (71.6%) of algorithms were based on EMR data. The algorithms were primarily constructed using diagnosis codes for smoking-related conditions, although prescription medication codes for smoking treatments were also adopted. About half of the algorithms were developed using machine-learning models. The pooled estimates of positive predictive value, sensitivity, and specificity were 0.843, 0.672, and 0.918 respectively. Algorithm sensitivity and specificity were highly variable and ranged from 3 to 100% and 36 to 100%, respectively. Model-based algorithms had significantly greater sensitivity (p = 0.006) than rule-based algorithms. Algorithms for EMR data had higher sensitivity than algorithms for administrative data (p = 0.001). The ROB was low in most of the articles (76.3%) that underwent the assessment.

Conclusions: Multiple algorithms using different data sources and methods have been proposed to ascertain smoking status in electronic health data. Many algorithms had low sensitivity and positive predictive value, but the data source influenced their validity. Algorithms based on machine-learning models for multiple linked data sources have improved validity.

Keywords: Algorithms; Electronic health records; Review; Routinely collected health data; Validation study.

PubMed Disclaimer

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Fig. 1
Three-level meta-regression model
Fig. 2
Fig. 2
Flowchart of the study selection process
Fig. 3
Fig. 3
Percent (%) of smoking status algorithms characterized by validity measures (n = 116)
Fig. 4
Fig. 4
Distribution of selected algorithm validity measures, stratified by algorithm data source. Note: The centre horizontal line within the box represents the median (50th percentile); upper and lower bounds of the box indicate the 25th and 75th percentiles; dashed lines connect the maximum and minimum values; circles represent outliers. PPV = positive predictive value
Fig. 5
Fig. 5
Distribution of selected algorithm validity measures, stratified by algorithm data structure. Note: The centre horizontal line within the box represents the median (50th percentile); upper and lower bounds of the box indicate the 25th and 75th percentiles; dashed lines connect the maximum and minimum; circles represent outliers. PPV = positive predictive value
Fig. 6
Fig. 6
Distribution of selected algorithm validity measures, stratified by use of predictive model in algorithm construction. Note: The centre horizontal line within the box represents the median (50th percentile); upper and lower bounds of the box indicate the 25th and 75th percentiles; dashed lines connect the maximum and minimum with the box; circles represent outliers. PPV = positive predictive value

References

    1. Cowie MR, Blomster JI, Curtis LH, Duclaux S, Ford I, Fritz F, Goldman S, Janmohamed S, Kreuzer J, Leenay M, Michel A. Electronic health records to facilitate clinical research. Clin Res Cardiol. 2017;106:1–9. doi: 10.1007/s00392-016-1025-6. - DOI - PMC - PubMed
    1. Lee S, Xu Y, D'Souza AG, Martin EA, Doktorchik C, Zhang Z, Quan H. Unlocking the potential of electronic health records for health research. Int J Popul Data Sci. 2020;5(1):1123. - PMC - PubMed
    1. Kierkegaard P. Electronic health record: wiring Europe’s healthcare. Comput Law Secur Rev. 2011;27(5):503–515. doi: 10.1016/j.clsr.2011.07.013. - DOI
    1. Harbaugh CM, Cooper JN. Administrative databases. Semin Pediatr Surg. 2018;27(6):353–360. doi: 10.1053/j.sempedsurg.2018.10.001. - DOI - PubMed
    1. World Health Organization. Tobacco fact sheet from WHO providing key facts and information on surveillance. https://www.who.int/news-room/fact-sheets/detail/tobacco. Accessed 10 Apr 2022.

LinkOut - more resources