PLoS One. 2021 Mar 3;16(3):e0247872.
doi: 10.1371/journal.pone.0247872. eCollection 2021.

Cohort profile: St. Michael's Hospital Tuberculosis Database (SMH-TB), a retrospective cohort of electronic health record data and variables extracted using natural language processing

David Landsman et al. PLoS One.

Abstract

Background: Tuberculosis (TB) is a major cause of death worldwide. TB research draws heavily on clinical cohorts which can be generated using electronic health records (EHR), but granular information extracted from unstructured EHR data is limited. The St. Michael's Hospital TB database (SMH-TB) was established to address gaps in EHR-derived TB clinical cohorts and provide researchers and clinicians with detailed, granular data related to TB management and treatment.

Methods: We collected and validated multiple layers of EHR data from the TB outpatient clinic at St. Michael's Hospital, Toronto, Ontario, Canada to generate the SMH-TB database. SMH-TB contains structured data directly from the EHR, and variables generated using natural language processing (NLP) by extracting relevant information from free-text within clinic, radiology, and other notes. NLP performance was assessed using recall, precision and F1 score averaged across variable labels. We present characteristics of the cohort population using binomial proportions and 95% confidence intervals (CI), with and without adjusting for NLP misclassification errors.
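The evaluation described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "averaged across variable labels" means an unweighted (macro) average, and it uses the Wald normal approximation for the binomial 95% CI, which the abstract does not specify. The function names `macro_metrics` and `wald_ci` and the example labels are hypothetical.

```python
from collections import Counter
from math import sqrt


def macro_metrics(gold, predicted):
    """Return (precision, recall, F1), each macro-averaged over labels.

    Assumption: every label counts equally, regardless of frequency.
    """
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        pred_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        prec = tp[label] / pred_total if pred_total else 0.0
        rec = tp[label] / gold_total if gold_total else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def wald_ci(successes, n, z=1.96):
    """Binomial proportion with a Wald (normal approximation) interval."""
    p = successes / n
    half_width = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical manually abstracted labels versus NLP output.
gold = ["active", "latent", "none", "active"]
predicted = ["active", "latent", "latent", "none"]
precision, recall, f1 = macro_metrics(gold, predicted)
proportion, lower, upper = wald_ci(50, 100)
```

Macro averaging treats rare labels the same as common ones, so a ruleset that performs poorly on an infrequent label is penalized visibly. A frequency-weighted average would be an equally plausible reading of the abstract.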

Results: SMH-TB currently contains retrospective patient data spanning 2011 to 2018, for a total of 3298 patients (N = 3237 with at least 1 associated dictation). Performance of TB diagnosis and medication NLP rulesets surpasses 93% in recall, precision and F1 metrics, indicating good generalizability. We estimated 20% (95% CI: 18.4-21.2%) were diagnosed with active TB and 46% (95% CI: 43.8-47.2%) were diagnosed with latent TB. After adjusting for potential misclassification, the proportion of patients diagnosed with active and latent TB was 18% (95% CI: 16.8-19.7%) and 40% (95% CI: 37.8-41.6%) respectively.
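The abstract does not say how the proportions were adjusted for misclassification. One common approach, shown here purely as an assumption, is the Rogan-Gladen estimator, which corrects an apparent proportion using the classifier's sensitivity and specificity. The function name `rogan_gladen` and the sensitivity and specificity values below are hypothetical and are not taken from the study.

```python
def rogan_gladen(apparent, sensitivity, specificity):
    """Adjust an apparent proportion for classifier misclassification.

    adjusted = (apparent + specificity - 1) / (sensitivity + specificity - 1)
    The result is clipped to the valid range [0, 1].
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("classifier must perform better than chance")
    adjusted = (apparent + specificity - 1.0) / denominator
    return min(1.0, max(0.0, adjusted))


# Hypothetical inputs: 20% apparent proportion, 95% sensitivity and specificity.
adjusted = rogan_gladen(0.20, 0.95, 0.95)

# A perfect classifier leaves the apparent proportion unchanged.
unchanged = rogan_gladen(0.20, 1.0, 1.0)
```

With imperfect specificity, some apparent positives are false positives, so the adjusted proportion falls below the apparent one. That is the same direction as the shift reported in the Results, although the study's actual method may differ.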

Conclusion: SMH-TB is a unique database that includes a breadth of structured data derived from structured and unstructured EHR data by using NLP rulesets. The data are available for a variety of research applications, such as clinical epidemiology, quality improvement and mathematical modeling studies.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Data sources for SMH-TB database.
Fig 2. Patient-level and encounter-level data in SMH-TB.
Fig 3. Example of a component of a ruleset for extracting a variable (active TB diagnosis) from unstructured text in clinical dictations (using CHARTextract).
Fig 4. QuickLabel interface for manual variable abstraction.
(A) Value labels are shown for two example variables: the Tuberculin Skin Test (TST) and the Interferon Gamma Release Assay (IGRA). (B) A screenshot of a representative data extraction using the QuickLabel tool. The sentences containing the variables of interest are highlighted in yellow.
