J Am Med Inform Assoc. 2016 Apr;23(e1):e11-9.
doi: 10.1093/jamia/ocv115. Epub 2015 Aug 27.

Data integration of structured and unstructured sources for assigning clinical codes to patient stays

Elyne Scheurwegs et al. J Am Med Inform Assoc. 2016 Apr.

Abstract

Objective: Enormous amounts of healthcare data are becoming increasingly accessible through the large-scale adoption of electronic health records. In this work, structured and unstructured (textual) data are combined to assign clinical diagnostic and procedural codes (specifically ICD-9-CM) to patient stays. We investigate whether integrating these heterogeneous data types improves prediction strength compared to using the data types in isolation.

Methods: Two separate data integration approaches were evaluated. Early data integration combines the features of several sources within a single model, whereas late data integration learns a separate model per data source and combines their predictions with a meta-learner. Both approaches were evaluated on data sources and clinical codes from a broad set of medical specialties.
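The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration only: the abstract does not say which learners or meta-learner were used, so the small logistic regression below, the helper names (`train_logreg`, `predict_early`, `predict_late`), and the toy feature values are all assumptions made for the example.

```python
import math

def predict_proba(model, x):
    """Probability that the code applies, given a (weights, bias) model."""
    weights, bias = model
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=2000, lr=0.5):
    """Fit a small logistic regression by batch gradient descent."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            err = predict_proba((weights, bias), x) - label
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Invented toy data, one row per patient stay (not taken from the paper).
structured = [[1, 1], [1, 0], [0, 0], [0, 1], [0, 0], [1, 1]]  # e.g. medication/lab flags
text = [[2, 1], [1, 0], [0, 0], [0, 0], [0, 1], [1, 1]]        # e.g. term counts in notes
has_code = [1, 1, 0, 0, 0, 1]                                   # 1 if the target code was assigned

# Early integration: concatenate the features of all sources, train one model.
early_X = [s + t for s, t in zip(structured, text)]
early_model = train_logreg(early_X, has_code)

# Late integration: train one model per source ...
structured_model = train_logreg(structured, has_code)
text_model = train_logreg(text, has_code)

def source_predictions(s, t):
    """Per-source probabilities, used as input features for the meta-learner."""
    return [predict_proba(structured_model, s), predict_proba(text_model, t)]

# ... then train a meta-learner on the per-source predictions.
# For brevity it is fit on in-sample predictions; held-out predictions
# would normally be used to avoid overfitting the meta-learner.
meta_X = [source_predictions(s, t) for s, t in zip(structured, text)]
meta_model = train_logreg(meta_X, has_code)

def predict_early(s, t):
    return predict_proba(early_model, s + t)

def predict_late(s, t):
    return predict_proba(meta_model, source_predictions(s, t))

print(round(predict_early([1, 1], [2, 1]), 3), round(predict_late([1, 1], [2, 1]), 3))
```

In this sketch the early model sees four input features at once, while the meta-learner sees only two inputs (one probability per source), which is why the late approach can weight each source according to how reliable its predictions are.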

Results: Compared with the best individual prediction source, late data integration improves predictive power (e.g., the overall F-measure for International Classification of Diseases, Ninth Revision, Clinical Modification [ICD-9-CM] diagnostic codes increased from 30.6% to 38.3%), while early data integration is less consistent. Predictive strength differs strongly between medical specialties, for both ICD-9-CM diagnostic and procedural codes.

Discussion: Structured data provides complementary information to unstructured data (and vice versa) for predicting ICD-9-CM codes. This complementarity is captured most effectively by the proposed late data integration approach.

Conclusions: We demonstrated that models using multiple electronic health record data sources systematically outperform models using data sources in isolation in the task of predicting ICD-9-CM codes over a broad range of medical specialties.

Keywords: clinical coding; data integration; data mining; electronic health records; international classification of diseases.

Figures

Figure 1: The top graph shows the number of patient records in the available datasets per medical specialty; the bottom graph shows the number of unique ICD-9-CM codes (procedural: left, diagnostic: right) per specialty.
Figure 2: Visualization of an example set of structured and unstructured data types. (A) The disparate databases. (B) An example structure of a patient medical record, with full lines representing connections of data types and their level and dashed lines to their given date. (C) The mapping on the partial stay level.
Figure 3: Presence (in %) of the different data types over all datasets. The main bar shows the average presence; the error bars show the standard deviation between the different medical specialties. Abbreviations are explained in the “dataset” section.
Figure 4: Example of data integration for the ICD-9-CM code “430”. This figure illustrates the difference between early and late data integration. (A) A pipeline for early data integration. (B) A pipeline for late data integration.
Figure 5: Micro-averaged F-measure for increasing number of data sources. From left to right on the X axis, additional data sources are added (using late data integration). (A) ICD-9-CM diagnostic codes, (B) ICD-9-CM procedural codes. Results for specific specialties are in gray, averaged results in black.
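
For readers unfamiliar with the metric in Figure 5, micro-averaging pools true positives, false positives, and false negatives over all stays and codes before computing precision and recall. The sketch below uses the standard definition; the function name `micro_f_measure` and the example code sets are illustrative assumptions, not data from the paper.

```python
def micro_f_measure(gold, predicted):
    """Micro-averaged F-measure over per-stay sets of assigned codes.

    Counts are pooled across all stays before precision and recall
    are computed, so frequent codes weigh more than rare ones.
    """
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, predicted):
        tp += len(gold_codes & pred_codes)   # codes correctly predicted
        fp += len(pred_codes - gold_codes)   # predicted but not assigned
        fn += len(gold_codes - pred_codes)   # assigned but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative code sets for three stays (invented for this example).
gold = [{"430", "401.9"}, {"250.00"}, {"430"}]
predicted = [{"430"}, {"250.00", "401.9"}, {"431"}]

# Pooled counts: 2 true positives, 2 false positives, 2 false negatives.
print(micro_f_measure(gold, predicted))  # 0.5
```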
