Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health

Denis Newman-Griffis¹,², Eric Fosler-Lussier³

Affiliations

¹ Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA.
² Epidemiology & Biostatistics Section, Rehabilitation Medicine Department, National Institutes of Health Clinical Center, Bethesda, Maryland, USA.
³ Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio, USA.

PMID: 33791684
PMCID: PMC8009547
DOI: 10.3389/fdgth.2021.620828

Denis Newman-Griffis et al. Front Digit Health. 2021 Mar:3:620828. doi: 10.3389/fdgth.2021.620828. Epub 2021 Mar 10.

Abstract

Linking clinical narratives to standardized vocabularies and coding systems is a key component of unlocking the information in medical text for analysis. However, many domains of medical concepts, such as functional outcomes and social determinants of health, lack well-developed terminologies that can support effective coding of medical text. We present a framework for developing natural language processing (NLP) technologies for automated coding of medical information in under-studied domains, and demonstrate its applicability through a case study on physical mobility function. Mobility function is a component of many health measures, from post-acute care and surgical outcomes to chronic frailty and disability, and is represented as one domain of human activity in the International Classification of Functioning, Disability, and Health (ICF). However, mobility and other types of functional activity remain under-studied in the medical informatics literature, and neither the ICF nor commonly-used medical terminologies capture functional status terminology in practice. We investigated two data-driven paradigms, classification and candidate selection, to link narrative observations of mobility status to standardized ICF codes, using a dataset of clinical narratives from physical therapy encounters. Recent advances in language modeling and word embedding were used as features for established machine learning models and a novel deep learning approach, achieving a macro-averaged F-1 score of 84% on linking mobility activity reports to ICF codes. Both classification and candidate selection approaches present distinct strengths for automated coding in under-studied domains, and we highlight that the combination of (i) a small annotated data set; (ii) expert definitions of codes of interest; and (iii) a representative text corpus is sufficient to produce high-performing automated coding systems. 
This research has implications for continued development of language technologies to analyze functional status information, and the ongoing growth of NLP tools for a variety of specialized applications in clinical care and research.
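
As a rough illustration of the candidate selection paradigm named above, the following minimal Python sketch ranks candidate ICF codes by cosine similarity between a vector for an activity report and one vector per code. It is a sketch under stated assumptions, not the authors' implementation: the three-dimensional vectors are toy values, and the `cosine` and `rank_codes` helpers are hypothetical names introduced here. In the study, the vectors would instead come from word2vec or BERT embeddings of the report text and of the code definitions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_codes(report_vec, code_vecs):
    """Return (code, score) pairs sorted from most to least similar."""
    scored = [(code, cosine(report_vec, vec)) for code, vec in code_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for learned embeddings (assumption: real vectors would be
# derived from code definitions and from the activity report text).
code_vecs = {
    "d450": [1.0, 0.0, 0.0],
    "d435": [0.0, 1.0, 0.0],
}
report_vec = [0.9, 0.1, 0.0]

ranking = rank_codes(report_vec, code_vecs)
best_code = ranking[0][0]  # highest-scoring candidate code
print(ranking)
```

Because each code is scored independently against the report, a design like this can in principle accept a new code by adding one more vector, whereas a classifier would need retraining with labeled examples of that code.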

Keywords: Disability; EHR data; Free text; ICF; Machine learning; Natural language processing; Physical function; Rehabilitation.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
An example activity report describing a clinical observation of mobility, indicating: (i) the Action being described, (ii) a source of Assistance involved in activity performance, and (iii) the Quantification outcome of the measurement setting. The Action in this activity report is assigned ICF code *d450 Walking*.

**Figure 2**
Experimental settings for representing activity reports for machine learning models. Unigram and embedding features were used both separately and together.

**Figure 3**
ICF coding workflow. EHR texts are analyzed to identify activity reports (provided *a priori* for this study), which are assigned ICF codes under either the classification or candidate selection paradigms.

**Figure 4**
Diagram of novel context-dependent projection model for code embeddings. The deep neural network (DNN) takes an activity report and an individual code embedding as input and produces a projected version of the code embedding as output. The same DNN was applied to all code embeddings; similarity scoring is performed using the combined cosine similarity and projection magnitude method of Sabbir et al. (63). After training the model, the softmax operation is removed and similarity scores are produced as output.

**Figure 5**
Macro-averaged F-1 performance on classifying mobility activity reports with ICF codes or *Other*. Results are reported on development data. **(A)** reports on selection of embedding corpus for word2vec (static embeddings), **(B)** reports on selection of BERT model, **(C)** reports on selection of unigram feature weighting, and **(D)** reports on experiments with different feature sets, with and without the Action oracle.

**Figure 6**
Macro-averaged F-1 performance, reported on development data, on candidate selection approaches to coding activity reports according to the ICF. **(A,B)** report on selection of embedding models for word2vec and BERT features, respectively. **(C)** reports results of projected similarity experiments with different numbers of hidden layers in the Deep Neural Network (DNN) component. **(D)** illustrates results using 3-digit ICF code definitions with and without extended definitions, and **(E)** shows results for cosine similarity and projected similarity models with and without the Action oracle.

**Figure 7**
Macro-averaged F-1 performance on test data from cross validation experiments for assigning ICF codes to mobility activity reports. **(A)** compares our best classification model against our best candidate selection model (with and without access to the Action oracle), taking all labels including *Other* into account. **(B)** reports the same comparison on the 12 ICF code labels only, excluding *Other*.

**Figure 8**
Per-label performance analysis, comparing best classification and candidate selection models without access to the Action oracle (left bars) and with oracle access (right bars). Labels are ordered from most frequent (*d450*) to least (*d435*), with the frequency of each provided in parentheses.

**Figure 9**
Confusion matrices for best classification and candidate selection models, without access to the Action oracle (top row) and with oracle access (bottom row). The rows of each matrix indicate the annotated label for a sample, and columns indicate the predicted label. Labels are ordered from most frequent (*d450*) to least frequent (*d435*).
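
The macro-averaged F-1 metric reported in Figures 5 to 8 can be computed directly from a confusion matrix laid out as in Figure 9 (rows are annotated labels, columns are predicted labels). The short Python sketch below shows one way to do this. The matrix values are invented toy counts, not results from the paper, and the `macro_f1` helper is a hypothetical name. Scoring a label with no true or predicted samples as 0 is an assumption made here for simplicity.

```python
def macro_f1(confusion):
    """Macro-averaged F1 from a square confusion matrix.

    confusion[i][j] is the number of samples annotated with label i
    and predicted as label j.
    """
    n = len(confusion)
    per_label = []
    for k in range(n):
        tp = confusion[k][k]                               # correct predictions of k
        fn = sum(confusion[k]) - tp                        # annotated k, predicted otherwise
        fp = sum(confusion[i][k] for i in range(n)) - tp   # predicted k, annotated otherwise
        denom = 2 * tp + fp + fn
        # Assumption: a label with no true or predicted samples scores 0.
        per_label.append(2 * tp / denom if denom else 0.0)
    # Unweighted mean, so rare labels count as much as frequent ones.
    return sum(per_label) / n

# Toy three-label matrix (invented counts for illustration only).
toy = [
    [8, 2, 0],
    [1, 3, 1],
    [0, 0, 5],
]
score = macro_f1(toy)
print(round(score, 4))
```

Because the mean is unweighted, errors on an infrequent label such as *d435* affect the score as much as errors on a frequent one such as *d450*.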

Grants and funding

ZIA CL060065/ImNIH/Intramural NIH HHS/United States
