Int J Med Inform. 2020 Jul:139:104135.

doi: 10.1016/j.ijmedinf.2020.104135. Epub 2020 Apr 4.

Automated ICD coding via unsupervised knowledge integration (UNITE)

Aaron Sonabend W¹, Winston Cai², Yuri Ahuja¹, Ashwin Ananthakrishnan³, Zongqi Xia⁴, Sheng Yu⁵, Chuan Hong⁶

Affiliations

¹ Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA.
² Bronx Science, New York City, NY, USA.
³ Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, USA.
⁴ Department of Neurology and Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
⁵ Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China; Institute for Data Science, Tsinghua University, Beijing, China.
⁶ Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA. Electronic address: chuan_hong@hms.harvard.edu.

PMID: 32361145
PMCID: PMC9410729
DOI: 10.1016/j.ijmedinf.2020.104135

Aaron Sonabend W et al. Int J Med Inform. 2020 Jul.

Abstract

Objective: Accurate coding is critical for medical billing and electronic medical record (EMR)-based research. Recent research has focused on developing supervised methods to automatically assign International Classification of Diseases (ICD) codes from clinical notes. However, supervised approaches rely on ICD code data stored in the hospital EMR system and are subject to bias arising from practice and coding behavior. Consequently, the portability of trained supervised algorithms to external EMR systems may suffer.

Method: We developed an unsupervised knowledge integration (UNITE) algorithm to automatically assign ICD codes for a specific disease by analyzing clinical narrative notes via semantic relevance assessment. The algorithm was validated using coded ICD data for 6 diseases from the Partners HealthCare (PHS) Biobank and the Medical Information Mart for Intensive Care (MIMIC-III). We compared the performance of UNITE against penalized logistic regression (LR), topic modeling, and neural network models within each EMR system. We additionally evaluated the portability of UNITE by training at PHS Biobank and validating at MIMIC-III, and vice versa.
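The two-way portability check described above (train at one site, validate at the other, and vice versa) can be sketched as a small evaluation grid. This is a minimal illustration, not the authors' code: `cross_site_grid`, `fit`, and `evaluate` are hypothetical names, and each site is assumed to supply its own training split and held-out split.

```python
from typing import Any, Callable, Dict, Tuple


def cross_site_grid(
    sites: Dict[str, Tuple[Any, Any]],
    fit: Callable[[Any], Any],
    evaluate: Callable[[Any, Any], float],
) -> Dict[Tuple[str, str], float]:
    """Fit on each site's training split and score on every site's held-out split.

    `sites` maps a site name to a (train, held_out) pair. The returned dict is
    keyed by (train_site, test_site). Entries where the two names match are
    within-system results; the remaining entries measure portability.
    """
    results: Dict[Tuple[str, str], float] = {}
    for train_name, (train_split, _) in sites.items():
        model = fit(train_split)  # model sees only the training site's data
        for test_name, (_, held_out) in sites.items():
            results[(train_name, test_name)] = evaluate(model, held_out)
    return results
```

Passing the metric in as `evaluate` keeps the grid independent of the model type, so the same loop could be reused for each method being compared.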

Results: UNITE achieved an average AUC of 0.91 at PHS and 0.92 at MIMIC over 6 diseases, comparable to LR and multilayer perceptron (MLP) models. It performed substantially better than topic models. With regard to portability, UNITE's performance was consistent across EMR systems and superior to that of LR, topic models, and neural network models.
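For readers unfamiliar with the metric, the AUC reported above is the probability that a randomly chosen coded patient receives a higher score than a randomly chosen uncoded patient. The sketch below computes it from that definition and averages it across diseases. The function names and the toy numbers are illustrative assumptions, not data or code from the study.

```python
from typing import Dict, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def mean_auc(per_disease: Dict[str, Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted mean of per-disease AUCs."""
    values = [auc(labels, scores) for labels, scores in per_disease.values()]
    return sum(values) / len(values)


# Toy example with made-up labels and scores for two placeholder diseases.
toy = {
    "disease_1": ([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]),
    "disease_2": ([1, 0], [0.8, 0.1]),
}
toy_mean = mean_auc(toy)
```

The pairwise loop is quadratic in the number of patients; it is used here only because it mirrors the definition directly.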

Conclusion: UNITE accurately assigns ICD codes in EMRs without requiring human labor and has major advantages over commonly used machine learning approaches. In addition, UNITE attained stable performance and high portability across EMRs at different institutions.

Keywords: Automated ICD assignment; Electronic medical records; Knowledge integration; Portability; Semantic embedding; Unsupervised learning.

Conflict of interest statement

Declaration of Competing Interest: All authors have declared that they have no financial or non-financial interests that may be relevant to the submitted work, and no other relationships or activities that could appear to have influenced the submitted work.

Figures

**Figure 1.**
Workflow of UNITE in predicting the presence of an ICD code for one specific disease. NLP: natural language processing; NER: named entity recognition; CUI: concept unique identifier. β denotes the regression coefficients, which serve as importance weights for the CUIs.
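As a rough illustration of how per-CUI weights could turn the concepts extracted from a note into a single disease score, the sketch below takes a weighted sum of CUI counts. The caption only states that β weights the importance of the CUIs; the weighted-sum form, the function name `relevance_score`, and the placeholder identifiers are assumptions made for this example, not details confirmed by the source.

```python
from collections import Counter
from typing import Dict, Iterable


def relevance_score(note_cuis: Iterable[str], weights: Dict[str, float]) -> float:
    """Weighted sum of CUI counts for one patient's notes.

    `note_cuis` is the sequence of CUIs that an NER step found in the notes
    (repeats allowed). `weights` plays the role of beta: one importance weight
    per CUI. CUIs without a weight contribute nothing. This aggregation rule
    is an assumption for illustration only.
    """
    counts = Counter(note_cuis)
    return sum(weights.get(cui, 0.0) * n for cui, n in counts.items())


# Placeholder identifiers, not real UMLS CUIs.
example_weights = {"CUI_A": 2.0, "CUI_B": 0.5}
example_note = ["CUI_A", "CUI_B", "CUI_B", "CUI_X"]
example_score = relevance_score(example_note, example_weights)
```

A score of this kind would still need a threshold or ranking step before it yields a yes/no code assignment.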

**Figure 2.**
Relative CUI importance for RA, CAD, UC, MS, LC, and CD, shown as word clouds with font size proportional to CUI importance.

**Figure 3.**
AUC for penalized logistic regression (LR), penalized logistic regression with UNITE features (UNITE-LR), penalized logistic regression with the mean CUI (mCUI-LR), penalized logistic regression with TFIDF features (TFIDF-LR), multilayer perceptron (MLP), multilayer perceptron with UNITE features (UNITE-MLP), multilayer perceptron with the mean CUI (mCUI-MLP), multilayer perceptron with TFIDF features (TFIDF-MLP), topic modeling based on LDA with 2 topics fit with VEM (LDA_VEM) and Gibbs (LDA_Gibbs), CTM with 2 topics fit with VEM (CTM_VEM), and UNITE. Each row is a different disease; columns show performance for methods trained and tested on the same hospital system (first two panels) or trained and tested on different hospital systems (last two panels).

**Figure 4.**
F-score for penalized logistic regression (LR), penalized logistic regression with UNITE features (UNITE-LR), penalized logistic regression with the mean CUI (mCUI-LR), penalized logistic regression with TFIDF features (TFIDF-LR), multilayer perceptron (MLP), multilayer perceptron with UNITE features (UNITE-MLP), multilayer perceptron with the mean CUI (mCUI-MLP), multilayer perceptron with TFIDF features (TFIDF-MLP), topic modeling based on LDA with 2 topics fit with VEM (LDA_VEM) and Gibbs (LDA_Gibbs), CTM with 2 topics fit with VEM (CTM_VEM), and UNITE.

Cited by

Automated ICD coding for coronary heart diseases by a deep learning method.
Zhao S, Diao X, Xia Y, Huo Y, Cui M, Wang Y, Yuan J, Zhao W. Heliyon. 2023 Feb 27;9(3):e14037. doi: 10.1016/j.heliyon.2023.e14037. eCollection 2023 Mar. PMID: 36938427. Free PMC article.
Automatic multilabel detection of ICD10 codes in Dutch cardiology discharge letters using neural networks.
Sammani A, Bagheri A, van der Heijden PGM, Te Riele ASJM, Baas AF, Oosters CAJ, Oberski D, Asselbergs FW. NPJ Digit Med. 2021 Feb 26;4(1):37. doi: 10.1038/s41746-021-00404-9. PMID: 33637859. Free PMC article.
Comparison of different feature extraction methods for applicable automated ICD coding.
Shuai Z, Xiaolin D, Jing Y, Yanni H, Meng C, Yuxin W, Wei Z. BMC Med Inform Decis Mak. 2022 Jan 12;22(1):11. doi: 10.1186/s12911-022-01753-5. PMID: 35022039. Free PMC article.

Grants and funding

R01 NS098023/NS/NINDS NIH HHS/United States
