Context-Sensitive Spelling Correction of Clinical Text via Conditional Independence

Juyong Kim et al. Proc Mach Learn Res. 2022 Apr;174:234-247.

Abstract

Spelling correction is a particularly important problem in clinical natural language processing because of the abundance of misspellings in medical records. However, the scarcity of labeled datasets in the clinical domain makes it hard to build a machine learning system for this task. In this work, we present a probabilistic model for correcting misspellings based on a simple conditional independence assumption, which leads to a modular decomposition into a language model and a corruption model. With a deep character-level language model trained on a large clinical corpus and a simple edit-based corruption model, we can build a spelling correction model with little or no real data. Experimental results show that our model significantly outperforms baselines on two healthcare spelling correction datasets.
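
Spelled out, the decomposition described in the abstract takes a noisy-channel form. As a minimal sketch, write x for the observed typo, c for the observed context, and y for the unobserved correct word (symbols chosen here for illustration; the paper's own notation may differ). Assuming the typo is conditionally independent of the context given the correct word, P(x | y, c) = P(x | y), so

    \hat{y} \;=\; \arg\max_{y} P(y \mid x, c)
           \;=\; \arg\max_{y} \underbrace{P(y \mid c)}_{\text{language model}} \, \underbrace{P(x \mid y)}_{\text{corruption model}} .

The two factors can be built and trained separately, which is what makes the decomposition modular and is consistent with the abstract's claim that the system can be built with little or no real labeled data: the language model needs only unlabeled clinical text, and the corruption model is a simple edit-based term.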


Figures

Figure 1: Graphical model of our conditional independence model. The context and the typo are observed; the correct word is unobserved.

Figure 2: Beam search of CIM on Example 1 at time step t=3. The beam candidates are ranked by the sum of the language model score (LM) and the corruption model score (ED); a scoring sketch follows this figure list. The hyper-parameters of the corruption model are C=5.0 and n=1. The beam width is set to B=1 for clear visualization.

Figure 3: Beam search decoding examples. For each example, we display the top 10 beam candidates. The column next to each candidate (Score) shows its final beam score.

Figure 4: Beam search decoding results for several examples of the CSpell test set. For each example, we display the top 10 beam candidates. The column next to each candidate (Score) shows its final beam score.
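
To make the scoring described in the Figure 2 caption concrete, here is a minimal, hypothetical Python sketch that ranks candidate corrections by the sum of a language-model score (LM) and an edit-based corruption score (ED) with penalty weight C and edit limit n. It is a word-level simplification with a placeholder uniform character LM; the paper's actual method runs character-level beam search over a deep character-level language model trained on clinical text, and the function and variable names below are illustrative rather than taken from the paper.

    from math import log

    def edit_distance(a: str, b: str) -> int:
        """Standard Levenshtein edit distance (insertions, deletions, substitutions)."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,            # delete ca from a
                                         dp[j - 1] + 1,        # insert cb into a
                                         prev + (ca != cb))    # substitute ca -> cb
        return dp[-1]

    def lm_logprob(word: str, context: str) -> float:
        """Placeholder LM: uniform per-character log-probability over 27 symbols.
        The paper instead uses a deep character-level LM trained on a clinical
        corpus and conditioned on the surrounding context."""
        return len(word) * log(1.0 / 27.0)

    def score_candidates(typo: str, context: str, vocab, C: float = 5.0, n: int = 1):
        """Rank candidates by LM log-probability minus C times the edit distance,
        keeping only candidates within n edits of the typo (cf. Figure 2: score = LM + ED)."""
        scored = []
        for cand in vocab:
            ed = edit_distance(typo, cand)
            if ed <= n:
                scored.append((cand, lm_logprob(cand, context) - C * ed))
        return sorted(scored, key=lambda item: item[1], reverse=True)

    if __name__ == "__main__":
        vocab = ["patient", "patent", "patience"]
        print(score_candidates("patiet", "the ___ was admitted to the icu", vocab))

In the paper's character-level beam search, an analogous combined LM + ED score is tracked for each beam candidate as decoding proceeds, which is roughly what Figure 2 visualizes at time step t=3.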
