Term-BLAST-like alignment tool for concept recognition in noisy clinical texts
- PMID: 38001031
- PMCID: PMC10710372
- DOI: 10.1093/bioinformatics/btad716
Abstract
Motivation: Methods for concept recognition (CR) in clinical texts have largely been tested on abstracts or articles from the medical literature. However, texts from electronic health records (EHRs) frequently contain spelling errors, abbreviations, and other nonstandard ways of representing clinical concepts.
Results: Here, we present a method inspired by the BLAST algorithm for biosequence alignment that screens texts for potential matches on the basis of matching k-mer counts and scores candidates based on conformance to typical patterns of spelling errors derived from 2.9 million clinical notes. Our method, the Term-BLAST-like alignment tool (TBLAT), leverages a gold standard corpus for typographical errors to implement a sequence alignment-inspired method for efficient entity linkage. We present a comprehensive experimental comparison of TBLAT with five widely used tools. Experimental results show a 10% increase in recall on scientific publications and a 20% increase in recall on EHRs (in each case compared against the next best method), supporting a significant enhancement of the entity linking task. The method can be used stand-alone or as a complement to existing approaches.
Availability and implementation: Fenominal is a Java library that implements TBLAT for named CR of Human Phenotype Ontology terms and is available at https://github.com/monarch-initiative/fenominal under the GNU General Public License v3.0.
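The k-mer screening step described in the Results can be illustrated with a minimal Java sketch. It is not the Fenominal API: the class name `KmerScreenSketch`, its methods, and the parameters `k` and `minShared` are hypothetical and chosen only for illustration. The sketch slides a window the length of an ontology label across the text and keeps windows that share enough k-mers with the label. The scoring stage based on spelling-error patterns is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of k-mer based screening; not the Fenominal API. */
class KmerScreenSketch {

    /** Counts every k-mer (substring of length k) in the lowercased input. */
    static Map<String, Integer> kmerCounts(String text, int k) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + k <= s.length(); i++) {
            counts.merge(s.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** Number of k-mers two strings have in common, counting multiplicity. */
    static int sharedKmers(String a, String b, int k) {
        Map<String, Integer> ca = kmerCounts(a, k);
        Map<String, Integer> cb = kmerCounts(b, k);
        int shared = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            shared += Math.min(e.getValue(), cb.getOrDefault(e.getKey(), 0));
        }
        return shared;
    }

    /**
     * Returns every window of the text (same length as the label) that
     * shares at least minShared k-mers with the label. Overlapping windows
     * around one misspelling may all pass; a later scoring step would rank them.
     */
    static List<String> screen(String text, String label, int k, int minShared) {
        List<String> hits = new ArrayList<>();
        int w = label.length();
        for (int i = 0; i + w <= text.length(); i++) {
            String window = text.substring(i, i + w);
            if (sharedKmers(window, label, k) >= minShared) {
                hits.add(window);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A misspelled mention ("siezures") still shares k-mers with the label.
        List<String> hits = screen("pt has siezures daily", "seizure", 3, 2);
        System.out.println(hits);
    }
}
```

In this sketch a transposition such as "siezure" for "seizure" destroys only the k-mers that span the swapped letters, so the remaining shared k-mers are enough for the window to pass the screen. The threshold `minShared` trades recall against the number of candidates passed on for scoring.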
© The Author(s) 2023. Published by Oxford University Press.
Conflict of interest statement
None declared.
References
- Altschul SF, Gish W, Miller W et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10.