Doublet method for very fast autocoding
- PMID: 15369595
- PMCID: PMC521082
- DOI: 10.1186/1472-6947-4-16
Abstract
Background: Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding.
Methods: An autocoder was written that transforms plain text into overlapping word doublets (e.g., "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Each matching doublet is assigned the numeric code specific to that doublet in the nomenclature. Text doublets that do not match the index cannot be part of a valid nomenclature term. Runs of consecutive matching doublets from the text are concatenated and matched against nomenclature terms (which are also represented as runs of doublets).
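The steps described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Perl implementation supplied with the article: the function names, the tokenization rule (lowercased alphanumeric words), the three-entry nomenclature, and its codes are all hypothetical. Instead of assigning numeric codes to doublets, the sketch looks up each run's sub-phrases directly in a dictionary of terms, and it handles only terms of two or more words.

```python
import re


def words_of(text):
    """Split text into lowercase alphanumeric words (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def doublets(words):
    """Return the overlapping word pairs of a word list."""
    return list(zip(words, words[1:]))


def build_doublet_index(terms):
    """Collect every doublet that occurs in any nomenclature term."""
    index = set()
    for term in terms:
        index.update(doublets(words_of(term)))
    return index


def autocode(text, nomenclature):
    """Return (term, code) pairs found in text.

    nomenclature maps a term to its code; both are hypothetical here.
    """
    terms = {" ".join(words_of(t)): code for t, code in nomenclature.items()}
    index = build_doublet_index(terms)
    found = []
    run = []  # words covered by the current run of matching doublets
    # A sentinel pair that can never match flushes the final run.
    for a, b in doublets(words_of(text)) + [(None, None)]:
        if (a, b) in index:
            run = run + [b] if run else [a, b]
            continue
        # The run has ended: test each contiguous sub-phrase against the terms.
        for i in range(len(run)):
            for j in range(i + 2, len(run) + 1):
                phrase = " ".join(run[i:j])
                if phrase in terms:
                    found.append((phrase, terms[phrase]))
        run = []
    return found


# Hypothetical nomenclature entries and codes, for illustration only.
NOMENCLATURE = {
    "ciliary body": "C0001",
    "aqueous humor": "C0002",
    "ciliary body melanoma": "C0003",
}

if __name__ == "__main__":
    print(autocode("The ciliary body produces aqueous humor", NOMENCLATURE))
```

Most text doublets fail the set-membership test and are discarded immediately, so only short runs of candidate words ever reach the term lookup.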
Results: The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both autocoders used an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing over 102,271 unique names of neoplasms). In side-by-side comparison on the same computer, the doublet method autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds). The doublet method codes 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder.
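The reported speedup and throughput can be checked with simple arithmetic. The sketch below uses only the figures quoted in the Results; the corpus size is given as "170+ Megabyte", so the value of 170 is an assumed lower bound and the throughput is approximate.

```python
# Figures quoted in the Results section.
phrase_seconds = 1776
doublet_seconds = 211
corpus_megabytes = 170  # assumption: lower bound, the text says "170+"

speedup = phrase_seconds / doublet_seconds
throughput = corpus_megabytes / doublet_seconds  # megabytes per second

print(f"speedup: {speedup:.1f}x")            # speedup: 8.4x
print(f"throughput: {throughput:.1f} MB/s")  # throughput: 0.8 MB/s
```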
Conclusions: The doublet method of autocoding is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ASCII plain text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself are all open source materials.