Doublet method for very fast autocoding

Jules J Berman¹

PMID: 15369595
PMCID: PMC521082
DOI: 10.1186/1472-6947-4-16

Comparative Study

Jules J Berman. BMC Med Inform Decis Mak. 2004 Sep 15;4:16. doi: 10.1186/1472-6947-4-16.

Affiliation

¹ Cancer Diagnosis Program, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA. bermanj@mail.nih.gov.

Abstract

Background: Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding.

Methods: An autocoder was written that transforms plain text into intercalated word doublets (e.g., "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Each matching doublet is assigned the numeric code specific to that doublet in the nomenclature index. Text doublets that do not match the index cannot be part of a valid nomenclature term. Runs of matching doublets from the text are concatenated and matched against nomenclature terms (also represented as runs of doublets).
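
The steps described in the Methods can be illustrated with a minimal sketch. The published implementation is a Perl script; the version below is written in Python for illustration only and is not the author's code. The function names, the two-term toy nomenclature, and its codes ("T1", "T2") are hypothetical. The sketch also simplifies in two ways that are assumptions, not details from the article: the doublet index is a plain set rather than a table of numeric doublet codes, and only multi-word terms are handled.

```python
import re


def doublets(words):
    """Return the overlapping word pairs (doublets) of a word list."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_index(nomenclature):
    """Collect every doublet that occurs in any nomenclature term."""
    index = set()
    for term in nomenclature:
        index.update(doublets(term.split()))
    return index


def runs_of_matches(words, index):
    """Yield maximal word runs in which every adjacent pair is an indexed doublet."""
    run = []
    for a, b in zip(words, words[1:]):
        if f"{a} {b}" in index:
            if not run:
                run = [a]
            run.append(b)
        else:
            if run:
                yield run
            run = []
    if run:
        yield run


def autocode(text, nomenclature):
    """Return (term, code) pairs for nomenclature terms found in the text.

    `nomenclature` maps lowercase, space-separated terms to codes.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    index = build_index(nomenclature)
    hits = []
    for run in runs_of_matches(words, index):
        # Simplification: test every sub-phrase of two or more words in the run.
        for start in range(len(run)):
            for end in range(start + 2, len(run) + 1):
                phrase = " ".join(run[start:end])
                if phrase in nomenclature:
                    hits.append((phrase, nomenclature[phrase]))
    return hits


if __name__ == "__main__":
    sentence = "The ciliary body produces aqueous humor"
    toy_nomenclature = {"ciliary body": "T1", "aqueous humor": "T2"}  # hypothetical
    print(doublets(sentence.split()))
    print(autocode(sentence, toy_nomenclature))
```

Because any doublet absent from the index breaks a run, most of the text is rejected with a single set lookup per word pair, and only the short surviving runs are compared against full nomenclature terms.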

Results: The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both autocoders used an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing over 102,271 unique names of neoplasms). In side-by-side comparison on the same computer, the doublet method autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds). The doublet method codes 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder.
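
The reported speed figures are mutually consistent. As a quick check, taking the corpus size as roughly 170 Megabytes (the abstract gives "170+", so the throughput computed here is a lower bound):

```latex
\[
\frac{1776\ \text{s}}{211\ \text{s}} \approx 8.4,
\qquad
\frac{170\ \text{MB}}{211\ \text{s}} \approx 0.8\ \text{MB/s}
\]
```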

Conclusions: The doublet method of autocoding is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ASCII plain text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself are all open source materials.

Cited by

Vreeman DJ, McDonald CJ. A comparison of Intelligent Mapper and document similarity scores for mapping local radiology terms to LOINC. AMIA Annu Symp Proc. 2006;2006:809-13. PMID: 17238453.
McMurry AJ, Fitch B, Savova G, Kohane IS, Reis BY. Improved de-identification of physician notes through integrative modeling of both public and private medical text. BMC Med Inform Decis Mak. 2013 Oct 2;13:112. doi: 10.1186/1472-6947-13-112. PMID: 24083569.
Tseytlin E, Mitchell K, Legowski E, Corrigan J, Chavan G, Jacobson RS. NOBLE - Flexible concept recognition for large-scale biomedical natural language processing. BMC Bioinformatics. 2016 Jan 14;17:32. doi: 10.1186/s12859-015-0871-y. PMID: 26763894.
Berman JJ. Automatic extraction of candidate nomenclature terms using the doublet method. BMC Med Inform Decis Mak. 2005 Oct 18;5:35. doi: 10.1186/1472-6947-5-35. PMID: 16232314.
McDonald CJ, Dexter P, Schadow G, Chueh HC, Abernathy G, Hook J, Blevins L, Overhage JM, Berman JJ. SPIN query tools for de-identified research on a humongous database. AMIA Annu Symp Proc. 2005;2005:515-9. PMID: 16779093.
