Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2000 Jul-Aug;7(4):378-91.
doi: 10.1136/jamia.2000.0070378.

Fast exact string pattern-matching algorithms adapted to the characteristics of the medical language

Affiliations

Fast exact string pattern-matching algorithms adapted to the characteristics of the medical language

C Lovis et al. J Am Med Inform Assoc. 2000 Jul-Aug.

Abstract

Objective: The authors consider the problem of exact string pattern matching using algorithms that do not require any preprocessing. To choose the most appropriate algorithm, distinctive features of the medical language must be taken into account. The characteristics of medical language are emphasized in this regard, the best algorithm of those reviewed is proposed, and detailed evaluations of time complexity for processing medical texts are provided.

Design: The authors first illustrate and discuss the techniques of various string pattern-matching algorithms. Next, the source code and the behavior of representative exact string pattern-matching algorithms are presented in a comprehensive manner to promote their implementation. Detailed explanations of the use of various techniques to improve performance are given.

Measurements: Real-time measures of time complexity with English medical texts are presented. They lead to results distinct from those found in the computer science literature, which are typically computed with normally distributed texts.

Results: The Boyer-Moore-Horspool algorithm achieves the best overall results when used with medical texts. This algorithm usually performs at least twice as fast as the other algorithms tested.

Conclusion: The time performance of exact string pattern matching can be greatly improved if an efficient algorithm is used. Considering the growing amount of text handled in the electronic patient record, it is worth implementing this efficient algorithm.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Example of exact pattern matching in case of common roots.
Figure 2
Figure 2
The Naive algorithm.
Figure 3
Figure 3
The Karp-Rabin algorithm.
Figure 4
Figure 4
The Knuth-Morris-Pratt algorithm.
Figure 5
Figure 5
The Boyer-Moore-Horspool algorithm.
Figure 6
Figure 6
Overview of the test method.
Figure 7
Figure 7
Baseline measures determining the noise of the system.
Figure 8
Figure 8
Complexity is linear in time.
Figure 9
Figure 9
Overall performance of the five algorithms tested.

References

    1. Friedman C, Hripcsak G, Shagina L, Liu H. Representing information in patient reports using natural language processing and the extensible markup language. J Am Med Inform Assoc. 1999;6(1): 76-87. - PMC - PubMed
    1. Stein HD, Nadkarni P, Erdos J, Miller PL. Exploring the degree of concordance of coded and textual data in answering clinical queries from a clinical data repository. J Am Med Inform Assoc. 2000;7: 42-54. - PMC - PubMed
    1. Charlet J, Bachimont B, Brunie V, el Kassar S, Zweigenbaum P, Boisvieux JF. Hospitexte: toward a document-based hypertextual electronic medical record. Proc AMIA Symp. 1998: 713-7. - PMC - PubMed
    1. Lovis C, Baud RH, Planche P. Power of expression in the electronic patient record: structured data or narrative text? Submitted to Int J Med Inform. - PubMed
    1. Lovis C, Baud RH, Scherrer JR. Internet integrated in the daily medical practice within an electronic patient record. Comput Biol Med. 1998;28: 567-79. - PubMed

Publication types

MeSH terms