J Am Med Inform Assoc. 2000 Sep-Oct;7(5):499-511.
doi: 10.1136/jamia.2000.0070499.

Corpus-based statistical screening for phrase identification

W Kim et al. J Am Med Inform Assoc. 2000 Sep-Oct.

Abstract

Purpose: The authors study the extraction of useful phrases from a natural language database by statistical methods. The aim is to leverage human effort by providing preprocessed phrase lists with a high percentage of useful material.

Method: The approach is to develop six different scoring methods that are based on different aspects of phrase occurrence. The emphasis here is not on lexical information or syntactic structure but rather on the statistical properties of word pairs and triples that can be obtained from a large database.
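
As a rough illustration of scoring word pairs from corpus statistics alone, the sketch below counts adjacent word pairs and ranks them by pointwise mutual information (PMI). The abstract does not name the six scoring methods, so PMI is only a stand-in chosen for illustration, and the function name `score_word_pairs`, the whitespace tokenization, and the `min_count` threshold are all assumptions rather than details of the authors' approach.

```python
import math
from collections import Counter


def score_word_pairs(sentences, min_count=2):
    """Rank adjacent word pairs by PMI (an illustrative stand-in score).

    sentences: iterable of strings; tokenization is a naive lowercase split.
    min_count: hypothetical frequency cutoff to suppress rare, noisy pairs.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        # Adjacent pairs only; pairs never span sentence boundaries.
        bigrams.update(zip(tokens, tokens[1:]))

    n_tokens = sum(unigrams.values())
    n_pairs = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    corpus = [
        "blood pressure was high",
        "high blood pressure is common",
        "the pressure was low",
    ]
    for pair, score in score_word_pairs(corpus):
        print(pair, round(score, 3))
```

The same counting scheme extends to word triples by zipping three offset token lists instead of two.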

Measurements: The Unified Medical Language System (UMLS) incorporates a large list of humanly acceptable phrases in the medical field as a part of its structure. The authors use this list of phrases as a gold standard for validating their methods. A good method is one that ranks the UMLS phrases high among all phrases studied. Measurements are 11-point average precision values and precision-recall curves based on the rankings.
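
The 11-point average precision measure can be sketched as follows, assuming the standard information-retrieval definition: precision is interpolated at recall levels 0.0, 0.1, ..., 1.0 by taking the highest precision reached at any recall at or above that level, and the 11 values are averaged. The function names and the toy data are hypothetical; here the gold-standard set plays the role of the UMLS phrase list, and recall is measured against the gold phrases that appear among the ranked candidates.

```python
def interpolated_precisions(ranked, gold):
    """Return interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    ranked: candidate phrases sorted best-first by some score.
    gold:   set of phrases accepted by the gold standard.
    """
    relevant_total = sum(1 for phrase in ranked if phrase in gold)
    if relevant_total == 0:
        return [0.0] * 11

    # Record (recall, precision) each time a gold phrase is encountered.
    points = []
    hits = 0
    for rank, phrase in enumerate(ranked, start=1):
        if phrase in gold:
            hits += 1
            points.append((hits / relevant_total, hits / rank))

    levels = [i / 10 for i in range(11)]
    # Taking the maximum precision at any recall >= level makes the
    # resulting curve a nonincreasing function of recall.
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


def eleven_point_average_precision(ranked, gold):
    """Average the 11 interpolated precision values."""
    values = interpolated_precisions(ranked, gold)
    return sum(values) / len(values)


if __name__ == "__main__":
    ranked = ["heart attack", "of the", "blood pressure", "in a"]
    gold = {"heart attack", "blood pressure"}
    print(interpolated_precisions(ranked, gold))
    print(eleven_point_average_precision(ranked, gold))
```

A scoring method that places every gold phrase ahead of every other candidate would score 1.0 under this measure, which matches the stated criterion that a good method ranks the UMLS phrases high.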

Result: The authors find that each of six different scoring methods proves effective in identifying UMLS-quality phrases in a large subset of MEDLINE. These methods are applicable to both word pairs and word triples. All six methods are optimally combined to produce composite scoring methods that are more effective than any single method. The quality of the composite methods appears sufficient to support the automatic placement of hyperlinks in text at the site of highly ranked phrases.

Conclusion: Statistical scoring methods provide a promising approach to the extraction of useful phrases from a natural language database for the purpose of indexing or providing hyperlinks in text.

Figures

Figure 1
Recall-precision curves for the scoring methods applied to the set M2. The 11 precision values are interpolated so that precision is a nonincreasing function of recall.

Figure 2
Recall-precision curves for the scoring methods applied to the set M3. The 11 precision values are interpolated so that precision is a nonincreasing function of recall.
