J Am Med Inform Assoc. 2002 Nov-Dec;9(6):612-20.
doi: 10.1197/jamia.m1139.

Creating an online dictionary of abbreviations from MEDLINE

Jeffrey T Chang et al. J Am Med Inform Assoc. 2002 Nov-Dec.

Abstract

Objective: The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbreviations, we have developed an algorithm to match abbreviations in text with their expansions.

Design: Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune.
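As a rough illustration of the scoring step described above, the following sketch fits a binary logistic regression by plain gradient descent and uses it to score candidate expansions. It is not the authors' implementation: the two features, the toy training data, and the names `logistic_score` and `train` are assumptions made purely for illustration.

```python
import math

def logistic_score(features, weights, bias):
    """Map a feature vector to a score in (0, 1) with the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, labels, epochs=2000, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n = len(examples[0])
    m = len(examples)
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(examples, labels):
            err = logistic_score(x, weights, bias) - y
            for i in range(n):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias

# Made-up training set (hypothetical features, not those of the paper):
# [fraction of abbreviation letters aligned, fraction aligned at word starts]
X = [[1.0, 1.0], [1.0, 0.8], [0.9, 0.9], [0.3, 0.0], [0.5, 0.2], [0.2, 0.1]]
y = [1, 1, 1, 0, 0, 0]  # 1 = annotated as a correct abbreviation/expansion pair

weights, bias = train(X, y)
good = logistic_score([1.0, 0.9], weights, bias)  # resembles the positive examples
bad = logistic_score([0.3, 0.1], weights, bias)   # resembles the negative examples
```

A candidate that resembles the annotated positives receives a score above 0.5, and one that resembles the negatives a score below 0.5; a cutoff on this score then decides which pairs enter the dictionary.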

Measurements: We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database.
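For readers less familiar with these measures, here is a minimal sketch of how recall and precision can be computed by comparing predicted abbreviation/expansion pairs against a gold standard. The function name and the example pairs are hypothetical and are not taken from Medstract.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted pairs against gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical (abbreviation, expansion) pairs for illustration only.
gold = {
    ("HMM", "hidden Markov model"),
    ("PCR", "polymerase chain reaction"),
    ("MRI", "magnetic resonance imaging"),
    ("ATP", "adenosine triphosphate"),
}
predicted = {
    ("HMM", "hidden Markov model"),
    ("PCR", "polymerase chain reaction"),
    ("MRI", "magnetic resonance imaging"),
    ("ATP", "at the point"),  # a wrong expansion counts against precision
}

precision, recall = precision_recall(predicted, gold)
```

Here three of four predictions are correct and three of four gold pairs are found, so both values are 0.75.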

Results: On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database.

Conclusion: We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server at http://abbreviation.stanford.edu/.

Figures

Figure 1
System architecture. We used a machine-learning approach to find and score abbreviations. First, we scan text to find possible abbreviations, align them with their prefix strings, and then collect a feature vector based on eight characteristics of the abbreviation and alignment. Finally, we apply binary logistic regression to generate a score from the feature vector.
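The first stages of this pipeline can be sketched as follows. This is an illustrative sketch only: it assumes candidates are parenthesized tokens, that the alignment is a longest-common-subsequence match, and it computes two made-up features rather than the eight characteristics the caption refers to. All function names are hypothetical.

```python
import re

def find_candidates(text, max_words=8):
    """Yield (abbreviation, prefix string) for each simple parenthesized token.

    Nested parentheses such as "(GABA(B)R1)" are not handled by this sketch.
    """
    for match in re.finditer(r"\(([^()\s]{2,10})\)", text):
        words = text[:match.start()].split()
        yield match.group(1), " ".join(words[-max_words:])

def align(abbrev, prefix):
    """Return prefix indices matched to abbreviation letters via an LCS alignment."""
    a, p = abbrev.lower(), prefix.lower()
    n, m = len(a), len(p)
    # table[i][j] holds the LCS length of a[i:] and p[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == p[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk forward through the table to recover one optimal alignment.
    positions, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == p[j]:
            positions.append(j)
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return positions

def features(abbrev, prefix):
    """Two illustrative features: letter coverage and word-start coverage."""
    positions = align(abbrev, prefix)
    at_word_start = sum(1 for j in positions if j == 0 or prefix[j - 1] == " ")
    coverage = len(positions) / len(abbrev)
    start_fraction = at_word_start / len(positions) if positions else 0.0
    return [coverage, start_fraction]

candidates = list(find_candidates("We trained a hidden Markov model (HMM) on the data."))
vector = features("HMM", "hidden Markov model")
```

For "HMM" every letter aligns to the start of a word in "hidden Markov model", so both illustrative features equal 1.0. A feature vector of this kind is what the logistic regression step would then turn into a score.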
Figure 2
Abbreviation server. Our abbreviation server supports queries by abbreviation or keyword.
Figure 3
Abbreviations Predicted in Medstract Gold Standard. We calculated the recall and precision of the abbreviations found with every possible score cutoff. Some scores are labelled on the curve. When the score cutoff is 0.14, seven of the abbreviations the algorithm found were not identified in the gold standard but nevertheless looked correct (primary ethylene response element (PERE), basic helix-loop-helix (bHLH), intermediate neuroblasts defective (ind), Ca2+-sensing receptor (CaSR), GABA(B) receptor (GABA(B)R1), Polymerase II (Pol II), GABAB receptor (GABA(B)R2)). The arrow points to the adjusted performance if these abbreviations had been included in Medstract. The performance of the Acromed system on this gold standard, as reported in Pustejovsky et al., is shown for comparison.
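The kind of curve this caption describes, recall and precision at every possible score cutoff, can be computed as in the sketch below. The scores and correctness labels are invented for illustration and do not reproduce the paper's data; `pr_curve` is a hypothetical helper name.

```python
def pr_curve(scored, n_gold):
    """Return (cutoff, recall, precision) for every distinct score cutoff.

    scored: list of (score, is_correct) for each predicted abbreviation.
    n_gold: number of abbreviations in the gold standard.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    curve, true_positives = [], 0
    for k, (score, correct) in enumerate(ranked, start=1):
        true_positives += int(correct)
        # Emit a point only after the last prediction sharing this score.
        if k == len(ranked) or ranked[k][0] != score:
            curve.append((score, true_positives / n_gold, true_positives / k))
    return curve

# Invented predictions: (score, whether the gold standard confirms the pair).
scored = [(0.95, True), (0.90, True), (0.60, False), (0.40, True), (0.10, False)]
curve = pr_curve(scored, n_gold=4)
```

Lowering the cutoff can only keep or raise recall, while precision may move in either direction. Counting a prediction as correct that the gold standard had missed, as in the adjustment the caption mentions, corresponds to flipping its label to `True` and increasing `n_gold` accordingly.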
Figure 4
Growth of abstracts and abbreviations. The number of abstracts and abbreviations added to MEDLINE steadily increases.
Figure 5
Scores of correct abbreviations from the China Medical Tribune. A score cutoff of 0.90 yields a recall of 68%, a cutoff of 0.14 yields 87%, and a cutoff of 0.03 yields 88%.
