BMC Med Inform Decis Mak. 2003 Jun 16;3:6.
doi: 10.1186/1472-6947-3-6. Epub 2003 Jun 16.

A tool for sharing annotated research data: the "Category 0" UMLS (Unified Medical Language System) vocabularies

Jules J Berman. BMC Med Inform Decis Mak.

Abstract

Background: Large biomedical data sets have become increasingly important resources for medical researchers. Modern biomedical data sets are annotated with standard terms that describe the data and support data linking between databases. The largest curated listing of biomedical terms is the National Library of Medicine's Unified Medical Language System (UMLS). The UMLS contains more than 2 million biomedical terms collected from nearly 100 medical vocabularies. Many of the vocabularies contained in the UMLS carry restrictions on their use, making it impossible to share or distribute UMLS-annotated research data. However, a subset of the UMLS vocabularies, designated Category 0 by UMLS, can be used to annotate and share data sets without violating the UMLS License Agreement.

Methods: The UMLS Category 0 vocabularies can be extracted from the parent UMLS Metathesaurus using a Perl script supplied with this article. There are 43 Category 0 vocabularies that can be used freely for research purposes without violating the UMLS License Agreement. Among the Category 0 vocabularies are MeSH (Medical Subject Headings), the NCBI (National Center for Biotechnology Information) Taxonomy, and ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification).
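
The extraction step described in Methods can be sketched in Python. This is not the Perl script distributed with the article, only a minimal illustration under stated assumptions: that a pipe-delimited sources file (MRSO) carries the string identifier in its third field and the source restriction level in its seventh, that the concept names file (MRCON) carries the string identifier in its sixth field, and that Category 0 corresponds to a restriction level of 0. Check these positions against the documentation for the UMLS release in use. The function names and the sample rows are invented for illustration.

```python
from typing import Iterable, Iterator, Set


def category0_string_ids(mrso_lines: Iterable[str]) -> Set[str]:
    """Collect string identifiers whose source restriction level is 0.

    Assumed layout (pipe-delimited): CUI|LUI|SUI|SAB|TTY|SCD|SRL|
    """
    allowed: Set[str] = set()
    for line in mrso_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 7:
            continue  # skip malformed rows
        string_id, restriction_level = fields[2], fields[6]
        if restriction_level == "0":
            allowed.add(string_id)
    return allowed


def filter_mrcon(mrcon_lines: Iterable[str], allowed: Set[str]) -> Iterator[str]:
    """Yield only the term rows whose string identifier is in `allowed`.

    Assumed layout (pipe-delimited): CUI|LAT|TS|LUI|STT|SUI|STR|LRL|
    """
    for line in mrcon_lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 7:
            continue  # skip malformed rows
        if fields[5] in allowed:
            yield line


# Invented sample rows, not taken from any UMLS release.
SAMPLE_MRSO = [
    "C0000001|L0000001|S0000001|MSH|MH|D000001|0|",
    "C0000002|L0000002|S0000002|XYZ|PT|12345|3|",
]
SAMPLE_MRCON = [
    "C0000001|ENG|P|L0000001|PF|S0000001|example term one|0|",
    "C0000002|ENG|P|L0000002|PF|S0000002|example term two|3|",
]

category0_rows = list(filter_mrcon(SAMPLE_MRCON, category0_string_ids(SAMPLE_MRSO)))
```

With real files, the same two functions could be fed open file handles instead of lists, so that neither file needs to be held in memory beyond the set of allowed identifiers.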

Results: The extraction file containing all Category 0 terms and concepts is 72,581,138 bytes in length and contains 1,029,161 terms. The UMLS Metathesaurus MRCON file (January, 2003) is 151,048,493 bytes in length and contains 2,146,899 terms. Therefore the Category 0 vocabularies, in aggregate, are about half the size of the UMLS Metathesaurus. A large publicly available listing of 567,921 different medical phrases was automatically coded using the full UMLS Metathesaurus and the Category 0 vocabularies. There were 545,321 phrases with one or more matches against UMLS terms, while 468,785 phrases had one or more matches against the Category 0 terms. This indicates that when the two vocabularies are evaluated by their fitness to find at least one term for a medical phrase, the Category 0 vocabularies performed 86% as well as the complete UMLS Metathesaurus.
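
The proportions quoted in Results follow directly from the reported counts. The short Python check below uses only the figures given in the abstract; the variable names are chosen here for illustration.

```python
# Counts reported in the abstract.
category0_terms = 1_029_161
umls_terms = 2_146_899
category0_bytes = 72_581_138
umls_bytes = 151_048_493

phrases_total = 567_921
phrases_matched_umls = 545_321
phrases_matched_category0 = 468_785

# Size of the Category 0 extraction relative to the full MRCON file.
term_ratio = category0_terms / umls_terms
byte_ratio = category0_bytes / umls_bytes

# Share of phrases with at least one match in each vocabulary set.
umls_match_rate = phrases_matched_umls / phrases_total
category0_match_rate = phrases_matched_category0 / phrases_total

# Category 0 performance relative to the complete Metathesaurus.
relative_coverage = phrases_matched_category0 / phrases_matched_umls

print(f"term ratio:        {term_ratio:.2f}")
print(f"byte ratio:        {byte_ratio:.2f}")
print(f"UMLS match rate:   {umls_match_rate:.1%}")
print(f"Cat. 0 match rate: {category0_match_rate:.1%}")
print(f"relative coverage: {relative_coverage:.0%}")
```

Both size ratios come out at roughly 0.48, which is the "about half" stated above, and the relative coverage rounds to the reported 86%.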

Conclusion: The Category 0 vocabularies of UMLS constitute a large nomenclature that can be used by biomedical researchers to annotate biomedical data. These annotated data sets can be distributed for research purposes without violating the UMLS License Agreement. These vocabularies may be of particular importance for sharing heterogeneous data from diverse biomedical data sets. The software tools to extract the Category 0 vocabularies are freely available Perl scripts entered into the public domain and distributed with this article.
