Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Feb 22;21(1):69.
doi: 10.1186/s12911-021-01395-z.

A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

Affiliations

A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

Leonardo Campillos-Llanos et al. BMC Med Inform Decis Mak. .

Erratum in

Abstract

Background: The large volume of medical literature makes it difficult for healthcare professionals to keep abreast of the latest studies that support Evidence-Based Medicine. Natural language processing enhances the access to relevant information, and gold standard corpora are required to improve systems. To contribute with a new dataset for this domain, we collected the Clinical Trials for Evidence-Based Medicine in Spanish (CT-EBM-SP) corpus.

Methods: We annotated 1200 texts about clinical trials with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). We doubly annotated 10% of the corpus and measured inter-annotator agreement (IAA) using F-measure. As use case, we run medical entity recognition experiments with neural network models.

Results: This resource contains 500 abstracts of journal articles about clinical trials and 700 announcements of trial protocols (292 173 tokens). We annotated 46 699 entities (13.98% are nested entities). Regarding IAA agreement, we obtained an average F-measure of 85.65% (±4.79, strict match) and 93.94% (±3.31, relaxed match). In the use case experiments, we achieved recognition results ranging from 80.28% (±00.99) to 86.74% (±00.19) of average F-measure.

Conclusions: Our results show that this resource is adequate for experiments with state-of-the-art approaches to biomedical named entity recognition. It is freely distributed at: http://www.lllf.uam.es/ESP/nlpmedterm_en.html . The methods are generalizable to other languages with similar available sources.

Keywords: Clinical Trials; Evidence-Based Medicine; Inter-Annotator Agreement; Natural Language Processing; Semantic Annotation.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
Sample of the annotation
Fig. 2
Fig. 2
Sample of nested annotations
Fig. 3
Fig. 3
Distribution of annotated entity types (in percentage)
Fig. 4
Fig. 4
Therapeutic areas of texts (codes correspond to MeSH tree numbers)
Fig. 5
Fig. 5
IAA per entity type (strict)
Fig. 6
Fig. 6
IAA per entity type (relaxed)
Fig. 7
Fig. 7
IAA values per pair of annotators and with regard to consensus (C) annotations

Similar articles

Cited by

References

    1. Sackett D, Strauss D, Richardson W, Rosenberg W, Haynes R. Evidence-based medicine: how to practice and teach EBM. Churchill Livingstone, Edinburgh, 2nd Ed. (2000)
    1. National Library of Medicine. ClinicalTrials.gov;. https://clinicaltrials.gov/. Accessed 5 Sep 2020.
    1. European Medicines Agency. European Union Clinical Trials Register (EudraCT). http://www.clinicaltrialsregister.eu. Accessed 5 Sep 2020.
    1. McCray AT, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. Stud Health Technol Inform. 2001;84(01):216–20. - PMC - PubMed
    1. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl 1):D267–D270. doi: 10.1093/nar/gkh061. - DOI - PMC - PubMed

Publication types