BMC Med Inform Decis Mak. 2021 Feb 22;21(1):69.

doi: 10.1186/s12911-021-01395-z.

A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine

Leonardo Campillos-Llanos¹, Ana Valverde-Mateos², Adrián Capllonch-Carrión³, Antonio Moreno-Sandoval⁴

Affiliations

¹ Computational Linguistics Laboratory, Universidad Autónoma de Madrid, C/Francisco Tomás y Valiente 1. Cantoblanco Campus, 28049, Madrid, Spain. leonardo.campillos@uam.es.
² Medical Terminology Unit, Spanish Royal Academy of Medicine, C/Arrieta 12, 28013, Madrid, Spain.
³ Complejo Asistencial Hospital Benito Menni, C/Jardines 1, 28350, Ciempozuelos, Madrid, Spain.
⁴ Computational Linguistics Laboratory, Universidad Autónoma de Madrid, C/Francisco Tomás y Valiente 1. Cantoblanco Campus, 28049, Madrid, Spain.

PMID: 33618727
PMCID: PMC7898014
DOI: 10.1186/s12911-021-01395-z

Leonardo Campillos-Llanos et al. BMC Med Inform Decis Mak. 2021.

Erratum in

Correction to: A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine.
Campillos-Llanos L, Valverde-Mateos A, Capllonch-Carrión A, Moreno-Sandoval A. BMC Med Inform Decis Mak. 2021 Apr 7;21(1):118. doi: 10.1186/s12911-021-01475-0. PMID: 33827568. Free PMC article. No abstract available.

Abstract

Background: The large volume of medical literature makes it difficult for healthcare professionals to keep abreast of the latest studies that support Evidence-Based Medicine. Natural language processing enhances access to relevant information, and gold standard corpora are required to improve systems. To contribute a new dataset for this domain, we collected the Clinical Trials for Evidence-Based Medicine in Spanish (CT-EBM-SP) corpus.

Methods: We annotated 1200 texts about clinical trials with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). We doubly annotated 10% of the corpus and measured inter-annotator agreement (IAA) using F-measure. As a use case, we ran medical entity recognition experiments with neural network models.
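To illustrate how IAA can be measured with F-measure under strict and relaxed matching, the following is a minimal sketch and not the authors' evaluation script. It assumes that a strict match requires identical offsets and label, and that a relaxed match requires only overlapping offsets with the same label; the `Entity` type, the function names, and the example spans are all hypothetical.

```python
from typing import Callable, List, NamedTuple


class Entity(NamedTuple):
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    label: str  # semantic group: ANAT, CHEM, DISO or PROC


def strict_match(a: Entity, b: Entity) -> bool:
    # Assumption: strict agreement means same boundaries and same label.
    return a == b


def relaxed_match(a: Entity, b: Entity) -> bool:
    # Assumption: relaxed agreement means overlapping spans with the same label.
    return a.label == b.label and a.start < b.end and b.start < a.end


def f_measure(ann_a: List[Entity], ann_b: List[Entity],
              match: Callable[[Entity, Entity], bool]) -> float:
    """Pairwise F-measure, treating annotator A as reference and B as response."""
    if not ann_a or not ann_b:
        return 0.0
    # Precision: share of B's entities that match some entity of A.
    precision = sum(any(match(a, b) for a in ann_a) for b in ann_b) / len(ann_b)
    # Recall: share of A's entities that match some entity of B.
    recall = sum(any(match(a, b) for b in ann_b) for a in ann_a) / len(ann_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations of the same text by two annotators.
ann_a = [Entity(0, 9, "CHEM"), Entity(25, 42, "DISO"), Entity(50, 58, "PROC")]
ann_b = [Entity(0, 9, "CHEM"), Entity(30, 42, "DISO"), Entity(60, 66, "ANAT")]

strict_f = f_measure(ann_a, ann_b, strict_match)
relaxed_f = f_measure(ann_a, ann_b, relaxed_match)
print(f"strict F = {strict_f:.4f}, relaxed F = {relaxed_f:.4f}")
```

In this toy example the two annotators agree exactly on one entity and partially on a second, so the relaxed score is higher than the strict one, which mirrors the direction of the difference reported for the corpus.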

Results: This resource contains 500 abstracts of journal articles about clinical trials and 700 announcements of trial protocols (292 173 tokens). We annotated 46 699 entities (13.98% are nested entities). Regarding IAA, we obtained an average F-measure of 85.65% (±4.79, strict match) and 93.94% (±3.31, relaxed match). In the use case experiments, we achieved recognition results ranging from 80.28% (±0.99) to 86.74% (±0.19) average F-measure.
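The share of nested entities can be computed from span offsets alone. The sketch below is illustrative only: the abstract does not define nesting, so it assumes an entity is nested when its span lies entirely inside a different, wider span of another entity. The function name and the example offsets are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def nested_share(spans: List[Span]) -> float:
    """Fraction of spans fully contained in another, strictly wider span."""
    if not spans:
        return 0.0
    nested = 0
    for i, (start, end) in enumerate(spans):
        for j, (o_start, o_end) in enumerate(spans):
            contained = o_start <= start and end <= o_end
            # Assumption: identical spans do not count as nesting.
            if j != i and contained and (o_start, o_end) != (start, end):
                nested += 1
                break
    return nested / len(spans)


# Hypothetical example: two short spans inside one long span, plus one
# unrelated span.
spans = [(0, 20), (0, 6), (10, 20), (30, 40)]
print(f"{nested_share(spans):.2%} of entities are nested")
```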

Conclusions: Our results show that this resource is adequate for experiments with state-of-the-art approaches to biomedical named entity recognition. It is freely distributed at: http://www.lllf.uam.es/ESP/nlpmedterm_en.html . The methods are generalizable to other languages with similar available sources.

Keywords: Clinical Trials; Evidence-Based Medicine; Inter-Annotator Agreement; Natural Language Processing; Semantic Annotation.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 3**
Distribution of annotated entity types (in percentage)

**Fig. 4**
Therapeutic areas of texts (codes correspond to MeSH tree numbers)

**Fig. 6**
IAA per entity type (relaxed)

**Fig. 7**
IAA values per pair of annotators and with regard to consensus (C) annotations


Cited by

CoQUAD: a COVID-19 question answering dataset system, facilitating research, benchmarking, and practice.
Raza S, Schwartz B, Rosella LC. BMC Bioinformatics. 2022 Jun 2;23(1):210. doi: 10.1186/s12859-022-04751-6. PMID: 35655148. Free PMC article.
A comparative analysis of Spanish Clinical encoder-based models on NER and classification tasks.
García Subies G, Barbero Jiménez Á, Martínez Fernández P. J Am Med Inform Assoc. 2024 Sep 1;31(9):2137-2146. doi: 10.1093/jamia/ocae054. PMID: 38489543. Free PMC article.
Extract antibody and antigen names from biomedical literature.
Dinh TT, Vo-Chanh TP, Nguyen C, Huynh VQ, Vo N, Nguyen HD. BMC Bioinformatics. 2022 Dec 6;23(1):524. doi: 10.1186/s12859-022-04993-4. PMID: 36474140. Free PMC article.
The aluminum standard: using generative Artificial Intelligence tools to synthesize and annotate non-structured patient data.
Diaz Ochoa JG, Mustafa FE, Weil F, Wang Y, Kama K, Knott M. BMC Med Inform Decis Mak. 2024 Dec 28;24(1):409. doi: 10.1186/s12911-024-02825-4. PMID: 39732668. Free PMC article.
Hybrid natural language processing tool for semantic annotation of medical texts in Spanish.
Campillos-Llanos L, Valverde-Mateos A, Capllonch-Carrión A. BMC Bioinformatics. 2025 Jan 8;26(1):7. doi: 10.1186/s12859-024-05949-6. PMID: 39780059. Free PMC article.
