A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine
- PMID: 33618727
- PMCID: PMC7898014
- DOI: 10.1186/s12911-021-01395-z
Erratum in
- Correction to: A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine. BMC Med Inform Decis Mak. 2021 Apr 7;21(1):118. doi: 10.1186/s12911-021-01475-0. PMID: 33827568. Free PMC article. No abstract available.
Abstract
Background: The large volume of medical literature makes it difficult for healthcare professionals to keep abreast of the latest studies that support Evidence-Based Medicine. Natural language processing enhances access to relevant information, and gold standard corpora are required to improve systems. To contribute a new dataset for this domain, we collected the Clinical Trials for Evidence-Based Medicine in Spanish (CT-EBM-SP) corpus.
Methods: We annotated 1200 texts about clinical trials with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). We doubly annotated 10% of the corpus and measured inter-annotator agreement (IAA) using F-measure. As a use case, we ran medical entity recognition experiments with neural network models.
Results: This resource contains 500 abstracts of journal articles about clinical trials and 700 announcements of trial protocols (292 173 tokens). We annotated 46 699 entities, of which 13.98% are nested entities. For IAA, we obtained an average F-measure of 85.65% (±4.79, strict match) and 93.94% (±3.31, relaxed match). In the use case experiments, we achieved recognition results ranging from 80.28% (±0.99) to 86.74% (±0.19) average F-measure.
Conclusions: Our results show that this resource is adequate for experiments with state-of-the-art approaches to biomedical named entity recognition. It is freely distributed at: http://www.lllf.uam.es/ESP/nlpmedterm_en.html . The methods are generalizable to other languages with similar available sources.
Keywords: Clinical Trials; Evidence-Based Medicine; Inter-Annotator Agreement; Natural Language Processing; Semantic Annotation.
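The abstract reports inter-annotator agreement as an F-measure under strict and relaxed matching. A minimal sketch of how such a score could be computed between two annotators follows. The matching criteria are assumptions, since the abstract does not define them: "strict" here means identical character offsets and label, and "relaxed" means overlapping spans with the same label. The `Entity` type, the `f_measure` function, and the sample annotations are hypothetical and are not taken from the corpus or the authors' tooling.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    start: int   # character offset where the span begins
    end: int     # character offset where the span ends (exclusive)
    label: str   # semantic group, e.g. ANAT, CHEM, DISO, PROC


def overlaps(a: Entity, b: Entity) -> bool:
    """True if the two spans share at least one character."""
    return a.start < b.end and b.start < a.end


def count_matches(source: List[Entity], target: List[Entity], relaxed: bool) -> int:
    """Count entities in `source` that have a match in `target`."""
    total = 0
    for e in source:
        if relaxed:
            # Assumed relaxed criterion: same label and overlapping span.
            hit = any(e.label == t.label and overlaps(e, t) for t in target)
        else:
            # Assumed strict criterion: same offsets and same label.
            hit = e in target
        total += int(hit)
    return total


def f_measure(ann_a: List[Entity], ann_b: List[Entity], relaxed: bool = False) -> float:
    """F-measure of annotator B against annotator A (A treated as reference)."""
    if not ann_a or not ann_b:
        return 0.0
    precision = count_matches(ann_b, ann_a, relaxed) / len(ann_b)
    recall = count_matches(ann_a, ann_b, relaxed) / len(ann_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented annotations of the same text by two annotators.
annotator_a = [Entity(0, 12, "DISO"), Entity(20, 30, "CHEM"), Entity(40, 45, "PROC")]
annotator_b = [Entity(0, 12, "DISO"), Entity(22, 30, "CHEM"), Entity(50, 55, "ANAT")]

strict_score = f_measure(annotator_a, annotator_b, relaxed=False)
relaxed_score = f_measure(annotator_a, annotator_b, relaxed=True)
print(f"strict: {strict_score:.4f}, relaxed: {relaxed_score:.4f}")
```

In this toy case the CHEM spans differ only in their start offset, so they count as a match under the relaxed criterion but not under the strict one, which is why relaxed agreement is never lower than strict agreement.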
Conflict of interest statement
The authors declare that they have no competing interests.