The RareDis corpus: A corpus annotated with rare diseases, their signs and symptoms
- PMID: 34879250
- DOI: 10.1016/j.jbi.2021.103961
Abstract
Rare diseases affect a small number of people compared to the general population. However, more than 6,000 different rare diseases exist and, in total, they affect more than 300 million people worldwide. A central problem shared by rare diseases is the delay in diagnosis and the sparse information available to researchers, clinicians, and patients. Obtaining a diagnosis can be a very long and frustrating experience for patients and their families. The average diagnostic delay is between 6 and 8 years. Many of these diseases manifest differently among patients, which further hampers their detection and the choice of the correct treatment. Therefore, there is an urgent need to increase the scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) can help to extract relevant information about rare diseases to facilitate their diagnosis and treatment, but most NLP techniques require manually annotated corpora. Therefore, our goal is to create a gold standard corpus annotated with rare diseases and their clinical manifestations. It could be used to train and test NLP approaches. The information extracted through NLP could enrich the knowledge of rare diseases and thereby help to reduce the diagnostic delay and improve the treatment of rare diseases. The paper describes the selection of 1,041 texts to be included in the corpus, the annotation process and the annotation guidelines. The annotated entities are disease, rare disease, symptom, sign and anaphor, and the annotated relationships are produces, is a, is acron, is synon, increases risk of, and anaphora. The RareDis corpus contains annotations of more than 5,000 rare diseases and almost 6,000 clinical manifestations. Moreover, the inter-annotator agreement evaluation shows a relatively high agreement (F1-measure equal to 83.5% under exact match criteria for the entities and equal to 81.3% for the relations).
Based on these results, this corpus is of high quality and represents a significant step for the field, since there is a scarcity of available corpora annotated with rare diseases. This could open the door to further NLP applications, which would facilitate the diagnosis and treatment of these rare diseases and, therefore, would dramatically improve the quality of life of these patients.
Keywords: Gold-standard corpus; Named Entity Recognition; Rare Diseases; Relation Extraction.
Copyright © 2021 The Author(s). Published by Elsevier Inc. All rights reserved.
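The exact-match F1 agreement reported in the abstract can be illustrated with a minimal sketch. It assumes that one annotator's entities are treated as the reference and the other's as the candidate, and that two entities match only when their character offsets and labels are identical. The abstract does not describe the evaluation procedure in this detail, so the `Entity` type, the `exact_match_f1` function, the label strings and the offsets below are all hypothetical.

```python
from typing import Iterable, NamedTuple


class Entity(NamedTuple):
    """A hypothetical entity annotation: character span plus label."""
    start: int
    end: int
    label: str


def exact_match_f1(reference: Iterable[Entity], candidate: Iterable[Entity]) -> float:
    """F1 between two annotation sets, counting only identical (start, end, label) triples."""
    ref, cand = set(reference), set(candidate)
    if not ref and not cand:
        return 1.0  # two empty annotation sets agree trivially
    matched = len(ref & cand)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations of the same text by two annotators (made-up offsets).
annotator_a = [
    Entity(0, 15, "RAREDISEASE"),
    Entity(30, 42, "SIGN"),
    Entity(50, 58, "SYMPTOM"),
]
annotator_b = [
    Entity(0, 15, "RAREDISEASE"),
    Entity(30, 40, "SIGN"),  # boundary differs, so no exact match
    Entity(50, 58, "SYMPTOM"),
]

score = exact_match_f1(annotator_a, annotator_b)
print(f"Exact-match F1: {score:.3f}")
```

Under exact matching, the boundary disagreement on the second entity counts as both a miss and a spurious annotation, which is why this criterion is stricter than a partial-overlap criterion. The same function could be applied to relations by replacing the entity triples with hashable relation tuples.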