From Web to RheumaLpack: Creating a Linguistic Corpus for Exploitation and Knowledge Discovery in Rheumatology

Alfredo Madrid-García¹, Beatriz Merino-Barbancho², Dalifer Freites-Núñez³, Luis Rodríguez-Rodríguez³, Ernestina Menasalvas-Ruíz⁴, Alejandro Rodríguez-González⁴, Anselmo Peñas⁵

Affiliations

¹ Grupo de Patología Musculoesquelética, Hospital Clínico San Carlos, Instituto de Investigación Sanitaria San Carlos (IdISSC), Prof. Martin Lagos s/n, Madrid, 28040, Spain. Electronic address: alfredo.madrid@salud.madrid.org.
² Escuela Técnica Superior de Ingenieros de Telecomunicación, Universidad Politécnica de Madrid, Avenida Complutense, 30, Madrid, 28040, Spain.
³ Grupo de Patología Musculoesquelética, Hospital Clínico San Carlos, Instituto de Investigación Sanitaria San Carlos (IdISSC), Prof. Martin Lagos s/n, Madrid, 28040, Spain.
⁴ Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain; Escuela Técnica Superior de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, 28660, Spain.
⁵ UNED NLP & IR Group, Universidad Nacional de Educación a Distancia, Juan del Rosal 16, 28040, Madrid, Spain.

PMID: 39047506
DOI: 10.1016/j.compbiomed.2024.108920

Free article

Alfredo Madrid-García et al. Comput Biol Med. 2024 Sep.

Comput Biol Med. 2024 Sep:179:108920.

doi: 10.1016/j.compbiomed.2024.108920. Epub 2024 Jul 23.

Abstract

This study introduces RheumaLinguisticpack (RheumaLpack), the first specialised linguistic web corpus designed for the field of musculoskeletal disorders. By combining web mining (i.e., web scraping) and natural language processing (NLP) techniques with clinical expertise, RheumaLpack systematically captures and curates structured and unstructured data across a spectrum of web sources, including clinical trials registers (i.e., ClinicalTrials.gov), bibliographic databases (i.e., PubMed), medical agencies (i.e., European Medicines Agency), social media (i.e., Reddit), and accredited health websites (i.e., MedlinePlus, Harvard Health Publishing, and Cleveland Clinic). Given the complexity of rheumatic and musculoskeletal diseases (RMDs) and their significant impact on quality of life, this resource is proposed as a useful tool to train algorithms that could mitigate the diseases' effects. The corpus therefore aims to improve the training of artificial intelligence (AI) algorithms and facilitate knowledge discovery in RMDs. The development of RheumaLpack involved a systematic six-step methodology covering data identification, characterisation, selection, collection, processing, and corpus description. The result is a non-annotated, monolingual, and dynamic corpus, featuring almost 3 million records spanning from 2000 to 2023. RheumaLpack represents a pioneering contribution to rheumatology research, providing a useful resource for the development of advanced AI and NLP applications. This corpus highlights the value of web data in addressing the challenges posed by musculoskeletal diseases and illustrates its potential to improve research and treatment paradigms in rheumatology. Finally, the methodology shown can be replicated to obtain data from other medical specialities. The code and details on how to build RheumaLpack are also provided to facilitate the dissemination of this resource.

Keywords: Artificial intelligence; Natural language processing; Rheumatology; Web corpus.
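
As an illustration of the collection and processing steps named in the abstract, the following is a minimal sketch, not the authors' code. It assumes that PubMed records would be requested through the public NCBI E-utilities `esearch` endpoint; the abstract does not state how each source was accessed. The search term, the function names, and the record layout are hypothetical; only the 2000 to 2023 date range comes from the abstract.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed access route, not confirmed by the paper).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, start_year=2000, end_year=2023, retmax=100):
    """Collection step: build an esearch query URL restricted by publication date."""
    params = {
        "db": "pubmed",          # bibliographic database named in the abstract
        "term": term,            # hypothetical search term supplied by the caller
        "datetype": "pdat",      # filter on publication date
        "mindate": str(start_year),
        "maxdate": str(end_year),
        "retmax": retmax,        # maximum number of identifiers returned
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def normalise_record(source, identifier, text, year):
    """Processing step: collapse whitespace and wrap the text in a common record shape.

    The field names are illustrative; the actual corpus schema may differ.
    """
    return {
        "source": source,
        "id": identifier,
        "text": " ".join(text.split()),
        "year": year,
    }


if __name__ == "__main__":
    # Hypothetical usage: print the query URL and one cleaned record.
    print(build_pubmed_search_url("rheumatoid arthritis"))
    print(normalise_record("pubmed", "example-1", "  Joint   pain\n and stiffness ", 2023))
```

Keeping every source in one record shape is a design choice of this sketch: it would let records from registers, databases, and social media be stored and filtered together, whatever schema the real corpus uses.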

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
