Diseases 2.0: a weekly updated database of disease-gene associations from text mining and data integration

Dhouha Grissa¹, Alexander Junge¹, Tudor I Oprea¹,², Lars Juhl Jensen¹

Affiliations

¹ Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark.
² Department of Internal Medicine, Division of Translational Informatics, University of New Mexico Health Sciences Center, Albuquerque, NM, USA.

PMID: 35348648
PMCID: PMC9216524
DOI: 10.1093/database/baac019

Dhouha Grissa et al. Database (Oxford). 2022 Mar 28:2022:baac019. doi: 10.1093/database/baac019.

Abstract

The scientific knowledge about which genes are involved in which diseases grows rapidly, which makes it difficult to keep up with new publications and genetics datasets. The DISEASES database aims to provide a comprehensive overview by systematically integrating and assigning confidence scores to evidence for disease-gene associations from curated databases, genome-wide association studies (GWAS) and automatic text mining of the biomedical literature. Here, we present a major update to this resource, which greatly increases the number of associations from all these sources. This is especially true for the text-mined associations, which have increased at least 9-fold at all confidence cutoffs. We show that this dramatic increase is primarily due to adding full-text articles to the text corpus, secondarily due to improvements to both the disease and gene dictionaries used for named entity recognition, and only to a very small extent due to the growth in the number of PubMed abstracts. DISEASES now also makes use of a new GWAS database, Target Illumination by GWAS Analytics, which considerably increased the number of GWAS-derived disease-gene associations. DISEASES itself is also integrated into several other databases and resources, including GeneCards/MalaCards, Pharos/Target Central Resource Database and the Cytoscape stringApp. All data in DISEASES are updated weekly and are available via a web interface at https://diseases.jensenlab.org, from where they can also be downloaded under open licenses. Database URL: https://diseases.jensenlab.org.
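To make the idea of a confidence cutoff concrete, the following is a minimal sketch of filtering disease-gene associations by a minimum star score. The three-column tab-separated layout (gene, disease, stars), the names `Association`, `parse_associations` and `filter_by_confidence`, and the sample rows are all assumptions made for illustration; they are not the documented format of the DISEASES download files.

```python
import csv
import io
from typing import List, NamedTuple


class Association(NamedTuple):
    """One disease-gene association with its confidence score (hypothetical layout)."""
    gene: str
    disease: str
    stars: float


def parse_associations(tsv_text: str) -> List[Association]:
    """Parse rows of the assumed form: gene <TAB> disease <TAB> stars."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [Association(gene, disease, float(stars)) for gene, disease, stars in reader]


def filter_by_confidence(associations: List[Association], min_stars: float) -> List[Association]:
    """Keep only associations whose score is at least min_stars."""
    return [a for a in associations if a.stars >= min_stars]


# Made-up rows, not real DISEASES data.
SAMPLE = "GENE_A\tdisease X\t4.2\nGENE_B\tdisease X\t2.1\nGENE_C\tdisease Y\t3.0\n"

associations = parse_associations(SAMPLE)
confident = filter_by_confidence(associations, min_stars=3.0)
for a in confident:
    print(a.gene, a.disease, a.stars)
```

Raising or lowering `min_stars` trades coverage against reliability, which is the sense in which association counts can be compared "at all confidence cutoffs".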

Figures

**Figure 1.**
Overview of disease–gene associations in DISEASES. The number of disease–gene associations with a confidence score of at least 3 stars is proportional to the area of the pie charts, which represent high-level terms from Disease Ontology. In each pie chart, the associations are broken down by evidence type, i.e. curated knowledge, GWAS experiments and automatic text mining of the literature.

**Figure 2.**
Performance improvement of the *text mining* channel. As shown in the ROC curves, text mining performs markedly better in the new version of DISEASES (Dict2,FullText2021) compared to the originally published one (Dict1,PubMed2013). To quantify the sources of improvements, we show two additional curves: one using the new dictionary on the latest abstract collection only (Dict2,PubMed2021) and another using the old dictionary on the same abstracts (Dict1,PubMed2021). Comparing the curves reveals that most of the improvement stems from the addition of full-text articles, but that the new disease and gene dictionaries also led to considerable improvement. By contrast, the growth in PubMed abstracts from 2013 to 2021 made only a minor difference. The inset shows a zoom of the high-confidence part of the plot.
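For readers unfamiliar with ROC curves, the sketch below shows one generic way such a curve can be computed from scored associations and a set of known positives. It is not the authors' benchmarking procedure, which this record does not describe; the function names `roc_points` and `auc`, the item labels and the scores are hypothetical, and the sketch assumes all scores are distinct.

```python
from typing import List, Set, Tuple


def roc_points(scored: List[Tuple[str, float]], positives: Set[str]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    Items are visited from highest to lowest score, which corresponds to
    lowering the score cutoff one item at a time.
    """
    n_pos = sum(1 for item, _ in scored if item in positives)
    n_neg = len(scored) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative item")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for item, _score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if item in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up scored pairs and gold standard, for illustration only.
scored = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]
positives = {"a", "c"}

points = roc_points(scored, positives)
area = auc(points)
print(points, area)
```

A curve that rises faster at the left edge, the high-confidence region, means that more true associations are recovered before false ones start to appear.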
