Database (Oxford). 2022 Mar 28:2022:baac019.
doi: 10.1093/database/baac019.

Diseases 2.0: a weekly updated database of disease-gene associations from text mining and data integration

Dhouha Grissa et al. Database (Oxford). 2022.

Abstract

The scientific knowledge about which genes are involved in which diseases grows rapidly, which makes it difficult to keep up with new publications and genetics datasets. The DISEASES database aims to provide a comprehensive overview by systematically integrating and assigning confidence scores to evidence for disease-gene associations from curated databases, genome-wide association studies (GWAS) and automatic text mining of the biomedical literature. Here, we present a major update to this resource, which greatly increases the number of associations from all these sources. This is especially true for the text-mined associations, which have increased at least 9-fold at all confidence cutoffs. We show that this dramatic increase is primarily due to adding full-text articles to the text corpus, secondarily due to improvements to both the disease and gene dictionaries used for named entity recognition, and only to a very small extent due to the growth in the number of PubMed abstracts. DISEASES now also makes use of a new GWAS database, Target Illumination by GWAS Analytics, which considerably increased the number of GWAS-derived disease-gene associations. DISEASES itself is also integrated into several other databases and resources, including GeneCards/MalaCards, Pharos/Target Central Resource Database and the Cytoscape stringApp. All data in DISEASES are updated on a weekly basis and are available via a web interface at https://diseases.jensenlab.org, from where they can also be downloaded under open licenses. Database URL: https://diseases.jensenlab.org.
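The phrase "at least 9-fold at all confidence cutoffs" can be read as a ratio of association counts taken at each cutoff. The following is a minimal sketch of that arithmetic, not the authors' code. The function names, the score lists and the cutoffs are hypothetical toy values chosen for illustration and are not taken from DISEASES.

```python
def count_at_cutoff(scores, cutoff):
    """Count associations whose confidence score is at least `cutoff`."""
    return sum(1 for score in scores if score >= cutoff)


def fold_increase(old_scores, new_scores, cutoffs):
    """Return {cutoff: new_count / old_count} for each cutoff.

    A cutoff with no associations in the old version maps to infinity,
    because the ratio is undefined there.
    """
    result = {}
    for cutoff in cutoffs:
        old_n = count_at_cutoff(old_scores, cutoff)
        new_n = count_at_cutoff(new_scores, cutoff)
        result[cutoff] = new_n / old_n if old_n else float("inf")
    return result


# Hypothetical toy scores (one per disease-gene association), NOT real data.
old_version = [1, 2, 3, 4]
new_version = [1, 1, 2, 2, 3, 3, 4, 4]

folds = fold_increase(old_version, new_version, cutoffs=[1, 2, 3, 4])
# The smallest ratio over all cutoffs is the "at least N-fold" figure.
minimum_fold = min(folds.values())
```

Reporting the minimum over all cutoffs is what makes a statement of the form "at least N-fold at all cutoffs" hold.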

Figures

Figure 1.
Overview of disease–gene associations in DISEASES. The number of disease–gene associations with a confidence score of at least 3 stars is proportional to the area of the pie charts, which represent high-level terms from Disease Ontology. In each pie chart, the associations are broken down by evidence type, i.e. curated knowledge, GWAS experiments and automatic text mining of the literature.
Figure 2.
Performance improvement of the text mining channel. As shown in the ROC curves, text mining performs markedly better in the new version of DISEASES (Dict2,FullText2021) compared to the originally published one (Dict1,PubMed2013). To quantify the sources of improvements, we show two additional curves: one using the new dictionary on the latest abstract collection only (Dict2,PubMed2021) and another using the old dictionary on the same abstracts (Dict1,PubMed2021). Comparing the curves reveals that most of the improvement stems from the addition of full-text articles, but that the new disease and gene dictionaries also led to considerable improvement. By contrast, the growth in PubMed abstracts from 2013 to 2021 made only a minor difference. The inset shows a zoom of the high-confidence part of the plot.
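For readers unfamiliar with how ROC curves such as those in Figure 2 are built, here is a minimal sketch. It assumes each text-mined disease-gene pair has a score and that a gold-standard set marks the true pairs, with every scored pair outside that set counted as a negative. That benchmark definition, the function names and the toy pairs are assumptions for illustration and are not the authors' exact procedure.

```python
def roc_points(scored_pairs, gold_standard):
    """Return ROC points (false positive rate, true positive rate).

    scored_pairs: dict mapping (disease, gene) -> confidence score.
    gold_standard: set of (disease, gene) pairs treated as true.
    Pairs are visited from highest to lowest score; tied scores are
    processed together so each distinct score yields one point.
    """
    positives = sum(1 for pair in scored_pairs if pair in gold_standard)
    negatives = len(scored_pairs) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative pair")

    ranked = sorted(scored_pairs.items(), key=lambda item: item[1], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][1]
        while i < len(ranked) and ranked[i][1] == score:
            if ranked[i][0] in gold_standard:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data, NOT taken from DISEASES.
toy_scores = {("D1", "G1"): 4.0, ("D1", "G2"): 3.0,
              ("D2", "G1"): 2.0, ("D2", "G2"): 1.0}
toy_gold = {("D1", "G1"), ("D2", "G1")}

toy_curve = roc_points(toy_scores, toy_gold)
toy_auc = area_under_curve(toy_curve)
```

Running the same function on the output of each dictionary and corpus combination, against one shared gold standard, would give curves that can be compared in the way the caption describes.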
