Environ Health Perspect. 2019 Sep;127(9):97008.
doi: 10.1289/EHP4713. Epub 2019 Sep 26.

Generating the Blood Exposome Database Using a Comprehensive Text Mining and Database Fusion Approach

Dinesh Kumar Barupal et al. Environ Health Perspect. 2019 Sep.

Abstract

Background: Blood chemicals are routinely measured in clinical or preclinical research studies to diagnose diseases, to assess risks in epidemiological research, or to characterize metabolomic phenotypes in response to treatments. A vast volume of blood-related literature is available via the PubMed database for data mining.

Objectives: We aimed to generate a comprehensive blood exposome database of endogenous and exogenous chemicals associated with the mammalian circulating system through text mining and database fusion.

Methods: Using NCBI resources, we retrieved PubMed abstracts, PubChem chemical synonyms, and PMC supplementary tables. We then employed text mining and PubChem crowdsourcing to associate phrases relating to blood with PubChem chemicals. False positives were removed by a phrase pattern and a compound exclusion list.
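As a rough illustration of the matching step described in the Methods, the sketch below links chemical synonyms to abstracts that also mention a blood-related phrase, then drops compounds on an exclusion list. It is a minimal toy under stated assumptions, not the authors' pipeline: the phrase list, the tiny synonym dictionary, the excluded compound, and the function names `build_pattern` and `match_chemicals` are all hypothetical, and the PMIDs are placeholders.

```python
import re

# Assumed inputs for illustration only. The real workflow drew on millions of
# PubChem synonyms and PubMed abstracts; these values are placeholders.
BLOOD_PHRASES = ("blood", "plasma", "serum")
SYNONYM_TO_CID = {"caffeine": 2519, "cholesterol": 5997, "water": 962}
EXCLUDED_CIDS = {962}  # hypothetical exclusion list (ubiquitous compound)


def build_pattern(terms):
    """Compile a case-insensitive, whole-word pattern matching any term."""
    # Longest terms first so multi-word synonyms win over their substrings.
    ordered = sorted(terms, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + body + r")\b", re.IGNORECASE)


def match_chemicals(abstracts, synonym_to_cid, excluded_cids,
                    blood_phrases=BLOOD_PHRASES):
    """Return {cid: set of PMIDs} for abstracts that mention a blood phrase."""
    chem_pattern = build_pattern(synonym_to_cid)
    blood_pattern = build_pattern(blood_phrases)
    links = {}
    for pmid, text in abstracts.items():
        if not blood_pattern.search(text):
            continue  # abstract is not blood-related
        for hit in chem_pattern.findall(text):
            cid = synonym_to_cid[hit.lower()]
            if cid in excluded_cids:
                continue  # treated as a likely false positive
            links.setdefault(cid, set()).add(pmid)
    return links


# Placeholder abstracts keyed by made-up PMIDs.
example_abstracts = {
    101: "Plasma caffeine and cholesterol were measured in water-fasted rats.",
    102: "Caffeine content of soil samples.",
    103: "Serum Cholesterol levels rose.",
}
example_links = match_chemicals(example_abstracts, SYNONYM_TO_CID, EXCLUDED_CIDS)
```

A dictionary keyed by lowercase synonym keeps the lookup simple. At real scale a single regular expression over millions of synonyms would likely be impractical, so a trie or Aho-Corasick matcher would be a more plausible choice.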

Results: A query to identify blood-related publications in the PubMed database yielded 1.1 million papers. Matching a total of 15 million synonyms from 6.5 million relevant PubChem chemicals against all blood-related publications yielded 37,514 chemicals and 851,999 publication records. Mapping PubChem compound identifiers to the PubMed database yielded 49,940 unique chemicals linked to 676,643 papers. Analysis of open-access metabolomics papers related to blood phrases in the PMC database yielded 4,039 unique compounds and 204 papers. Consolidating these three approaches yielded a total of 41,474 achiral structures that were linked to 65,957 PubChem CIDs and to 878,966 PubMed articles. We mapped these compounds to 50 databases such as those covering metabolites and pathways, governmental and toxicological databases, pharmacology resources, and bioassay repositories. In comparison, HMDB, the Human Metabolome Database, links 1,075 compounds to blood-related primary publications.
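The consolidation step, in which several stereoisomer-specific CIDs collapse onto one achiral structure, can be sketched as a grouped set union. The abstract does not say how achiral structures were keyed, so this example assumes the first block of a standard InChIKey, which hashes the molecular skeleton without stereochemistry. The function names and the example records are hypothetical, and the PMIDs are placeholders.

```python
from collections import defaultdict


def achiral_key(inchikey):
    """Return the first InChIKey block (skeleton hash, no stereo layer)."""
    return inchikey.split("-")[0]


def consolidate(*sources):
    """Merge (inchikey, cid, pmid) records from several sources.

    Records that share an achiral key are pooled, so one structure can
    carry several CIDs and the union of their linked PMIDs.
    """
    merged = defaultdict(lambda: {"cids": set(), "pmids": set()})
    for source in sources:
        for inchikey, cid, pmid in source:
            entry = merged[achiral_key(inchikey)]
            entry["cids"].add(cid)
            entry["pmids"].add(pmid)
    return dict(merged)


# Illustrative records only; identifiers and PMIDs are placeholders.
abstract_hits = [
    ("QNAYBMKLOCPYGJ-REOHSFRGSA-N", 5950, 1),   # L-alanine
    ("RYYVLZVUVIJVGH-UHFFFAOYSA-N", 2519, 2),   # caffeine
]
pubchem_links = [
    ("QNAYBMKLOCPYGJ-UWTATZPHSA-N", 71080, 3),  # D-alanine
    ("RYYVLZVUVIJVGH-UHFFFAOYSA-N", 2519, 2),   # duplicate link, deduplicated
]
merged_structures = consolidate(abstract_hits, pubchem_links)
```

Here the two alanine enantiomers collapse into one achiral entry with two CIDs, mirroring how 65,957 CIDs can map onto 41,474 achiral structures.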

Conclusion: This new Blood Exposome Database can be used for prioritizing chemicals for systematic reviews, developing target assays in exposome research, identifying compounds in untargeted mass spectrometry, and supporting biological interpretation of metabolomics data. The database is available at http://bloodexposome.org. https://doi.org/10.1289/EHP4713.

Figures

Figure 1. Overview schema for constructing the Blood Exposome Database (conceptual workflow diagram). Three NCBI-hosted databases were used as inputs for the workflow, which yielded 42,000 two-dimensional structures for blood specimens.
Figure 2. Overlap analysis of the origin of 41,474 achiral blood chemicals, shown as a diagram of four sets: HMDB blood (1,075), PubChem PubMed mapping (29,625), PubMed abstract search (28,284), and PMC blood metabolisms (3,436). PubChem-to-PubMed mapping provided the most comprehensive overview of the blood-related compounds.
Figure 3. Distribution of lipophilicity (A), molecular weight (B), and publication count (C) in the Blood Exposome Database. The y-axis shows the frequency of chemicals. XLogP is a unitless measure of lipophilicity, in which negative values indicate more polar compounds.
