From data point timelines to a well curated data set, data mining of experimental data and chemical structure data from scientific articles, problems and possible solutions
- PMID: 23884706
- DOI: 10.1007/s10822-013-9664-4
Abstract
The scientific literature is an important source of experimental and chemical structure data. Very often these data have been harvested into data collections of varying size, leaving data quality and curation issues on the shoulders of users. The current research presents a systematic and reproducible workflow for collecting series of data points from the scientific literature and assembling a database that is suitable for high-quality modelling and decision support. The quality assurance aspect of the workflow is concerned with the curation of both chemical structures and associated toxicity values at the level of (1) a single data point and (2) a collection of data points. The assembly of the database employs a novel "timeline" approach. The workflow is implemented as a software solution, and its applicability is demonstrated on the example of the Tetrahymena pyriformis acute aquatic toxicity endpoint. A literature collection of 86 primary publications for T. pyriformis was found to contain 2,072 chemical compounds and 2,498 unique toxicity values, which divide into 2,440 numerical and 58 textual values. Every chemical compound was assigned a preferred toxicity value. Examples of the most common chemical and toxicological data curation scenarios are discussed.
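The abstract does not spell out how a timeline is built or how the preferred value is chosen, so the following is only a minimal sketch of one plausible reading: data points for a compound are ordered by publication year, and a preferred value is then picked from that ordered series. The `DataPoint` fields, the function names, the selection rule (latest numerical value, falling back to the latest textual value), and the sample records are all assumptions made for illustration; none of them come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class DataPoint:
    compound_id: str          # hypothetical identifier of a curated structure
    year: int                 # publication year of the source article
    value: Union[float, str]  # numerical or textual toxicity value
    source: str               # hypothetical citation key


def build_timelines(points: List[DataPoint]) -> Dict[str, List[DataPoint]]:
    """Group data points by compound and order each group by year."""
    timelines: Dict[str, List[DataPoint]] = defaultdict(list)
    for point in points:
        timelines[point.compound_id].append(point)
    for timeline in timelines.values():
        # list.sort is stable, so same-year points keep their input order.
        timeline.sort(key=lambda p: p.year)
    return dict(timelines)


def preferred_value(timeline: List[DataPoint]) -> Optional[DataPoint]:
    """Assumed rule: latest numerical value, else latest textual value."""
    numeric = [p for p in timeline if isinstance(p.value, (int, float))]
    pool = numeric or timeline
    return pool[-1] if pool else None


# Placeholder records invented for this example; not data from the paper.
points = [
    DataPoint("compound_A", 2001, 1.5, "ref_b"),
    DataPoint("compound_A", 1995, 1.0, "ref_a"),
    DataPoint("compound_A", 2005, "not determined", "ref_c"),
    DataPoint("compound_B", 1998, "inactive", "ref_d"),
]

timelines = build_timelines(points)
for compound_id, timeline in timelines.items():
    chosen = preferred_value(timeline)
    print(compound_id, [p.year for p in timeline], chosen.value)
```

A real curation workflow would likely weigh more than recency, for example experimental protocol or agreement between sources; the rule here is kept deliberately simple so that the data structure stays visible.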