Comput Toxicol. 2019 Nov 1:12:10.1016/j.comtox.2019.100096.
doi: 10.1016/j.comtox.2019.100096.

EPA's DSSTox database: History of development of a curated chemistry resource supporting computational toxicology research

Christopher M Grulke et al. Comput Toxicol. 2019.

Abstract

The US Environmental Protection Agency's (EPA) Distributed Structure-Searchable Toxicity (DSSTox) database, launched publicly in 2004, currently exceeds 875,000 substances spanning hundreds of lists of interest to EPA and environmental researchers. From its inception, DSSTox has focused curation efforts on resolving chemical identifier errors and conflicts in the public domain, with the goal of assigning accurate chemical structures to data and lists of importance to the environmental research and regulatory community. Accurate structure-data associations, in turn, are necessary inputs to structure-based predictive models supporting hazard and risk assessments. In 2014, the legacy, manually curated DSSTox_V1 content was migrated to a MySQL data model, with modern cheminformatics tools supporting both manual and automated curation processes to increase efficiency. This was followed by sequential auto-loads of filtered portions of three public datasets: EPA's Substance Registry Services (SRS), the National Library of Medicine's ChemID, and PubChem. This process was constrained by a key requirement of uniquely mapped identifiers (i.e., CAS RN, name, and structure) for each substance, rejecting content where any two identifiers conflicted, either within or across datasets. The rejected content highlighted the degree of conflicting, inaccurate substance-structure ID mappings in the public domain, ranging from 12% (within EPA SRS) to 49% (across ChemID and PubChem). Substances successfully added to DSSTox from each auto-load were assigned to one of five qc_levels, conveying curator confidence in each dataset. This process enabled a significant expansion of DSSTox content to provide better coverage of the chemical landscape of interest to environmental scientists, while retaining focus on the accuracy of substance-structure-data associations. Currently, DSSTox serves as the core foundation of EPA's CompTox Chemicals Dashboard [https://comptox.epa.gov/dashboard], which provides public access to DSSTox content in support of a broad range of modeling and research activities within EPA and, increasingly, across the field of computational toxicology.
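
The unique-mapping requirement described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the Record type, its field names, and the placeholder identifiers are hypothetical, and the real auto-loads also applied other quality filters and assigned qc_levels, none of which is modeled here. The sketch rejects every record whose CAS RN, name, or structure key is paired with more than one value of another identifier anywhere in the input.

```python
from collections import defaultdict
from itertools import permutations
from typing import NamedTuple


class Record(NamedTuple):
    """Hypothetical substance record; 'structure' stands in for a structure key such as an InChIKey."""
    casrn: str
    name: str
    structure: str


FIELDS = ("casrn", "name", "structure")


def split_by_unique_mapping(records):
    """Return (accepted, rejected) lists under a one-to-one identifier rule."""
    # For each ordered pair of fields (a, b), collect every b-value
    # observed together with a given a-value.
    seen = defaultdict(set)
    for rec in records:
        for a, b in permutations(FIELDS, 2):
            seen[(a, getattr(rec, a), b)].add(getattr(rec, b))

    accepted, rejected = [], []
    for rec in records:
        # A record conflicts if any of its identifiers maps to more than
        # one value of any other identifier.
        conflicted = any(
            len(seen[(a, getattr(rec, a), b)]) > 1
            for a, b in permutations(FIELDS, 2)
        )
        (rejected if conflicted else accepted).append(rec)
    return accepted, rejected


# Placeholder data (not real chemicals): records B and C share one structure key.
EXAMPLE_RECORDS = [
    Record("CAS-A", "substance a", "KEY-1"),
    Record("CAS-B", "substance b", "KEY-2"),
    Record("CAS-C", "substance c", "KEY-2"),
    Record("CAS-D", "substance d", "KEY-3"),
]

accepted, rejected = split_by_unique_mapping(EXAMPLE_RECORDS)
```

In this toy input, the two records that share a structure key are both rejected, mirroring the situation in which two names and CAS RNs are assigned one structure; under this assumed rule such records would be set aside for curator review rather than loaded.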

Keywords: Chemistry database; Computational toxicology; DSSTox; Data quality; Environmental science; QSAR; Structure curation.

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Fig. 1. Schematic illustrating the main tabular and relationship components of the DSSTox_V2 data model, centered around the DSSTox_Core substance-structure content.
Fig. 2. Screen snapshot view of DSSTox ChemReg application, built to provide an interface for trained DSSTox curators to register and edit new and existing DSSTox substance records subject to structure and substance data model controls.
Fig. 3. Most commonly encountered cases of “Predecessor” substance records, with either no CAS RN or no structure, mapped to a corresponding “Successor” substance containing a CAS RN and structure.
Fig. 4. The process by which content from 3 public databases (EPA’s Substance Registry Services - SRS, NLM’s ChemID, and PubChem) was quality filtered, and either assigned to one of five qc_levels and sequentially loaded into the DSSTox_Core portion of the DSSTox_V2 data model in 2014 or rejected and placed in the Public_Untrusted bin, requiring further curation review along with other queued EPA lists.
Fig. 5. Two ChemID substance records listing the same structure (and InChIKey) for two different Substance Names and CAS RNs. In this case, the Names and CAS RNs are correctly paired, but the structure assigned to the top record is an approximate representation given that the position of the triple bond is unspecified.
Fig. 6. Shown for each of the three public databases that were sequentially added during the DSSTox_V2 expansion phase - (a) EPA SRS, (b) ChemID, and (c) PubChem - is the process by which chemicals were quality filtered, and the numbers of chemicals at each step that were either removed from further consideration or moved forward for possible incorporation into the expanding DSSTox_Core.
Fig. 7. Snapshot view of the DSSTox Curation Interface used by DSSTox curators to register lists; shown on the left are the totals in the various identifier conflict bins that remain to be curator-validated, where each bin and each conflicted record within each bin can be accessed by the curator (2 expanded views shown).
Fig. 8. Total numbers of DSSTox substances and registered lists (public and Internal EPA) as of February 2019.
Fig. 9. Dashboard view of three types of Markush structures, with a representative sample of the 100 enumerated “child” structures shown for “Polychlorinated biphenyls”, as retrieved under the tab “RELATED SUBSTANCES” [https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID5024267#related-substances].

