J Cheminform. 2014 May 14:6:25.
doi: 10.1186/1758-2946-6-25. eCollection 2014.

QSAR DataBank - an approach for the digital organization and archiving of QSAR model information

Villu Ruusmann et al. J Cheminform. 2014.

Abstract

Background: Research efforts in the field of descriptive and predictive Quantitative Structure-Activity Relationships or Quantitative Structure-Property Relationships produce around one thousand scientific publications annually. The materials and results are communicated mainly through printed media. Printed media in their present form have obvious limitations when it comes to effectively representing mathematical models, including complex and non-linear ones, and large bodies of associated numerical chemical data. They do not support secondary information extraction or reuse efforts, while in silico studies pose additional requirements for accessibility, transparency and reproducibility of the research. This gap can and should be bridged by introducing domain-specific digital data exchange standards and tools. The current publication presents a formal specification of the quantitative structure-activity relationship data organization and archival format called the QSAR DataBank (QsarDB for short, or QDB for shorter).

Results: The article describes the QsarDB data schema, which formalizes QSAR concepts (objects and the relationships between them), and the QsarDB data format, which formalizes their presentation for computer systems. The utility and benefits of QsarDB have been thoroughly tested by solving everyday QSAR and predictive modeling problems, with examples in the field of predictive toxicology, and QsarDB can be applied to a wide variety of other endpoints. The work is accompanied by an open source reference implementation and tools.

Conclusions: The proposed open data, open source, and open standards design is open to public and proprietary extensions on many levels. Selected use cases exemplify the benefits of the proposed QsarDB data format. General ideas for future development are discussed.

Keywords: Data format; Data interoperability; Open science; QSAR; QSPR.

Figures

Figure 1
The layout of a QDB archive in the local file system.
Figure 2
Screenshot of the graphical QsarDB curation application.
Figure 3
Container type hierarchy. There are two abstract Container types (italic typeface) and five concrete Container types (normal typeface). Descending Container types inherit all attributes and cargos from their parent Container types. For example, Property defines two attributes (Endpoint, Species). Additionally, Property inherits five attributes (Id, Name, Description, Labels, Cargos) and one cargo (BibTeX) from Container, and three cargos (UCUM, values, references) from Parameter.
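
The inheritance described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the reference implementation: the class layout, the `own_cargo_kinds` class variable and the `cargo_kinds` helper are hypothetical names, and the abstract status of Container and Parameter is not enforced here. Only the attribute and cargo names come from the caption.

```python
from dataclasses import dataclass, field, fields
from typing import ClassVar, List, Tuple


@dataclass
class Container:
    """Root type: the five attributes named in the caption."""
    id: str
    name: str
    description: str = ""
    labels: List[str] = field(default_factory=list)
    cargos: List[str] = field(default_factory=list)

    # Cargo kinds introduced at this level (hypothetical bookkeeping).
    own_cargo_kinds: ClassVar[Tuple[str, ...]] = ("BibTeX",)

    @classmethod
    def cargo_kinds(cls) -> Tuple[str, ...]:
        """Collect cargo kinds from the root type down to this type."""
        kinds: List[str] = []
        for klass in reversed(cls.__mro__):
            kinds.extend(klass.__dict__.get("own_cargo_kinds", ()))
        return tuple(kinds)


@dataclass
class Parameter(Container):
    """Intermediate type: adds three cargo kinds, no new attributes."""
    own_cargo_kinds: ClassVar[Tuple[str, ...]] = ("UCUM", "values", "references")


@dataclass
class Property(Parameter):
    """Concrete type: adds the Endpoint and Species attributes."""
    endpoint: str = ""
    species: str = ""
    own_cargo_kinds: ClassVar[Tuple[str, ...]] = ()


# Hypothetical identifier and name, for illustration only.
example = Property(id="example_property", name="Example property")
attribute_names = [f.name for f in fields(example)]
```

With these definitions, `attribute_names` lists the five inherited attributes followed by the two that Property defines itself, and `Property.cargo_kinds()` gathers the one cargo from Container and the three from Parameter.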
Figure 4
Flat (above) and structured (below) approaches for the encoding of container attributes.
Figure 5
Disambiguation of container identifiers in the DataDictionary element using the “prefixed identifier” mechanism.
Figure 6
Different QSAR data sets relative to the availability of experimental data and the QSAR lifecycle. Training and internal validation share the same data set (blue). External validation and testing have their own disjoint data sets (orange and yellow).
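
The disjointness requirement in the Figure 6 caption lends itself to a simple check. The sketch below assumes each data set is represented as a set of compound identifiers; the function name `overlapping_sets` and the identifiers are hypothetical and not part of QsarDB.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def overlapping_sets(datasets: Dict[str, Set[str]]) -> List[Tuple[str, str, Set[str]]]:
    """Return every pair of data sets that shares compound identifiers."""
    overlaps = []
    for (name_a, ids_a), (name_b, ids_b) in combinations(datasets.items(), 2):
        shared = ids_a & ids_b
        if shared:
            overlaps.append((name_a, name_b, shared))
    return overlaps


# Hypothetical compound identifiers, for illustration only.
datasets = {
    "training": {"c1", "c2", "c3"},  # also used for internal validation
    "external_validation": {"c4", "c5"},
    "testing": {"c6"},
}
problems = overlapping_sets(datasets)
```

An empty `problems` list means the three data sets are pairwise disjoint. Internal validation does not appear as a separate entry because, per the caption, it reuses the training set.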
Figure 7
Strong (solid arrows) and weak (dashed arrows) relationships between the five container types. The ordering of Container types (along the complexity axis) follows the incremental buildup of a QDB archive.
