Chem Inform. 2017;3(1):1.
Epub 2017 Apr 26.

High-Throughput Screening Assay Datasets from the PubChem Database


Mariusz Butkiewicz et al. Chem Inform. 2017.

Abstract

The availability of high-throughput screening (HTS) data in the public domain offers great potential to foster the development of ligand-based computer-aided drug discovery (LB-CADD) methods, which are crucial for drug discovery efforts in academia and industry. LB-CADD method development depends on high-quality HTS assay data, i.e., datasets that contain both active and inactive compounds. The active compounds are hits from primary screens that have been tested in concentration-response experiments and whose target specificity has been validated through suitable secondary screening experiments. Publicly available HTS repositories such as PubChem often provide such data in a convoluted way. Compounds classified as inactive need to be extracted from the primary screening record. Compounds classified as active in the primary screening record, however, are not suitable as a set of active compounds for LB-CADD experiments because of the high false-positive rate. A suitable set of actives can instead be derived by carefully analysing the results of the assays, often five or more, that are used to confirm and classify the activity of compounds. These assays, in part, build on each other, but often not all hit compounds from the previous screen have been tested in the next one. In addition, a compound may be labelled 'active' in an assay record even though it is effectively 'inactive' on the target of interest, because it is 'active' on a different target protein. Here, a curation process for hierarchically related confirmatory screens is illustrated using two specifically chosen protein use cases, and the subsequent re-upload of the findings into PubChem is described for both. Further, we provide nine publicly accessible, high-quality datasets for future LB-CADD method development, giving the scientific community a common baseline for comparing future methods. We also provide a protocol that researchers can follow to upload additional datasets for benchmarking.

Keywords: Datasets; HTS; LB-CADD; PubChem.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1
Curation process of AID1040. The center green arrow represents the initial set of active compounds, while each red arrow symbolizes a specific subtraction of compounds.
Figure 2
Curation process of AID793. The center green arrow leads to the final set of active compounds, while the red arrows, together with the compound counts and types shown in red, mark compound subtractions.
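
The subtraction logic described in the abstract and depicted in Figures 1 and 2 can be pictured as a few set operations on compound identifiers. The following is only a minimal sketch under stated assumptions, not the authors' actual protocol: the function name `curate_activity_sets`, its parameters, and the integer identifiers are hypothetical, and the choice to exclude unconfirmed or off-target hits from both sets (rather than relabel them as inactive) is an assumption made for illustration.

```python
from typing import Dict, Iterable, Set


def curate_activity_sets(
    primary_tested: Iterable[int],
    primary_active: Iterable[int],
    confirmed_active: Iterable[int],
    off_target_active: Iterable[int],
) -> Dict[str, Set[int]]:
    """Derive active/inactive sets from a primary screen and follow-up assays.

    All arguments are collections of hypothetical compound identifiers:
      primary_tested    -- every compound in the primary screening record
      primary_active    -- compounds flagged 'active' in the primary screen
      confirmed_active  -- hits confirmed in concentration-response assays
      off_target_active -- hits found 'active' on a different target protein
    """
    tested = set(primary_tested)
    hits = set(primary_active) & tested

    # Inactives come straight from the primary screening record.
    inactives = tested - hits

    # Actives: primary hits that were confirmed, minus those whose
    # activity was attributed to a different target (a "red arrow").
    actives = (hits & set(confirmed_active)) - set(off_target_active)

    # Assumption: primary hits that were never confirmed, or that were
    # off-target, are left out of both sets instead of being relabelled.
    excluded = hits - actives

    return {"actives": actives, "inactives": inactives, "excluded": excluded}


if __name__ == "__main__":
    result = curate_activity_sets(
        primary_tested=range(1, 11),
        primary_active=[1, 2, 3, 4],
        confirmed_active=[1, 2, 3],
        off_target_active=[3],
    )
    for label, compounds in result.items():
        print(label, sorted(compounds))
```

In this toy run, compound 4 is dropped because it was a primary hit that was never confirmed, and compound 3 is dropped because its confirmed activity was on a different target; neither ends up in the active or inactive set.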

