J Cheminform. 2023 May 19;15(1):53.
doi: 10.1186/s13321-023-00716-w.

Combatting over-specialization bias in growing chemical databases

Katharina Dost et al. J Cheminform. 2023.

Abstract

Background: Predicting the behavior of new chemical compounds in advance can support the design of new products by directing research toward the most promising candidates and ruling out others. Such predictive models can be data-driven, using Machine Learning, or based on researchers' experience, and they depend on the collection of past results. In either case, models (or researchers) can only make reliable assumptions about compounds that are similar to those they have seen before. Consistent use of these predictive models therefore shapes the dataset and causes a continuous specialization that shrinks the applicability domain of all models trained on this dataset in the future, increasingly harming model-based exploration of the space.

Proposed solution: In this paper, we propose CANCELS (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify regions of the compound space that are underrepresented and suggest additional experiments that help bridge the gap. Thereby, we generally improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. CANCELS does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.
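The idea of finding underrepresented regions and suggesting compounds that fill them can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository named in the Results paragraph); it is a simplified stand-in that assumes compounds are already encoded as numeric feature vectors, fits a plain Gaussian per principal component using the sample mean and standard deviation, and compares histogram counts against that fit. The function names `pca_project`, `bin_deficits`, and `score_candidates`, as well as the parameters `n_components` and `n_bins`, are hypothetical.

```python
import numpy as np
from math import erf, sqrt


def pca_project(X, n_components=2):
    """Center X and project it onto its top principal components (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    return (X - mean) @ components.T, mean, components


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def bin_deficits(values, n_bins=10):
    """Per-bin shortfall of observed counts relative to a fitted Gaussian.

    Assumption: a plain mean/std fit, which is simpler than the fitting
    procedure described in the paper.
    """
    mu, sigma = values.mean(), values.std()
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    observed, _ = np.histogram(values, bins=edges)
    cdf = np.array([normal_cdf(e, mu, sigma) for e in edges])
    expected = len(values) * np.diff(cdf)
    # Only shortfalls count; overpopulated bins get a deficit of zero.
    return edges, np.clip(expected - observed, 0.0, None)


def score_candidates(dataset, pool, n_components=2, n_bins=10):
    """Score pool compounds by how much they would fill underpopulated bins."""
    projected, mean, components = pca_project(dataset, n_components)
    pool_projected = (pool - mean) @ components.T
    scores = np.zeros(len(pool))
    for k in range(n_components):
        edges, deficit = bin_deficits(projected[:, k], n_bins)
        idx = np.digitize(pool_projected[:, k], edges) - 1
        # Candidates outside the dataset's range get no credit on this axis,
        # so the selection stays within the existing research domain.
        inside = (idx >= 0) & (idx < n_bins)
        scores[inside] += deficit[idx[inside]]
    return scores
```

Candidates with the highest scores would be the ones suggested for additional experiments. Because candidates outside the observed range receive no score, the sketch mirrors the stated goal of retaining specialization rather than covering the entire compound space.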

Results: An extensive set of experiments on the use case of biodegradation pathway prediction reveals not only that the bias spiral can indeed be observed but also that CANCELS produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial: it not only interrupts the continuous specialization process but also significantly improves a predictor's performance while reducing the number of required experiments. Overall, we believe that CANCELS can support researchers in their experimentation process, helping them not only to better understand their data and its potential flaws but also to grow the dataset in a sustainable way. All code is available at github.com/KatDost/Cancels.

Keywords: Bias; Chemical compound space; Data quality; Machine learning.

Conflict of interest statement

Jörg Wicker (co-founder, CTO) and Katharina Dost are employees of enviPath UG & Co. KG, a scientific software development company that develops and maintains the enviPath system. The authors declare no competing interests.

Figures

Fig. 1: Overview of the Imitate algorithm
Fig. 2: Comparison of the Gaussians fitted to a biased dataset (left) when using the traditional Expectation-Maximization fitting (center) and the fitting procedure outlined in the Imitate algorithm (right)
Fig. 3: Overview of CANCELS
Fig. 4: Comparison of different weights for Imitate with a custom boundary
Fig. 5: Qualitative dataset development for SOIL and BBD root compounds in relation to the compound space represented by the PubChem dataset and visualized in the PCA spaces obtained from SOIL (top), BBD (center), and PubChem (bottom). In all three datasets, white represents the highest density
Fig. 6: Quantitative development of SOIL and BBD root compounds in terms of the compounds’ average distance to their center (top) and their dataset size (bottom)
Fig. 7: Potential biases detected by CANCELS for SOIL (top) and BBD (bottom) visualized in their respective PCA spaces against the PubChem compound space
Fig. 8: Qualitative evaluation of the top 20 and top 50 compounds suggested by CANCELS to mitigate the detected biases in SOIL (top) and BBD (bottom) in comparison to the respective dataset’s compounds and the “Agrochemical” subset of PubChem. Note that categories are non-exclusive
Fig. 9: While holding out x% of the SOIL (top) and BBD (bottom) datasets, we train CANCELS on the rest. Bar heights represent average scores of the holdout set with their corresponding uncertainty intervals (black lines)
Fig. 10: Dividing the Tox21 dataset into a training set, a pool, and a test set, we train a classifier on either the training set only, the training set together with the entire pool, the training set plus CANCELS-based compound selection, or the training set plus a selection that feeds the biases instead of mitigating them. The box plot (left) displays the results in terms of accuracy when evaluating the trained models on the test set. A confidence interval plot (right) indicates that compound selection using CANCELS is significantly better than all other options
Fig. 11: Influence of different compound representations on the performance of CANCELS
Fig. 12: Influence of the number of principal components used in the dimensionality reduction of CANCELS
Fig. 13: Iterative application of CANCELS and all competing baselines (see Fig. 10) on the Tox21 dataset: In each of the five iterations, the compound selection takes place based on the training set and the selected compounds from previous iterations. For CANCELS, the accuracy improves upon all other selection strategies
Fig. 14: Number of added compounds in an iterative application of CANCELS
