J Cheminform. 2023 May 19;15(1):53.

doi: 10.1186/s13321-023-00716-w.

Combatting over-specialization bias in growing chemical databases

Katharina Dost¹ ², Zac Pullar-Strecker³, Liam Brydon³, Kunyang Zhang⁴, Jasmin Hafner⁴, Patricia J Riddle³, Jörg S Wicker³ ⁵

Affiliations

¹ School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand. katharina.dost@auckland.ac.nz.
² enviPath UG & Co. KG, In den Graswiesen 13, 55437, Ockenheim, Germany. katharina.dost@auckland.ac.nz.
³ School of Computer Science, University of Auckland, 38 Princes Street, 1010, Auckland, New Zealand.
⁴ Eawag-Swiss Federal Institute of Aquatic Science and Technology, Überlandstrasse 133, 8600, Dübendorf, Switzerland.
⁵ enviPath UG & Co. KG, In den Graswiesen 13, 55437, Ockenheim, Germany.

PMID: 37208694
PMCID: PMC10197453
DOI: 10.1186/s13321-023-00716-w

Katharina Dost et al. J Cheminform. 2023.

Abstract

Background: Predicting the behavior of new chemical compounds in advance can support the design of new products by directing research toward the most promising candidates and ruling out others. Such predictive models can be data-driven, using Machine Learning, or based on researchers' experience, and they depend on the collection of past results. In either case, models (or researchers) can only make reliable assumptions about compounds that are similar to those they have seen before. Consistent use of these predictive models therefore shapes the dataset and causes a continuous specialization that shrinks the applicability domain of all models trained on this dataset in the future and increasingly harms model-based exploration of the space.

Proposed solution: In this paper, we propose CANCELS (CounterActiNg Compound spEciaLization biaS), a technique that helps to break the dataset specialization spiral. Aiming for a smooth distribution of the compounds in the dataset, we identify areas of the space that are underrepresented and suggest additional experiments that help bridge the gap. In doing so, we improve the dataset quality in an entirely unsupervised manner and create awareness of potential flaws in the data. CANCELS does not aim to cover the entire compound space and hence retains a desirable degree of specialization to a specified research domain.
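The idea of comparing a dataset against a smooth target distribution and suggesting experiments where it falls short can be illustrated with a deliberately simplified, one-dimensional sketch. It is not the authors' implementation: it assumes a single numeric descriptor per compound, fits a plain Gaussian from the sample mean and standard deviation, and uses a fixed histogram. The function names `bin_deficits` and `suggest` and the parameters `k` and `n_bins` are hypothetical.

```python
import math
import statistics


def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def bin_deficits(data, n_bins=10):
    """Compare a histogram of `data` with a Gaussian fitted to it.

    Returns (lo, width, deficits), where deficits[i] is the number of
    points the Gaussian expects in bin i beyond those observed (never
    negative). Assumption: a naive mean/std fit stands in for a
    bias-aware fitting procedure.
    """
    mean = statistics.fmean(data)
    std = statistics.stdev(data)
    lo = mean - 3.0 * std              # only look within +/- 3 std
    width = 6.0 * std / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    observed = [0] * n_bins
    for x in data:
        idx = math.floor((x - lo) / width)
        if 0 <= idx < n_bins:
            observed[idx] += 1
    deficits = []
    for i in range(n_bins):
        mass = gaussian_cdf(edges[i + 1], mean, std) - gaussian_cdf(edges[i], mean, std)
        deficits.append(max(0.0, len(data) * mass - observed[i]))
    return lo, width, deficits


def suggest(data, pool, k=3, n_bins=10):
    """Pick up to k pool candidates that fall into the most underfilled bins."""
    lo, width, deficits = bin_deficits(data, n_bins)
    scored = []
    for x in pool:
        idx = math.floor((x - lo) / width)
        # Candidates outside the fitted range are ignored, so the
        # selection stays within the dataset's own domain.
        if 0 <= idx < n_bins and deficits[idx] > 0.0:
            scored.append((deficits[idx], x))
    scored.sort(reverse=True)
    return [x for _, x in scored[:k]]


if __name__ == "__main__":
    # Toy descriptor values with a hole between 0.5 and 1.2.
    data = [i / 10 for i in range(-20, 21) if not 5 <= i <= 12]
    print(suggest(data, pool=[-1.0, 0.0, 0.9, 1.0, 3.0], k=2))
```

Restricting the histogram to three standard deviations around the mean mirrors, in a crude way, the stated goal of not covering the entire compound space: candidates far from the existing data are never suggested.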

Results: An extensive set of experiments on the use case of biodegradation pathway prediction reveals not only that the bias spiral can indeed be observed but also that CANCELS produces meaningful results. Additionally, we demonstrate that mitigating the observed bias is crucial: it not only interrupts the continuous specialization process but also significantly improves a predictor's performance while reducing the number of required experiments. Overall, we believe that CANCELS can support researchers in their experimentation process, helping them not only to better understand their data and its potential flaws but also to grow the dataset in a sustainable way. All code is available at github.com/KatDost/Cancels.

Keywords: Bias; Chemical compound space; Data quality; Machine learning.

Conflict of interest statement

Jörg Wicker (co-founder, CTO) and Katharina Dost are employees of enviPath UG & Co. KG, a scientific software development company that develops and maintains the enviPath system. The authors declare no competing interests.

Figures

**Fig. 1**
Overview of the Imitate algorithm

**Fig. 2**
Comparison of the Gaussians fitted to a biased dataset (left) when using the traditional Expectation-Maximization fitting (center) and the fitting procedure outlined in the Imitate algorithm (right)

**Fig. 4**
Comparison of different weights for Imitate with a custom boundary

**Fig. 5**
Qualitative dataset development for SOIL and BBD root compounds in relation to the compound space represented by the PubChem dataset and visualized in the PCA spaces obtained from SOIL (top), BBD (center), and PubChem (bottom). In all three datasets, white represents the highest density

**Fig. 6**
Quantitative development of SOIL and BBD root compounds in terms of the compounds’ average distance to their center (top) and their dataset size (bottom)

**Fig. 7**
Potential biases detected by CANCELS for SOIL (top) and BBD (bottom) visualized in their respective PCA spaces against the PubChem compound space

**Fig. 8**
Qualitative evaluation of the top 20 and top 50 compounds suggested by CANCELS to mitigate the detected biases in SOIL (top) and BBD (bottom) in comparison to the respective dataset’s compounds and the “Agrochemical” subset of PubChem. Note that categories are non-exclusive

**Fig. 9**
While holding out $x\%$ of the SOIL (top) and BBD (bottom) datasets, we train CANCELS on the rest. Bar heights represent average scores of the holdout set with their corresponding uncertainty intervals (black lines)

**Fig. 10**
Dividing the Tox21 dataset into a training set, a pool, and a test set, we train a classifier on the training set only, on the training set together with the entire pool, on the training set plus a CANCELS-based compound selection, or on the training set plus a selection that feeds the bias instead of mitigating it. The box plot (left) displays the results in terms of accuracy when evaluating the trained models on the test set. A confidence interval plot (right) indicates that compound selection using CANCELS is significantly better than all other options

**Fig. 11**
Influence of different compound representations on the performance of CANCELS

**Fig. 12**
Influence of the number of principal components used in the dimensionality reduction of CANCELS

**Fig. 13**
Iterative application of CANCELS and all competing baselines (see Fig. 10) on the Tox21 dataset: in each of the five iterations, compounds are selected based on the training set and the compounds selected in previous iterations. With CANCELS, the accuracy improves upon that of all other selection strategies

**Fig. 14**
Number of added compounds in an iterative application of CANCELS
