Leveraging multi-source data to resolve inconsistency across pharmacogenomic datasets in drug sensitivity prediction

Xiaodi Li¹, Trisha Das¹,², Kritib Bhattarai¹,³, Sivaraman Rajaganapathy¹, Vincent C Buchner¹,³, Yanshan Wang⁴, Chang Su⁵, Lichao Sun⁶, Liewei Wang⁷, James R Cerhan⁸, Nansu Zong¹

Affiliations

¹ Department of Artificial Intelligence and Informatics Research, Mayo Clinic, Rochester, MN, USA.
² University of Illinois Urbana-Champaign, Champaign, Illinois, United States.
³ Luther College, Decorah, Iowa, United States.
⁴ Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA.
⁵ Department of Health Service Administration and Policy, Temple University, Philadelphia, PA, USA.
⁶ Department of Computer Science & Engineering, Lehigh University, Bethlehem, PA, USA.
⁷ Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN.
⁸ Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA.

PMID: 41726490
PMCID: PMC12919631

Xiaodi Li et al. AMIA Annu Symp Proc. 2025 May 22;2024:744-753. eCollection 2024.

Abstract

Researchers have developed pharmacogenomic datasets for various purposes, such as biomarker identification, yet drug response prediction models often underperform because the datasets are inconsistent with one another. These inconsistencies arise from inter-tumoral heterogeneity, experimental conditions, and cell subtype complexity, and they limit model generalizability. To address this, we propose a computational model based on Aggregated Learning (AL) that enhances drug response prediction by learning from inconsistencies across multiple datasets. Our model minimizes discrepancies by training on overlapping inconsistent data points from three pharmacogenomic datasets: CCLE, GDSC2, and gCSI. Compared to four baseline methods, namely Selecting Better (SB), Result Average (RA), Combining Data (CD), and Model Average (MA), our approach achieved superior performance with lower Mean Absolute Error (MAE) scores: 0.090 (CCLE-GDSC), 0.096 (CCLE-gCSI), and 0.081 (GDSC-gCSI). These results demonstrate that addressing inconsistencies enhances prediction accuracy and generalizability, making our model a promising solution for robust drug response predictions.
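To make the notions of "overlapping inconsistent data points" and MAE concrete, the following is a minimal sketch rather than the authors' code. It assumes each dataset can be represented as a mapping from a (drug, cell line) pair to a sensitivity score; the pair names, the scores, and the helper names `overlapping_pairs` and `mean_absolute_error` are all hypothetical.

```python
from statistics import fmean


def mean_absolute_error(y_true, y_pred):
    """MAE = mean of |y_true_i - y_pred_i| over all samples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def overlapping_pairs(dataset_a, dataset_b):
    """Return the (drug, cell line) pairs measured in both datasets."""
    return sorted(dataset_a.keys() & dataset_b.keys())


# Hypothetical toy scores; real CCLE/GDSC values are not reproduced here.
ccle = {("drugA", "cell1"): 0.30, ("drugA", "cell2"): 0.55, ("drugB", "cell1"): 0.10}
gdsc = {("drugA", "cell1"): 0.40, ("drugB", "cell1"): 0.10, ("drugC", "cell3"): 0.80}

shared = overlapping_pairs(ccle, gdsc)

# MAE between the two datasets on their shared pairs quantifies how much
# they disagree about the same drug-cell line combinations.
disagreement = mean_absolute_error(
    [ccle[pair] for pair in shared],
    [gdsc[pair] for pair in shared],
)
print(shared, disagreement)
```

The same `mean_absolute_error` function applies unchanged when the second argument holds model predictions instead of a second dataset's scores.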

Figures

**Figure 1.**
The framework for the drug sensitivity prediction model. (1) We collect the drug, cell line, and gene sensitivity scores from three core datasets: GDSC, CCLE, and gCSI. (2) We preprocess these datasets and use them as the embedding input for the models. (3) We train the basic learning models $M_{BL}^{1}$ and $M_{BL}^{2}$ using all the data from Dataset 1 and Dataset 2, respectively. We also train the inconsistency aggregation models $M_{IA}^{1}$ and $M_{IA}^{2}$ using the overlapping data and the sensitivity scores generated by $M_{BL}^{2}$ and $M_{BL}^{1}$. For testing, we follow a similar process, but since we do not know the labels of the testing data, we output the average scores and calculate the MAE.
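The two-stage flow in this caption can be sketched as follows. This is an illustrative reading under strong assumptions, not the authors' implementation: each sample is reduced to a single hypothetical scalar feature in place of the embedding, every model is a one-variable least-squares line, and each inconsistency aggregation model is assumed to map the other dataset's basic-model score onto its own dataset's label for the overlapping samples. All names and values are invented for the example.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


# Hypothetical toy datasets: scalar feature -> sensitivity score.
dataset_1 = {0.0: 0.1, 1.0: 0.2, 2.0: 0.3, 3.0: 0.4}
dataset_2 = {2.0: 0.4, 3.0: 0.5, 4.0: 0.6, 5.0: 0.7}

# Stage 1: basic learning models, each trained on all of its own dataset.
bl_1 = fit_line(list(dataset_1), list(dataset_1.values()))
bl_2 = fit_line(list(dataset_2), list(dataset_2.values()))

# Overlapping samples appear in both datasets with inconsistent labels.
shared = sorted(dataset_1.keys() & dataset_2.keys())

# Stage 2 (assumed form): each aggregation model learns, on the overlap,
# to translate the other basic model's score into its own dataset's label.
ia_1 = fit_line([bl_2(x) for x in shared], [dataset_1[x] for x in shared])
ia_2 = fit_line([bl_1(x) for x in shared], [dataset_2[x] for x in shared])


def predict(x):
    """Average the two aggregated scores, as in the testing step."""
    return (ia_1(bl_2(x)) + ia_2(bl_1(x))) / 2


# Hypothetical held-out samples used only to compute the MAE.
held_out = {4.0: 0.55, 1.0: 0.25}
test_mae = fmean(abs(predict(x) - y) for x, y in held_out.items())
print(predict(2.0), test_mae)
```

In this toy setup the two datasets differ by a constant offset, so the averaged prediction falls midway between them; real pharmacogenomic inconsistencies would not be this regular.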

**Figure 2.**
Cross-validation results on common drug-cell line pairs.

**Figure 3.**
Cross-validation results on CCLE-GDSC with different numbers of non-overlapping samples (left to right). Please note that for gCSI-CCLE with changing gCSI, we cannot calculate 1.5:1 and 2:1 results due to a lack of enough inconsistent data points.

**Figure 4.**
MAE after feature selection for different datasets.

**Figure 5.**
MAE for different regression techniques.
