The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models

Yuqing Zhang et al. Biostatistics. 2020 Apr 1;21(2):253-268.
doi: 10.1093/biostatistics/kxy044.

Abstract

Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: (i) imbalances in the prevalence of clinical and pathological covariates, (ii) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and (iii) differences in the "true" model that relates gene expression and clinical factors to outcome. We assess model accuracy while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure contribute very little to the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within-/across-study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
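
To make the CV/CSV contrast concrete, the following is a minimal sketch, not the authors' pipeline: the two study matrices, the binary outcome, and the logistic learner are hypothetical stand-ins for a genomic prediction model trained on one study and validated on another.

    # Minimal sketch of cross-validation (within study A) versus
    # cross-study validation (train on A, test on B). All data here
    # are synthetic stand-ins; any classifier could play this role.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X_A = rng.normal(size=(100, 50))      # study A: samples x genes
    y_A = rng.integers(0, 2, size=100)    # study A outcomes
    X_B = rng.normal(size=(80, 50))       # study B, same genes
    y_B = rng.integers(0, 2, size=80)

    model = LogisticRegression(max_iter=1000)

    cv_acc = cross_val_score(model, X_A, y_A, cv=5).mean()  # traditional CV
    csv_acc = model.fit(X_A, y_A).score(X_B, y_B)           # cross-study validation

    print(f"within-study CV accuracy: {cv_acc:.2f}")
    print(f"cross-study accuracy:     {csv_acc:.2f}")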

Keywords: Cross-study validation; Data heterogeneity; Genomic prediction models.


Figures

Fig. 1.
A schematic of our study. Simulation methods (using the breast cancer microarray studies as an example) are summarized in this flow chart.
Fig. 2.
Simulation results comparing the performance of cross-validation and cross-study validation. The “Baseline” scenario does not modify any source of heterogeneity. In “Rebalanced covariates”, we change the resampling probabilities to match the distribution of covariates across studies. “Filtered genes” considers only genes with high integrative correlation. “Same hazard” uses the same cumulative hazard but different coefficients, for simulations based on microarray studies with a time-to-event outcome. “Same models” uses the same data-generating models. The numbers within the boxes show the medians of the distributions. The sources of heterogeneity investigated in this work do not fully account for the loss of accuracy when comparing across- to within-study validation.
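
As an illustration of the “Rebalanced covariates” scenario, here is a hedged sketch (our own construction, not the paper's code) of resampling a source study so that a discrete covariate matches its prevalence in a target study:

    # Hypothetical sketch of covariate-rebalanced resampling: draw a
    # bootstrap sample from a source study so that a discrete covariate
    # (e.g. tumor grade) matches its prevalence in a target study.
    import numpy as np

    def rebalanced_indices(source_cov, target_cov, n, rng):
        levels, counts = np.unique(target_cov, return_counts=True)
        target_freq = dict(zip(levels, counts / counts.sum()))
        src_levels, src_counts = np.unique(source_cov, return_counts=True)
        src_freq = dict(zip(src_levels, src_counts / src_counts.sum()))
        # weight each source sample by target/source frequency of its level
        w = np.array([target_freq.get(c, 0.0) / src_freq[c] for c in source_cov])
        p = w / w.sum()
        return rng.choice(len(source_cov), size=n, replace=True, p=p)

    rng = np.random.default_rng(1)
    src = rng.choice(["low", "high"], size=200, p=[0.8, 0.2])
    tgt = rng.choice(["low", "high"], size=150, p=[0.5, 0.5])
    idx = rebalanced_indices(src, tgt, n=200, rng=rng)
    # the resampled covariate mix now approximates the target's 50/50 split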
Fig. 3.
Simulation results comparing CSV to CV for evaluating Más-o-menos risk prediction models in ovarian cancer microarray and RNA-seq studies. Colored boxes represent C-indices from within- and across-study validation when RNA-seq data are involved. White/grey boxes represent results using only microarray studies. Note that the microarray simulations in this figure differ from those in Figure 2: here the comparison is limited to the TCGA study, which contains only the 190 samples that overlap between the microarray and RNA-seq platforms.
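
The Más-o-menos approach, as we understand it, replaces fitted gene coefficients with the signs of their marginal associations; the sketch below pairs an illustrative version with a textbook C-index, with no claim to match the paper's exact implementation:

    # Illustrative Mas-o-menos style score (signs of marginal
    # associations) and a plain C-index; details are assumptions,
    # not the paper's exact procedure.
    import numpy as np

    def mas_o_menos_score(X_train, y_train, X_test):
        # +/-1 per gene: sign of its correlation with the outcome proxy
        signs = np.sign([np.corrcoef(X_train[:, j], y_train)[0, 1]
                         for j in range(X_train.shape[1])])
        return X_test @ signs / X_test.shape[1]

    def c_index(time, event, score):
        # fraction of comparable pairs whose risk ordering matches the
        # observed event ordering (ties in score count as 0.5)
        num = den = 0.0
        for i in range(len(time)):
            for j in range(len(time)):
                if event[i] and time[i] < time[j]:
                    den += 1
                    num += (score[i] > score[j]) + 0.5 * (score[i] == score[j])
        return num / den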
Fig. 4.
Average probability of survival for each dataset, and for all datasets combined, in breast cancer. We compute an expected survival function for each individual in every dataset using the true cumulative hazard and the linear predictor, then average these survival functions across patients within each dataset. Colored lines represent the average survival function in each original set; the black line shows the average survival function of all datasets combined. This figure illustrates the differences in the “true” model across the original studies.
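
In symbols, assuming the proportional hazards form suggested by the caption's “cumulative hazard” and “linear predictor” (notation ours, not the paper's): with true cumulative baseline hazard \Lambda_0, covariates x_i, and coefficient vector \beta, each individual's expected survival function and the per-dataset average are

    S_i(t) = \exp\{-\Lambda_0(t)\, e^{x_i^\top \beta}\}, \qquad
    \bar{S}(t) = \frac{1}{n} \sum_{i=1}^{n} S_i(t).

Each colored curve in the figure then corresponds to \bar{S}(t) computed within one study.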

