The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models

Yuqing Zhang et al. Biostatistics. 2020 Apr 1;21(2):253-268.
doi: 10.1093/biostatistics/kxy044.

Abstract

Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: (i) imbalances in the prevalence of clinical and pathological covariates, (ii) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and (iii) differences in the "true" model that relates gene expression and clinical factors to outcome. We assess model accuracy while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure contribute very little to the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within-/across-study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
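
To make the CV/CSV contrast concrete, the following is a minimal sketch, not the authors' pipeline: the two study matrices, the binary outcome, and the logistic learner are hypothetical stand-ins for a genomic prediction model trained on one study and validated on another.

    # Minimal sketch of cross-validation (within study A) versus
    # cross-study validation (train on A, test on B). All data here
    # are synthetic stand-ins; any classifier could play this role.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X_A = rng.normal(size=(100, 50))      # study A: samples x genes
    y_A = rng.integers(0, 2, size=100)    # study A outcomes
    X_B = rng.normal(size=(80, 50))       # study B, same genes
    y_B = rng.integers(0, 2, size=80)

    model = LogisticRegression(max_iter=1000)

    cv_acc = cross_val_score(model, X_A, y_A, cv=5).mean()  # traditional CV
    csv_acc = model.fit(X_A, y_A).score(X_B, y_B)           # cross-study validation

    print(f"within-study CV accuracy: {cv_acc:.2f}")
    print(f"cross-study accuracy:     {csv_acc:.2f}")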

Keywords: Cross-study validation; Data heterogeneity; Genomic prediction models.


Figures

Fig. 1.
A schematic of our study. Simulation methods (using the breast cancer microarray studies as an example) are summarized in this flow chart.
Fig. 2.
Simulation results comparing the performance of cross-validation and cross-study validation. The “Baseline” scenario does not modify any source of heterogeneity. In “Rebalanced covariates”, we change the resampling probabilities to match the distribution of covariates across studies. “Filtered genes” considers only genes with high integrative correlation. “Same hazard” uses the same cumulative hazard but different coefficients, for simulations based on microarray studies with a time-to-event outcome. “Same models” uses the same data-generating models. The numbers within the boxes show the medians of the distributions. The sources of heterogeneity investigated in this work do not fully account for the loss of accuracy when comparing across- to within-study validation.
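
As an illustration of the “Rebalanced covariates” scenario, here is a hedged sketch (our own construction, not the paper's code) of resampling a source study so that a discrete covariate matches its prevalence in a target study:

    # Hypothetical sketch of covariate-rebalanced resampling: draw a
    # bootstrap sample from a source study so that a discrete covariate
    # (e.g. tumor grade) matches its prevalence in a target study.
    import numpy as np

    def rebalanced_indices(source_cov, target_cov, n, rng):
        levels, counts = np.unique(target_cov, return_counts=True)
        target_freq = dict(zip(levels, counts / counts.sum()))
        src_levels, src_counts = np.unique(source_cov, return_counts=True)
        src_freq = dict(zip(src_levels, src_counts / src_counts.sum()))
        # weight each source sample by target/source frequency of its level
        w = np.array([target_freq.get(c, 0.0) / src_freq[c] for c in source_cov])
        p = w / w.sum()
        return rng.choice(len(source_cov), size=n, replace=True, p=p)

    rng = np.random.default_rng(1)
    src = rng.choice(["low", "high"], size=200, p=[0.8, 0.2])
    tgt = rng.choice(["low", "high"], size=150, p=[0.5, 0.5])
    idx = rebalanced_indices(src, tgt, n=200, rng=rng)
    # the resampled covariate mix now approximates the target's 50/50 split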
Fig. 3.
Simulation results comparing CSV to CV for evaluating Más-o-menos risk prediction models in ovarian cancer microarray and RNA-seq studies. Colored boxes represent C-indices from within- and across-study validation when RNA-seq data are involved. White/grey boxes represent results using only microarray studies. Note that the microarray simulations in this figure differ from those in Figure 2: here the comparison is limited to the TCGA study, which contains only the 190 samples that overlap between the microarray and RNA-seq platforms.
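
The Más-o-menos approach, as we understand it, replaces fitted gene coefficients with the signs of their marginal associations; the sketch below pairs an illustrative version with a textbook C-index, with no claim to match the paper's exact implementation:

    # Illustrative Mas-o-menos style score (signs of marginal
    # associations) and a plain C-index; details are assumptions,
    # not the paper's exact procedure.
    import numpy as np

    def mas_o_menos_score(X_train, y_train, X_test):
        # +/-1 per gene: sign of its correlation with the outcome proxy
        signs = np.sign([np.corrcoef(X_train[:, j], y_train)[0, 1]
                         for j in range(X_train.shape[1])])
        return X_test @ signs / X_test.shape[1]

    def c_index(time, event, score):
        # fraction of comparable pairs whose risk ordering matches the
        # observed event ordering (ties in score count as 0.5)
        num = den = 0.0
        for i in range(len(time)):
            for j in range(len(time)):
                if event[i] and time[i] < time[j]:
                    den += 1
                    num += (score[i] > score[j]) + 0.5 * (score[i] == score[j])
        return num / den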
Fig. 4.
Average probability of survival for each dataset, and for all datasets combined, in breast cancer. We compute an expected survival function for each individual in every dataset using the true cumulative hazard and the linear predictor, then average these survival functions across patients within each dataset. Colored lines represent the average survival function in each original set; the black line shows the average survival function of all datasets combined. This figure illustrates the differences in the “true” model across the original studies.
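
In symbols, assuming the proportional hazards form suggested by the caption's “cumulative hazard” and “linear predictor” (notation ours, not the paper's): with true cumulative baseline hazard \Lambda_0, covariates x_i, and coefficient vector \beta, each individual's expected survival function and the per-dataset average are

    S_i(t) = \exp\{-\Lambda_0(t)\, e^{x_i^\top \beta}\}, \qquad
    \bar{S}(t) = \frac{1}{n} \sum_{i=1}^{n} S_i(t).

Each colored curve in the figure then corresponds to \bar{S}(t) computed within one study.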

