Cancer Epidemiol Biomarkers Prev. 2023 Mar 6;32(3):344-352.
doi: 10.1158/1055-9965.EPI-22-0875.

Elucidating Analytic Bias Due to Informative Cohort Entry in Cancer Clinico-genomic Datasets

Kenneth L Kehl et al. Cancer Epidemiol Biomarkers Prev.

Abstract

Background: Oncologists often order genomic testing to inform treatment for worsening cancer. The resulting correlation between genomic testing timing and prognosis, or "informative entry," can bias observational clinico-genomic research. The efficacy of existing approaches to this problem in clinico-genomic cohorts is poorly understood.

Methods: We simulated clinico-genomic cohorts followed from an index date to death. Subgroups in each cohort who underwent genomic testing before death were "observed." We varied data generation parameters under four scenarios: (i) independent testing and survival times; (ii) correlated testing and survival times for all patients; (iii) correlated testing and survival times for a subset of patients; and (iv) testing and mortality exclusively following progression events. We examined the behavior of conditional Kendall tau (Tc) statistics, Cox entry time coefficients, and biases in overall survival (OS) estimation and biomarker inference across scenarios.
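As a rough illustration of scenarios (i) and (ii) above, the following sketch generates a "true" cohort and its "observed" subset. The exponential survival distribution, the parameter values, and the function name `simulate_cohort` are assumptions made for this example and are not taken from the paper.

```python
import random


def simulate_cohort(n, scenario, seed=0, mean_survival=24.0,
                    max_test_time=48.0, test_sd=6.0):
    """Return (true_cohort, observed_cohort) as lists of
    (test_time, death_time) tuples, in months from the index date.

    All distributions and default values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    true_cohort = []
    for _ in range(n):
        # Assumed: exponentially distributed survival from the index date.
        death = rng.expovariate(1.0 / mean_survival)
        if scenario == 1:
            # Scenario (i): testing time independent of survival time.
            test = rng.uniform(0.0, max_test_time)
        elif scenario == 2:
            # Scenario (ii): testing time normally distributed around the
            # survival time; times before the index date are set to 0.
            test = max(0.0, rng.gauss(death, test_sd))
        else:
            raise ValueError("only scenarios 1 and 2 are sketched here")
        true_cohort.append((test, death))
    # A patient is "observed" only if genomic testing precedes death.
    observed = [(t, d) for t, d in true_cohort if t < d]
    return true_cohort, observed


if __name__ == "__main__":
    for scenario in (1, 2):
        full, seen = simulate_cohort(1000, scenario, seed=42)
        print(f"scenario {scenario}: {len(seen)} of {len(full)} patients observed")
```

Patients whose simulated testing time falls after death never enter the observed cohort, which is the selection mechanism that the abstract calls informative entry when testing and survival times are correlated.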

Results: Scenario #1 yielded null Tc and Cox entry time coefficients and unbiased OS inference. Scenario #2 yielded positive Tc, negative Cox entry time coefficients, underestimated OS, and biomarker associations biased toward the null. Scenario #3 yielded negative Tc, positive Cox entry time coefficients, and underestimated OS, but biomarker estimates were less biased. Scenario #4 yielded null Tc and Cox entry time coefficients, underestimated OS, and biased biomarker estimates. Transformation and copula modeling did not provide unbiased results.
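For readers unfamiliar with the Tc statistic, a minimal sketch of a conditional Kendall tau for left-truncated data without censoring follows. It uses the usual comparable-pair definition (a pair counts only if the later entry time does not exceed the earlier death time); the function name and the toy data are assumptions for illustration, and the paper's own implementation may differ.

```python
from itertools import combinations


def conditional_kendall_tau(pairs):
    """Conditional Kendall tau for (entry_time, death_time) pairs.

    Assumes no censoring. Only "comparable" pairs are used, meaning
    max(entry_i, entry_j) <= min(death_i, death_j). Returns None when
    no pair is comparable.
    """
    score = 0
    comparable = 0
    for (l1, t1), (l2, t2) in combinations(pairs, 2):
        if max(l1, l2) <= min(t1, t2):
            comparable += 1
            product = (l1 - l2) * (t1 - t2)
            # +1 for a concordant pair, -1 for discordant, 0 for a tie.
            score += (product > 0) - (product < 0)
    if comparable == 0:
        return None
    return score / comparable


if __name__ == "__main__":
    # Toy data: later entry goes with later death, so every pair is concordant.
    print(conditional_kendall_tau([(1, 5), (2, 6), (3, 7)]))
```

A value near zero is consistent with quasi-independence of entry and survival times, while positive or negative values indicate dependence, as in scenarios #2 and #3.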

Conclusions: Approaches to informative clinico-genomic cohort entry, including Tc and Cox entry time statistics, are sensitive to heterogeneity in genotyping and survival time distributions.

Impact: Novel methods are needed for unbiased inference using observational clinico-genomic data.

Conflict of interest statement

Drs. Kehl, Schrag, and Panageas, Ms. Brown, and Ms. Lavery report receiving research funding from the AACR Project GENIE Biopharmaceutical Consortium, which involves derivation of clinico-genomic cohorts relevant to the submitted work. The other authors declare no potential conflicts of interest.

Figures

Figure 1:
Diagram depicting informative left truncation. Figure 1A: Example disease trajectory for a patient from a full “true” cohort in which next-generation sequencing (NGS) is not required for inclusion, and all time after diagnosis is at risk for mortality. Figure 1B: Example disease trajectory for a patient from the “observed” subset in which NGS is required for inclusion and may be performed for worsening clinical status. Here, the time prior to NGS is “immortal,” because patients who died without NGS would not have been observed, but the time after NGS is a period of excess mortality risk compared to the full disease trajectory, because NGS was performed for worsening clinical status.
Figure 2:
Results for scenario #1, corresponding to uniformly distributed genomic testing times without correlation with survival times. Inference results refer to whether a biomarker association in the correct direction was detected at two-sided p < 0.05 in the observed cohort. (a) Observed biomarker inference results as a function of number of patients observed and calculated biomarker effect in a random sample of the full (true) cohort equal in size to the observed cohort; (b) observed biomarker hazard ratio versus biomarker hazard ratio measured in the true cohort; (c) distribution of Kendall’s tau statistics among simulated observed cohorts; (d) distribution of Cox coefficients for genomic testing time across simulated observed cohorts; (e) distribution of observed median survival time minus true median survival time across cohorts (simple risk set adjustment); (f) distribution of observed log hazard ratio for biomarker effect from Cox models with risk set adjustment minus log hazard ratio for biomarker effect measured in the true cohort across simulated cohorts; (g) distribution of observed log hazard ratio for biomarker effect from Cox models with risk set adjustment and entry time as a covariate minus log hazard ratio for biomarker effect measured in the true cohort across simulations.
Figure 3:
Results for scenario #2, corresponding to testing times normally distributed around survival times for all patients. Inference results refer to whether a biomarker association in the correct direction was detected at two-sided p < 0.05 in the observed cohort. (a) Observed biomarker inference results as a function of number of patients observed and calculated biomarker effect in a random sample of the full (true) cohort equal in size to the observed cohort; (b) observed biomarker hazard ratio versus biomarker hazard ratio measured in the true cohort; (c) distribution of conditional Kendall’s tau statistics among simulated observed cohorts; (d) distribution of Cox coefficients for entry time among simulated observed cohorts; (e) distribution of observed median survival times minus true median survival times among simulated cohorts (simple risk set adjustment); (f) distribution of observed log hazard ratio for biomarker effect from Cox models minus log hazard ratio for biomarker effect measured in the true cohort across simulations (simple risk set adjustment); (g) distribution of observed median survival times minus true median survival times among simulated cohorts (transformation modeling); (h) distribution of observed median survival times minus true median survival times among simulated cohorts (copula modeling); (i) distribution of observed log hazard ratio for biomarker effect from Cox models with risk set adjustment and entry time as a covariate minus log hazard ratio for biomarker effect measured in the true cohort across simulations.
Figure 4:
Results for scenario #3, corresponding to cohorts in which one subgroup within each cohort undergoes early genomic testing independent of survival time, while a second subgroup exhibits testing normally distributed around survival times. Inference results refer to whether a biomarker association in the correct direction was detected at two-sided p < 0.05 in the observed cohort. (a) Observed biomarker inference results as a function of number of patients observed and calculated biomarker effect in a random sample of the full (true) cohort equal in size to the observed cohort; (b) observed biomarker hazard ratio versus biomarker hazard ratio measured in the true cohort; (c) distribution of conditional Kendall’s tau statistics among simulated observed cohorts as a function of the proportion of patients genotyped “early,” or independently of clinical risk; (d) distribution of Kendall’s tau statistics among simulated observed cohorts; (e) distribution of Cox coefficients for genomic testing time among simulated observed cohorts; (f) distribution of observed median survival times minus true median survival times among simulated cohorts (simple risk set adjustment); (g) distribution of observed log hazard ratio for biomarker effect from Cox models minus log hazard ratio for biomarker effect measured in the true cohort across simulations (simple risk set adjustment); (h) distribution of observed median survival times minus true median survival times among simulated cohorts (transformation modeling); (i) distribution of observed median survival times minus true median survival times among simulated cohorts (copula modeling); (j) distribution of observed log hazard ratio for biomarker effect from Cox models with risk set adjustment and entry time as a covariate minus log hazard ratio for biomarker effect measured in the true cohort across simulations.
Figure 5:
Results for scenario #4, corresponding to cohorts in which genomic testing events and survival events are each exponentially distributed following progression events. Inference results refer to whether a biomarker association in the correct direction was detected at two-sided p < 0.05 in the observed cohort. (a) Observed biomarker inference results as a function of number of patients observed and calculated biomarker effect in a random sample of the full (true) cohort equal in size to the observed cohort; (b) observed biomarker hazard ratio versus biomarker hazard ratio measured in the true cohort; (c) distribution of Kendall’s tau statistics among simulated observed cohorts; (d) distribution of Cox coefficients for entry time among simulated observed cohorts; (e) distribution of observed median survival times minus true median survival times among simulated cohorts (simple risk set adjustment); (f) distribution of observed median survival times minus true median survival times among simulated cohorts (transformation modeling); (g) distribution of observed median survival times minus true median survival times among simulated cohorts (copula modeling); (h) distribution of observed log hazard ratio for biomarker effect from Cox models with risk set adjustment minus log hazard ratio for biomarker effect measured in the true cohort across simulations; (i) distribution of observed log hazard ratio for biomarker effect from Cox models with risk set adjustment and entry time as a covariate minus log hazard ratio for biomarker effect measured in the true cohort across simulations.
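The "simple risk set adjustment" named in the captions above can be pictured as a Kaplan-Meier estimate in which each patient joins the risk set only after cohort entry. The sketch below is one assumed way to compute a median survival time under that adjustment, with no censoring; the function name `km_median` and the toy records are hypothetical and not taken from the paper.

```python
from fractions import Fraction


def km_median(records, risk_set_adjusted=True):
    """Median survival from (entry_time, death_time) records, no censoring.

    With risk_set_adjusted=True, a patient is at risk at time t only if
    entry_time < t <= death_time (delayed entry). With False, every patient
    is treated as at risk from time 0, which counts the "immortal" time
    before entry as survival experience.
    Returns the first death time at which estimated survival is <= 1/2.
    """
    survival = Fraction(1)  # exact arithmetic avoids rounding at the 0.5 cutoff
    for t in sorted({death for _, death in records}):
        if risk_set_adjusted:
            at_risk = sum(1 for entry, death in records if entry < t <= death)
        else:
            at_risk = sum(1 for _, death in records if death >= t)
        deaths = sum(1 for _, death in records if death == t)
        if at_risk == 0:
            continue
        survival *= Fraction(at_risk - deaths, at_risk)
        if survival <= Fraction(1, 2):
            return t
    return None


if __name__ == "__main__":
    toy = [(0, 2), (1, 4), (3, 6), (5, 8)]  # hypothetical (entry, death) months
    print(km_median(toy, risk_set_adjusted=True))
    print(km_median(toy, risk_set_adjusted=False))
```

This adjustment assumes entry time is independent of survival time. When testing is triggered by worsening clinical status, as in Figure 1B, that assumption fails, which is the setting in which the abstract reports underestimated OS.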
