Biom J. 2024 Jan;66(1):e2200178.

doi: 10.1002/bimj.202200178. Epub 2023 Dec 10.

A new method for clustered survival data: Estimation of treatment effect heterogeneity and variable selection

Liangyuan Hu¹

PMID: 38072661
PMCID: PMC10953775
DOI: 10.1002/bimj.202200178

Liangyuan Hu. Biom J. 2024 Jan.

Affiliation

¹ Department of Biostatistics and Epidemiology, Rutgers University, Piscataway, New Jersey, USA.

Abstract

We recently developed a new method, the random-intercept accelerated failure time model with Bayesian additive regression trees (riAFT-BART), to draw causal inferences about the population treatment effect on patient survival from clustered and censored survival data while accounting for the multilevel data structure. The practical utility of this method goes beyond the estimation of the population average treatment effect. In this work, we show how riAFT-BART can be used to address two important statistical problems with clustered survival data: estimating treatment effect heterogeneity and selecting variables. Leveraging likelihood-based machine learning, we describe how to draw posterior samples of the individual survival treatment effect from riAFT-BART model runs, and how to use these samples in an exploratory treatment effect heterogeneity analysis to identify subpopulations whose treatment effects may differ from the population average effect. The literature on variable selection methods for clustered and censored survival data is sparse, particularly for methods that use flexible modeling techniques. For variable selection, we propose a permutation-based approach that uses each predictor's variable inclusion proportion supplied by the riAFT-BART model. To address the missing data issue frequently encountered in health databases, we propose a strategy that combines bootstrap imputation with riAFT-BART for variable selection with incomplete clustered survival data. We conduct an extensive simulation study to examine the practical operating characteristics of our proposed methods, and provide empirical evidence that they perform better than several existing methods across a wide range of data scenarios. Finally, we demonstrate the methods in a case study that identifies predictors of in-hospital mortality among patients with severe COVID-19 and estimates the heterogeneous treatment effects of three COVID-specific medications. The methods developed in this work are readily available in the R package riAFTBART.
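
The permutation-based selection idea can be illustrated with a small, self-contained sketch. It is not the riAFTBART implementation, and the abstract does not state the exact decision rule; the sketch assumes a per-predictor ("local") threshold, where a predictor is kept if its observed variable inclusion proportion (VIP) exceeds the 1 − α quantile of the VIPs obtained from model fits on outcome-permuted data. The function names, predictor names, and all numbers below are hypothetical, and the VIPs are supplied as plain inputs rather than computed from a fitted model.

```python
import math


def empirical_quantile(values, q):
    """Smallest sample value v such that at least a fraction q of values are <= v."""
    ordered = sorted(values)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]


def select_by_permutation_vip(observed_vip, null_vips, alpha=0.05):
    """Keep predictors whose observed VIP exceeds their own permutation-null quantile.

    observed_vip: dict mapping predictor name -> VIP from the fit on the real data.
    null_vips: list of dicts (one per permutation run) with the same keys,
               holding VIPs from fits on data with the outcome permuted.
    The per-predictor threshold is an assumed rule, not taken from the paper.
    """
    selected = []
    for name, vip in observed_vip.items():
        null_draws = [run[name] for run in null_vips]
        threshold = empirical_quantile(null_draws, 1 - alpha)
        if vip > threshold:
            selected.append(name)
    return selected


# Hypothetical VIPs: two informative predictors and two noise predictors.
observed_vip = {"age": 0.40, "wbc": 0.35, "noise_1": 0.13, "noise_2": 0.12}

# Hypothetical null distribution from 20 permutation runs (values 0.200 to 0.295).
null_vips = [
    {name: 0.20 + 0.005 * run for name in observed_vip}
    for run in range(20)
]

selected = select_by_permutation_vip(observed_vip, null_vips, alpha=0.05)
print(selected)
```

In this toy input only `age` and `wbc` clear their null thresholds. With incomplete data, one plausible extension (again an assumption) is to repeat the selection on each bootstrap-imputed data set and keep predictors selected in a sufficiently large share of them.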

Keywords: Bayesian machine learning; clustered survival observations; treatment effect heterogeneity; variable importance.

Conflict of interest statement

The author declares no conflicts of interest.

Figures

**Figure 1**
Relative biases (Panel A) and root-mean-squared-errors (RMSE) (Panel B) among 40 generalized propensity score subgroups under 6 data configurations: (heterogeneity settings a, b, c) × (proportional hazards (PH) and nonproportional hazards (nPH)) for each of four methods, IPW-riCox, DR-riAH, PEAMM and riAFT-BART. Three pairwise treatment effects were estimated by averaging the individual survival treatment effect (based on 3-week survival probability) across individuals in each subgroup. Each boxplot visualizes the distribution of relative biases or the distribution of RMSE for 40 subgroups, each averaged across 250 simulation runs.

**Figure 2**
The distribution, across 250 data replications, of the numbers of selected noise predictors and useful predictors for each of five methods: riAFT-BART, PEAMM, FrailtyHL, FrailtyPenal and riCox, with clustered survival data generated under both proportional hazards (PH) and non-proportional hazards (nPH). The total number of useful predictors is 8 and the total number of noise predictors is 20. There are K = 10 clusters, each with a size of 200; the total sample size is 2000. The overall proportion of missingness is 40%.

**Figure 3**
Power of each of five methods: riAFT-BART, PEAMM, FrailtyHL, FrailtyPenal and riCox, for selecting each of 8 useful predictors with clustered survival data generated under proportional hazards (PH) and non-proportional hazards (nPH), based on 250 data replications. There are K = 10 clusters, each with a size of 200; the total sample size is 2000. The overall proportion of missingness is 40%. Filled symbols represent the PH setting, and open symbols correspond to the nPH setting.

**Figure 4**
The distribution of cross-validated concordance statistics across 250 data replications for each of five methods using the COVID-19 dataset.

**Figure 5**
Final Random Forests model fit to the posterior mean of the individual survival treatment effect comparing remdesivir and dexamethasone + remdesivir. Values in each node correspond to the posterior mean, in terms of difference in log survival days, for the subgroup of individuals represented in that node. Uncertainty intervals were obtained by pooling the posterior samples arising from the multiple imputed data sets. WBC: White blood cell.
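
As a rough illustration of how a subgroup value and its uncertainty interval of the kind described for Figure 5 might be computed, the sketch below pools posterior draws of the individual survival treatment effect across imputed data sets and summarizes them for one subgroup. It assumes that pooling means simple concatenation of draws and that the interval is an equal-tailed percentile interval; neither detail is stated in the caption. The function names and all numbers are hypothetical.

```python
import statistics


def nearest_rank_quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted list."""
    last = len(sorted_values) - 1
    idx = min(last, max(0, round(q * last)))
    return sorted_values[idx]


def pooled_subgroup_effect(draws_by_imputation, subgroup, level=0.95):
    """Summarize the subgroup-average treatment effect over pooled posterior draws.

    draws_by_imputation: one entry per imputed data set; each entry is a list of
        posterior draws; each draw is a list of individual effects
        (difference in log survival time between two treatments).
    subgroup: indices of the individuals that form the subgroup.
    """
    pooled = []
    for draws in draws_by_imputation:
        for draw in draws:
            # Average the individual effects within the subgroup for this draw.
            pooled.append(statistics.fmean(draw[i] for i in subgroup))
    pooled.sort()
    tail = (1 - level) / 2
    return {
        "mean": statistics.fmean(pooled),
        "lower": nearest_rank_quantile(pooled, tail),
        "upper": nearest_rank_quantile(pooled, 1 - tail),
    }


# Hypothetical draws: 2 imputed data sets x 3 posterior draws x 4 individuals.
imputation_1 = [[0.1, 0.2, 0.5, 0.6], [0.0, 0.2, 0.4, 0.6], [0.2, 0.2, 0.6, 0.6]]
imputation_2 = [[0.1, 0.1, 0.5, 0.7], [0.1, 0.3, 0.5, 0.5], [0.0, 0.2, 0.6, 0.6]]

# Summarize the subgroup made up of the last two individuals.
summary = pooled_subgroup_effect([imputation_1, imputation_2], subgroup=[2, 3])
print(summary)
```

A real analysis would use many more posterior draws per imputed data set; with only six pooled draws, as here, the interval endpoints are simply the smallest and largest subgroup averages.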

Cited by

Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series.
Hu L, Li L. Int J Environ Res Public Health. 2022 Dec 1;19(23):16080. doi: 10.3390/ijerph192316080. PMID: 36498153. Free PMC article. Review.
A Flexible Approach for Assessing Heterogeneity of Causal Treatment Effects on Patient Survival Using Large Datasets with Clustered Observations.
Hu L, Ji J, Liu H, Ennis R. Int J Environ Res Public Health. 2022 Nov 12;19(22):14903. doi: 10.3390/ijerph192214903. PMID: 36429621. Free PMC article.
