Observational Study
Stat Methods Med Res. 2023 Mar;32(3):539-554.
doi: 10.1177/09622802221146313. Epub 2022 Dec 26.

A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology

Nima S Hejazi et al. Stat Methods Med Res. 2023 Mar.

Abstract

The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational and high-dimensional biology. As the dimensionality of datasets continues to grow, so too does the complexity of identifying biomarkers linked to exposure patterns. The statistical analysis of such data often relies upon parametric modeling assumptions motivated by convenience, inviting opportunities for model misspecification. While estimation frameworks incorporating flexible, data adaptive regression strategies can mitigate this, their standard variance estimators are often unstable in high-dimensional settings, resulting in inflated Type-I error even after standard multiple testing corrections. We adapt a shrinkage approach compatible with parametric modeling strategies to semiparametric variance estimators of a family of efficient, asymptotically linear estimators of causal effects, defined by counterfactual exposure contrasts. Augmenting the inferential stability of these estimators in high-dimensional settings yields a data adaptive approach for robustly uncovering stable causal associations, even when sample sizes are limited. Our generalized variance estimator is evaluated against appropriate alternatives in numerical experiments, and an open source R/Bioconductor package, biotmle, is introduced. The proposal is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking.
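
To make the shrinkage idea concrete, the following is a minimal Python sketch, not the biotmle implementation. It assumes each biomarker comes with a vector of estimated influence function values whose sample variance gives the standard variance estimate, and it shrinks those variances toward a common prior value in the style of moderated statistics. The function names, the fixed `prior_df`, and the use of the mean variance as the prior value are illustrative assumptions; empirical Bayes implementations estimate these hyperparameters from the data.

```python
import math
from statistics import mean, variance


def shrink_variances(sample_vars, df, prior_var, prior_df):
    """Moderated variance: a degrees-of-freedom-weighted average of each
    biomarker's own variance and a common prior variance."""
    return [
        (prior_df * prior_var + df * s2) / (prior_df + df)
        for s2 in sample_vars
    ]


def moderated_statistics(estimates, eif_values, prior_df=4.0):
    """Compute one test statistic per biomarker using shrunken variances.

    estimates  : point estimates of the effect, one per biomarker
    eif_values : one list of estimated influence function values per
                 biomarker, all of length n (assumed input)
    prior_df   : hypothetical fixed prior degrees of freedom
    """
    n = len(eif_values[0])
    sample_vars = [variance(row) for row in eif_values]
    # Assumption: use the average variance across biomarkers as the prior.
    prior_var = mean(sample_vars)
    shrunk = shrink_variances(sample_vars, n - 1, prior_var, prior_df)
    # Standard error of an asymptotically linear estimator: sqrt(var / n).
    return [est / math.sqrt(s2 / n) for est, s2 in zip(estimates, shrunk)]


# Toy usage: the first biomarker has a very small sample variance, which
# would produce an extreme unmoderated statistic; shrinkage tempers it.
toy_eif = [
    [0.1, -0.1, 0.1, -0.1, 0.0],
    [2.0, -2.0, 1.0, -1.0, 0.0],
]
toy_stats = moderated_statistics([0.5, 0.5], toy_eif)
```

In this sketch, a biomarker with an unusually small variance estimate is pulled toward the common value, which limits the inflated test statistics that unstable variance estimates can produce when sample sizes are small.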

Keywords: Variance shrinkage; causal machine learning; differential expression; differential methylation; efficient estimation; nonparametric inference; semiparametric estimation.

Conflict of interest statement

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1.
Control of the False Discovery Rate (FDR) across hypothesis testing procedures in a setting with strong exposure effect in 10% of biomarkers and no positivity issues in the exposure mechanism. Upper panel: Control of the FDR using the Benjamini-Hochberg correction. Lower panel: Empirical distributions of false discovery proportions and negative predictive values, as well as of the true positive and true negative rates.
Figure 2.
Control of the False Discovery Rate (FDR) across hypothesis testing procedures in a setting with strong exposure effect in 10% of biomarkers and notable positivity issues in the exposure mechanism. Upper panel: Control of the FDR using the Benjamini-Hochberg correction. Lower panel: Empirical distributions of false discovery proportions and negative predictive values, as well as of the true positive and true negative rates.
Figure 3.
Control of the False Discovery Rate (FDR) across hypothesis testing procedures in a setting with strong exposure effect in 30% of biomarkers and no positivity issues in the exposure mechanism. Upper panel: Control of the FDR using the Benjamini-Hochberg correction. Lower panel: Empirical distributions of false discovery proportions and negative predictive values, as well as of the true positive and true negative rates.
Figure 4.
Control of the False Discovery Rate (FDR) across hypothesis testing procedures in a setting with strong exposure effect in 30% of biomarkers and notable positivity issues in the exposure mechanism. Upper panel: Control of the FDR using the Benjamini-Hochberg correction. Lower panel: Empirical distributions of false discovery proportions and negative predictive values, as well as of the true positive and true negative rates.
Figure 5.
Evaluation of the top 30 differentially methylated CpGs (ordered left to right) from the complete analysis in terms of the median log10(adjusted p-value) across the three subsampling schemes.
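
The captions for Figures 1 to 4 refer to the Benjamini-Hochberg correction for controlling the FDR. As a rough illustration of that step-up procedure, here is a minimal Python sketch that converts raw p-values into adjusted p-values; the function name `benjamini_hochberg` is chosen for illustration and is not taken from the article or its software.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy usage with hypothetical p-values.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

A biomarker would then be declared a discovery at FDR level q when its adjusted p-value is at most q.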
