Sparse prediction informed by genetic annotations using the logit normal prior for Bayesian regression tree ensembles

Charles Spanbauer¹, Wei Pan¹; ADNI, The Alzheimer's Disease Neuroimaging Initiative¹

PMID: 36349692
PMCID: PMC9892284
DOI: 10.1002/gepi.22505

Charles Spanbauer et al. Genet Epidemiol. 2023 Feb.

Genet Epidemiol. 2023 Feb;47(1):26-44.

doi: 10.1002/gepi.22505. Epub 2022 Nov 9.

Affiliation

¹ Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA.

Abstract

Using high-dimensional genetic variants such as single nucleotide polymorphisms (SNPs) to predict complex diseases and traits has important applications in basic research and in clinical settings. For example, predicting gene expression is a necessary first step in identifying (putative) causal genes in transcriptome-wide association studies. Building such a prediction model is challenging because of weak signals, high dimensionality, and linkage disequilibrium (correlation) among SNPs. However, functional annotations at the SNP level (e.g., epigenomic data across multiple cell or tissue types) are available and could be used to inform predictor importance and aid in outcome prediction. Existing approaches to incorporating annotations have been based mainly on (generalized) linear models. Bayesian additive regression trees (BART), in contrast, is a reliable method for obtaining high-quality nonlinear out-of-sample predictions without overfitting. Unfortunately, the default BART prior may be too inflexible to handle sparse situations in which the number of predictors approaches or surpasses the number of observations. Motivated by our real data application, this article proposes an alternative prior based on the logit normal distribution, which provides a framework that is adaptive to sparsity and can model informative functional annotations. It also provides a framework for incorporating prior information about between-SNP correlations. Computational details for carrying out inference are presented, along with the results from a simulation study and a genome-wide prediction analysis of the Alzheimer's Disease Neuroimaging Initiative data.

Keywords: ensemble learning; genetics; high-dimensional prediction; sparsity.

Figures

**Figure 1**
Depiction of the workflow for performing a genome‐wide scan using BART while also incorporating the functional annotations. Note how the outcome $Y$ affects the terminal node values, while the predictors and annotations, $X$ and $A$, affect the splitting rules for the interior nodes.

**Figure 2**
Toy example depicting a regression tree ensemble with $H = 3$ along with the resulting partition of the predictor space. The numbers on the right are the values in each partition of the predictor space after adding the appropriate terminal node values on the left. The horizontal lines at 0.25 and 0.75 represent the partitions in $x_{2}$, while the vertical lines at 0.3, 0.5, and 0.7 represent the partitions in $x_{1}$. This demonstrates how regression tree ensembles can reliably estimate nonlinearity and interaction.
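The sum-of-trees idea in this caption can be sketched in a few lines of Python. This is only an illustration: the split points (0.3, 0.5, 0.7 in $x_{1}$; 0.25, 0.75 in $x_{2}$) come from the caption, but the tree shapes and every terminal node value below are hypothetical, since the figure itself is not reproduced here.

```python
# Hypothetical ensemble of H = 3 regression trees on (x1, x2) in [0, 1]^2.
# Split points follow the caption; tree shapes and leaf values are made up.

def tree_1(x1: float, x2: float) -> float:
    # Single split on x1 at 0.3.
    return -1.0 if x1 < 0.3 else 1.0

def tree_2(x1: float, x2: float) -> float:
    # Split on x1 at 0.5; the right branch splits again on x2 at 0.75.
    if x1 < 0.5:
        return 0.5
    return 2.0 if x2 < 0.75 else -0.5

def tree_3(x1: float, x2: float) -> float:
    # Split on x1 at 0.7; the left branch splits again on x2 at 0.25.
    if x1 < 0.7:
        return 0.0 if x2 < 0.25 else 1.5
    return -2.0

ENSEMBLE = [tree_1, tree_2, tree_3]

def predict(x1: float, x2: float) -> float:
    # The ensemble prediction is the sum of one terminal node value per tree.
    return sum(tree(x1, x2) for tree in ENSEMBLE)
```

Because the second and third trees split on $x_{2}$ only within part of the $x_{1}$ range, the summed surface depends on $x_{2}$ differently in different regions of $x_{1}$, which is how an additive ensemble of small trees can represent an interaction.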

**Figure 3**
Contours for the pdf of a normally distributed bivariate random variable $\psi$ in the top row and the corresponding contours from the pdf of a three‐category logit normal random variable $s$ in the bottom row. For all five columns, the mean of the normally distributed variables is $(0.5, 0)$. The variances for the first three columns are 1, 4, and 0.25, respectively, and the two elements of $\psi$ are independent. For the fourth and fifth columns, the correlation between the two elements of $\psi$ is 0.5 and $-0.5$, respectively, with a variance of 1 for both.
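The relationship between $\psi$ and $s$ can be sketched as follows. The caption does not state the exact transform, so this assumes the common additive logistic form, in which the third category serves as a reference with its latent value fixed at 0; the function names are hypothetical. The mean, variance, and correlation arguments mirror the settings described for the five columns.

```python
import math
import random

def sample_psi(mean, var, rho, rng):
    # Draw a bivariate normal with common variance `var` and correlation `rho`
    # by combining two independent standard normal draws.
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    sd = math.sqrt(var)
    psi1 = mean[0] + sd * z1
    psi2 = mean[1] + sd * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return psi1, psi2

def to_simplex(psi):
    # Assumed transform: append 0 for the reference category, then apply a
    # numerically stable softmax so the three entries are positive and sum to 1.
    z = list(psi) + [0.0]
    m = max(z)
    w = [math.exp(v - m) for v in z]
    total = sum(w)
    return [v / total for v in w]

def sample_logit_normal(n, mean=(0.5, 0.0), var=1.0, rho=0.0, seed=0):
    # Each draw is a point s on the three-category probability simplex.
    rng = random.Random(seed)
    return [to_simplex(sample_psi(mean, var, rho, rng)) for _ in range(n)]
```

For example, `sample_logit_normal(1000, var=4.0)` corresponds to the second column's setting and `sample_logit_normal(1000, rho=-0.5)` to the fifth. Under this transform, a larger variance in $\psi$ pushes more of the mass of $s$ toward the corners of the simplex, which is the kind of behavior a sparsity-adaptive prior needs.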

**Figure 4**
The ELPD difference for each model comparison is visualized in this figure, with a higher ELPD difference indicating a preference for the more complicated model. The top‐left panel gives the results for the standard BART versus NULL comparison, the top‐right panel gives the results for the LN‐A versus LN‐0 prior, the bottom‐right panel gives the results for the LN‐0 prior versus BART, and finally the bottom‐left panel gives the results for DART versus BART.
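As a rough illustration of the quantity being compared, the sketch below computes an ELPD (expected log pointwise predictive density) from posterior draws and takes the difference between two models. This page does not say how the paper estimates ELPD, so the held-out-data formulation and all function names here are assumptions.

```python
import math

def log_mean_exp(values):
    # Numerically stable log of the average of exp(values).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def elpd(log_lik):
    # log_lik[s][i] is the log predictive density of held-out observation i
    # under posterior draw s. Average over draws on the density scale, then
    # sum the resulting log densities over observations.
    n_obs = len(log_lik[0])
    return sum(log_mean_exp([draw[i] for draw in log_lik]) for i in range(n_obs))

def elpd_difference(log_lik_complex, log_lik_simple):
    # Positive values favor the more complicated model.
    return elpd(log_lik_complex) - elpd(log_lik_simple)
```

With this sign convention, a gene for which `elpd_difference` is positive would be one where the more complicated model predicts the held-out data better than the simpler one.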

**Figure 5**
The ELPD values for each method are presented here. Dots higher than the red dashed line indicate genes that are more predictive using the model on the vertical axis. The top two panels indicate genes that may have informative annotations as given by ELPD. The bottom‐right panel shows the agreement between DART and LN‐0. Finally, the distribution of the number of *cis*‐SNPs $p$ across the genome is presented in the bottom‐left.
