Genet Epidemiol. 2023 Feb;47(1):26-44.
doi: 10.1002/gepi.22505. Epub 2022 Nov 9.

Sparse prediction informed by genetic annotations using the logit normal prior for Bayesian regression tree ensembles

Charles Spanbauer et al. Genet Epidemiol. 2023 Feb.

Abstract

Using high-dimensional genetic variants such as single nucleotide polymorphisms (SNPs) to predict complex diseases and traits has important applications in basic research and clinical settings. For example, predicting gene expression is a necessary first step in identifying (putative) causal genes in transcriptome-wide association studies. Building such a prediction model is challenging because of weak signals, high dimensionality, and linkage disequilibrium (correlation) among SNPs. However, functional annotations at the SNP level (e.g., epigenomic data across multiple cell or tissue types) are available and could be used to inform predictor importance and aid in outcome prediction. Existing approaches to incorporating annotations have been based mainly on (generalized) linear models. Bayesian additive regression trees (BART), in contrast, is a reliable method for obtaining high-quality nonlinear out-of-sample predictions without overfitting. Unfortunately, the default BART prior may be too inflexible to handle sparse situations in which the number of predictors approaches or surpasses the number of observations. Motivated by our real data application, this article proposes an alternative prior based on the logit normal distribution, which provides a framework that is adaptive to sparsity and can model informative functional annotations. It also provides a framework for incorporating prior information about between-SNP correlations. Computational details for carrying out inference are presented along with the results from a simulation study and a genome-wide prediction analysis of the Alzheimer's Disease Neuroimaging Initiative data.

Keywords: ensemble learning; genetics; high-dimensional prediction; sparsity.

Figures

Figure 1
Depiction of the workflow for performing a genome‐wide scan using BART while also incorporating the functional annotations. Note how the outcome Y affects the terminal node values, while the predictors and annotations, X and A, affect the splitting rules for the interior nodes.
Figure 2
Toy example depicting a regression tree ensemble with H=3 along with the resulting partition of the predictor space. The numbers on the right are the values in each partition of the predictor space after adding the appropriate terminal node values on the left. The horizontal lines at 0.25 and 0.75 represent the partitions in x2 while the vertical lines at 0.3, 0.5, and 0.7 represent the partitions in x1. This demonstrates how regression tree ensembles can reliably estimate nonlinearity and interaction.
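The additive structure described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the split points (0.3, 0.5, 0.7 for x1; 0.25, 0.75 for x2) come from the caption, but which tree carries which split and every terminal node value below are hypothetical, since the figure itself is not reproduced here.

```python
from typing import Callable, List

# A tree maps a point (x1, x2) to the value of the terminal node it falls in.
Tree = Callable[[float, float], float]


def tree_1(x1: float, x2: float) -> float:
    # Splits on x1 at 0.5, then on x2 at 0.25 in the left branch only.
    # The nested split is what lets the ensemble express an interaction.
    if x1 < 0.5:
        return 1.0 if x2 < 0.25 else 2.0
    return -1.0


def tree_2(x1: float, x2: float) -> float:
    # Splits on x1 at 0.3 and 0.7 (hypothetical terminal values).
    if x1 < 0.3:
        return 0.5
    return 0.0 if x1 < 0.7 else -0.5


def tree_3(x1: float, x2: float) -> float:
    # Splits on x2 at 0.75 (hypothetical terminal values).
    return 0.25 if x2 < 0.75 else -0.25


ENSEMBLE: List[Tree] = [tree_1, tree_2, tree_3]  # H = 3 trees


def predict(x1: float, x2: float) -> float:
    # The ensemble prediction is the sum of the terminal node values
    # selected by each tree for the same input point.
    return sum(tree(x1, x2) for tree in ENSEMBLE)


if __name__ == "__main__":
    for point in [(0.1, 0.1), (0.4, 0.5), (0.6, 0.9)]:
        print(point, predict(*point))
```

Each tree is piecewise constant, so the sum is piecewise constant on the grid formed by all split points together, which is the partition the caption describes.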
Figure 3
Contours for the pdf of a normally distributed bivariate random variable ψ in the top row and the corresponding contours from the pdf of a three‐category logit normal random variable s in the bottom row. For all five columns the mean of the normally distributed variables is (0.5,0). The variances for the first three columns are 1, 4, and 0.25, respectively, and the two elements of ψ are independent. For the fourth and fifth columns, the correlation between the two elements of ψ is 0.5 and 0.5, respectively, with variance of 1 for both.
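The relationship between ψ and s can be illustrated by sampling. The sketch below assumes the standard additive logistic transform, in which a zero is appended to ψ as a reference category before normalizing; the abstract does not state which parameterization the paper uses, so treat that choice and the function name `sample_logit_normal` as assumptions.

```python
import math
import random
from typing import Tuple


def sample_logit_normal(
    mean: Tuple[float, float],
    sd: Tuple[float, float],
    rho: float,
    rng: random.Random,
) -> Tuple[float, float, float]:
    """Draw one three-category logit normal vector s from a bivariate normal psi."""
    # Build a correlated bivariate normal draw from two independent
    # standard normal draws (Cholesky factor of a 2x2 correlation matrix).
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    psi1 = mean[0] + sd[0] * z1
    psi2 = mean[1] + sd[1] * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)

    # Additive logistic transform: the third category is the reference,
    # with an implicit value of 0 on the psi scale (assumed convention).
    e1, e2 = math.exp(psi1), math.exp(psi2)
    denom = 1.0 + e1 + e2
    return (e1 / denom, e2 / denom, 1.0 / denom)


if __name__ == "__main__":
    rng = random.Random(0)
    # First column of the figure: mean (0.5, 0), variance 1, independent.
    for _ in range(3):
        print(sample_logit_normal((0.5, 0.0), (1.0, 1.0), 0.0, rng))
```

Every draw lies on the probability simplex (positive entries summing to 1). Larger variances in ψ push draws toward the corners of the simplex, which is the sense in which such a prior can favor sparse weights.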
Figure 4
The ELPD difference for each model comparison is visualized in this figure, with a higher ELPD difference indicating a preference for the more complex model. The top‐left panel gives the results for the standard BART versus NULL comparison, the top‐right panel gives the results for the LN‐A versus LN‐0 prior, the bottom‐right panel gives the results for the LN‐0 prior versus BART, and finally the bottom‐left panel gives the results for DART versus BART.
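An ELPD (expected log pointwise predictive density) difference of the kind compared here could be computed roughly as follows. This excerpt does not say how the paper estimates ELPD, so the sketch assumes the simplest setting: each model supplies a matrix of log predictive densities for held-out observations, one row per posterior draw. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_mean_exp(values: Sequence[float]) -> float:
    # Numerically stable log of the average of exp(values).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def elpd(log_lik: List[List[float]]) -> float:
    # log_lik[s][i] is the log predictive density of held-out observation i
    # under posterior draw s. For each observation, average the density over
    # draws, take the log, then sum over observations.
    n_obs = len(log_lik[0])
    return sum(log_mean_exp([draw[i] for draw in log_lik]) for i in range(n_obs))


def elpd_difference(complex_model: List[List[float]],
                    simple_model: List[List[float]]) -> float:
    # Positive values favor the more complex model.
    return elpd(complex_model) - elpd(simple_model)


if __name__ == "__main__":
    # Two posterior draws, two held-out observations (made-up numbers).
    complex_ll = [[math.log(0.5), math.log(0.5)], [math.log(0.5), math.log(0.5)]]
    simple_ll = [[math.log(0.25), math.log(0.25)], [math.log(0.25), math.log(0.25)]]
    print(elpd_difference(complex_ll, simple_ll))
```

In practice the pointwise terms are often estimated by cross-validation rather than a single held-out set; the subtraction of the two totals is the same either way.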
Figure 5
The ELPD values for each method are presented here. Dots higher than the red dashed line indicate genes that are more predictive using the model on the vertical axis. The top two panels indicate genes that may have informative annotations as given by ELPD. The bottom‐right panel shows the agreement between DART and LN‐0. Finally, the distribution of the number of cis‐SNPs p across the genome is presented in the bottom‐left.
