Fitting logistic models with flexBART • flexBART

Overview

flexBART now supports fitting logistic models with Bayesian ensembles of regression trees. Specifically, it can be used to fit models of the form \[ Y \sim \text{Binomial}(p) \] where the log-odds of $p$ can be modeled as a linear combination of covariates \[ \log\left(\frac{p}{1-p}\right) = \beta_{0}(X) + \beta_{1}(X) \times Z_{1} + \cdots + \beta_{R-1}(X) \times Z_{R-1} \] and each function $\beta_{r}(X)$ is approximated with an ensemble of regression trees. flexBART allows the size (specified using the optional argument M_vec or M), the tree prior hyperparameters (specified using the optional arguments alpha_vec and beta_vec), and the variables used for splitting (specified via the formula argument) to vary in each ensemble.

Example

We first demonstrate logit_flexBART’s functionality using data generated from the model $p = \beta_{0}(X)$, $Y \sim \text{Binomial}(p)$ where $X$ contains $p_{\text{cont}} = 5$ continuous covariates.

We generate $n = 4000$ training observations and $n = 1000$ testing observations.

set.seed(101)
n_train <- 4000
n_test <- 1000
p_cont <- 5

train_data <- data.frame(Y = rep(NA, times = n_train))
test_data <- data.frame(Y = rep(NA, times = n_test))
for(j in 1:p_cont){
  train_data[,paste0("X",j)] <- 
    runif(n = n_train, min = -1, max = 1)
  test_data[,paste0("X",j)] <- 
    runif(n = n_test, min = -1, max = 1)
}

We define the true log-odds function below.

lambda_true <- function(df){
  return(cos(2 * pi * df[,"X3"]) + df[,"X1"]^3)
}

We now evaluate the true slopes and response surface function $\mathbb{E}[Y \vert X, Z]$ on the training and testing observations

lambda_train <- lambda_true(train_data)
lambda_test <- lambda_true(test_data)

prob_train <- 1/(1 + exp(-lambda_train))
prob_test <- 1/(1 + exp(-lambda_test))
train_data[,"Y"] <- rbinom(n = length(lambda_train), size = 1, p = prob_train)

We’re now ready to fit our model.

fit <-
  flexBART::flexBART(formula = Y ~ bart(.),
                     family = binomial(link = "logit"),
                     train_data = train_data,
                     test_data = test_data
                     )

flexBART::flexBART returns the posterior mean estimate and posterior samples for each $\beta_{j}(x)$ evaluation (outputs named beta) and for the whole response surface (outputs named prob). We first plot the posterior mean evaluations of each $\beta_{j}$ and the response surface for both training and testing.

# Okabe-Ito colorblind friendly palette
oi_colors <- c("#999999", "#E69F00", "#56B4E9", "#009E73", 
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
par(mar = c(3,3,2,1), mgp = c(1.8, 0.5, 0), mfrow = c(1,2))

plot(prob_train, fit$prob.train.mean, pch = 16, cex = 0.5,
     xlab = "Actual", ylab = "Posterior Mean", main = "Response surface (train)")
abline(a = 0, b = 1, col = oi_colors[3])

plot(prob_test, fit$prob.test.mean, pch = 16, cex = 0.5,
     xlab = "Actual", ylab = "Posterior Mean", main = "Response surface (test)")
abline(a = 0, b = 1, col = oi_colors[3])

### Pointwise uncertainty We can also assess the coverage of the pointwise credible intervals

coverage <- 
  matrix(nrow = 2, ncol = 1,
         dimnames = list(c("train", "test"), "p"))

p_train_l95 <- 
  apply(fit$prob.train, MARGIN = 2, FUN = quantile, probs = 0.025)
p_train_u95 <- 
  apply(fit$prob.train, MARGIN = 2, FUN = quantile, probs = 0.975)

p_test_l95 <- 
  apply(fit$prob.test, MARGIN = 2, FUN = quantile, probs = 0.025)
p_test_u95 <- 
  apply(fit$prob.test, MARGIN = 2, FUN = quantile, probs = 0.975)

coverage["train", "p"] <-
  mean(prob_train >= p_train_l95 & prob_train <= p_train_u95)

coverage["test", "p"] <-
  mean(prob_test >= p_test_l95 & prob_test <= p_test_u95)

print(round(coverage, digits = 3))
#>           p
#> train 0.997
#> test  0.980

Variable selection

By default flexBART::flexBART uses Linero (2018)’s sparsity-inducing Dirichlet prior on splitting variables. Following the recommendations from Section 3.2 of that paper, we can compute the posterior probability that each $X_{j}$ is used at least once in the ensemble for each $\beta_{r}(X).$ We can then select those covariates for which this probability exceeds 50%. The varcounts element of the output is an array holding the posterior samples of the counts of rules based on covariate.

dim(fit$varcounts)
#> [1] 4000    5    1

The first dimension of fit$varcounts indexes MCMC samples. The second indexes covariates (i.e., elements of $X$). The third indexes ensembles (i.e., $\beta_{0}$, , $\beta_{R-1}$).

The selected variables are

marginal_split_prob <- 
  apply(fit$varcounts >= 1, MARGIN = c(2,3), FUN = mean)
which(marginal_split_prob >= 0.5, arr.ind = TRUE)
#>    row col
#> X1   1   1
#> X3   3   1

We see that $X_{1}$ $X_{3}$ are selected. So, at least for this run, there are no false positive and no false negative identifications.