
Background

Many implementations of BART represent categorical predictors using several binary indicators, one for each level of each categorical predictor. Axis-aligned decision rules are well-defined with these indicators: they send one level of a categorical predictor to the left and all other levels to the right (or vice versa). Regression trees built with these rules partition the set of all levels of a categorical predictor by recursively removing one level at a time. Unfortunately, most partitions of the levels cannot be built with this “remove one at a time” strategy, meaning these implementations are extremely limited in their ability to “borrow strength” across groups of levels. In other words, the combination of one-hot encoding and axis-aligned decision rules prevents trees from forming small subgroups of observations defined by combinations of levels of categorical predictors. Please see Deshpande (2025) for a more detailed argument against one-hot encoding categorical predictors.
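
To make the one-hot encoding concrete, here is a small illustration (a hypothetical snippet, not flexBART code) using model.matrix(): a single three-level factor becomes three 0/1 indicator columns, and an axis-aligned split on any one of those columns can only separate a single level from all the others.

# Hypothetical illustration: one-hot encode a single 3-level factor
x <- factor(c("A", "B", "C", "B"), levels = c("A", "B", "C"))
model.matrix(~ x - 1)  # one 0/1 indicator column per level (xA, xB, xC)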

In flexBART, decision rules based on a categorical predictor $X$ take the form $\{X \in \mathcal{C}\}$, where $\mathcal{C}$ is a random subset of the discrete levels that $X$ can assume. To support such decision rules, flexBART implements a new decision rule prior for categorical predictors in which, conditionally on the splitting variable $X$, the “cutset” $\mathcal{C}$ is formed by looping over all available values of $X$ and randomly assigning each to $\mathcal{C}$ with probability 1/2.
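
The few lines below sketch that idea; they only illustrate the “include each level independently with probability 1/2” step, not flexBART's actual implementation, and they ignore the detail that the drawn cutset should be a non-empty, proper subset of the available levels.

# Illustrative sketch: draw a random cutset from the levels available at a node,
# including each level independently with probability 1/2
available_levels <- LETTERS[1:10]
draw_cutset <- function(lev, prob = 0.5){
  lev[runif(length(lev)) < prob]
}
set.seed(99)
draw_cutset(available_levels)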

If a variable $X$ contained in train_data is of class “factor”, flexBART determines the initial set of available values of $X$ by looking at the levels of the corresponding column of train_data. As a result, flexBART is able to make predictions at new values of a categorical predictor not present in the training data, so long as these values are included as levels of that factor.

Although flexBART will convert predictors of type “character” into “factor” variables, it is strongly recommended that you ensure all categorical predictors are of class “factor” before calling flexBART().
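
For instance, if a categorical predictor is initially stored as a character vector, you can convert it yourself and declare every value it can take, even values that happen not to appear in the training data. The snippet below uses a made-up predictor, not one from the example that follows.

# Hypothetical character predictor: "West" never appears in these data but is
# declared as a level, so predictions at "West" remain possible later on
region_chr <- c("North", "South", "East", "North")
region <- factor(region_chr, levels = c("North", "South", "East", "West"))
levels(region)  # all four declared levels
table(region)   # "West" has zero observations but is still a valid level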

Example

Here is an example, taken from Section 4 of Deshpande (2025), that illustrates how flexBART works with categorical predictors. We first create a training data frame of $n = 10{,}000$ observations and a test data frame of 5,000 observations, each containing $p_{\textrm{cont}} = 10$ continuous variables (drawn uniformly from $[0,1]$) and $p_{\textrm{cat}} = 10$ categorical variables (drawn uniformly from the discrete set $\{A, B, C, D, E, F, G, H, I, J\}$).

set.seed(428)
n_train <- 10000
n_test <- 5000
p_cont <- 10
p_cat <- 10

train_data <- data.frame(Y = rep(NA, times = n_train))
test_data <- data.frame(Y = rep(NA, times = n_test))

for(j in 1:p_cont){
  train_data[,paste0("X",j)] <- runif(n = n_train, min = 0, max = 1)
  test_data[,paste0("X",j)] <- runif(n_test, min = 0, max = 1)
}

for(j in 1:p_cat){
  train_data[,paste0("X",j+p_cont)] <- 
    factor(sample(LETTERS[1:10], size = n_train, replace = TRUE), 
           levels = LETTERS[1:10])
  test_data[,paste0("X", j+p_cont)] <-
    factor(sample(LETTERS[1:10], size = n_test, replace = TRUE), 
           levels = LETTERS[1:10])
}
                                             

The following code block defines a regression function (mu_true) that depends on the continuous predictors $X_{1}, \ldots, X_{5}$ and the categorical predictor $X_{11}$.

friedman_part1 <- function(df){
  return(10 * sin(pi * df$X1 * df$X2)) 
}
friedman_part2 <- function(df){
  return(10 * (df$X3 - 0.5)^2)
}
friedman_part3 <- function(df){
  return(10 * (df$X3 - 0.5)^2)
}
 
friedman_part4 <- function(df){
  return(10*df$X4 + 5 * df$X5)
} 

mu_true <- function(df){
  stopifnot(all(paste0("X",c(1:5, 11)) %in% colnames(df))) 
  stopifnot(is.factor(df$X11))
  stopifnot(all(levels(df$X11) %in% LETTERS[1:10]))
  
  tmp_f1 <- friedman_part1(df)
  tmp_f2 <- friedman_part2(df)
  tmp_f3 <- friedman_part3(df)
  tmp_f4 <- friedman_part4(df)
  
  mu <- rep(NA, times = nrow(df))

  ixA <- which(df$X11 == "A")
  if(length(ixA) > 0){
    mu[ixA] <- tmp_f1[ixA] + tmp_f2[ixA]
  }
  
  ixB <- which(df$X11 == "B")
  if(length(ixB) > 0){
    mu[ixB] <- tmp_f3[ixB] + tmp_f4[ixB]
  }
  
  ixC <- which(df$X11 == "C")
  if(length(ixC) > 0){
    mu[ixC] <- tmp_f1[ixC] + tmp_f3[ixC]
  }
  
  ixD <- which(df$X11 == "D")
  if(length(ixD) > 0){
    mu[ixD] <- tmp_f2[ixD] + tmp_f3[ixD]
  }
  
  ixE <- which(df$X11 == "E")
  if(length(ixE) > 0){
    mu[ixE] <- tmp_f1[ixE] + tmp_f4[ixE]
  }
  
  ixF <- which(df$X11 == "F")
  if(length(ixF) > 0){
    mu[ixF] <- tmp_f2[ixF] + tmp_f4[ixF]
  }
  
  ixG <- which(df$X11 == "G")
  if(length(ixG) > 0){
    mu[ixG] <- tmp_f1[ixG] + tmp_f2[ixG] + tmp_f3[ixG]
  }
  
  ixH <- which(df$X11 == "H")
  if(length(ixH) > 0){
    mu[ixH] <- tmp_f1[ixH] + tmp_f2[ixH] + tmp_f4[ixH]
  }
  
  ixI <- which(df$X11 == "I")
  if(length(ixI) > 0){
    mu[ixI] <- tmp_f1[ixI] + tmp_f3[ixI] + tmp_f4[ixI]
  }
  
  ixJ <- which(df$X11 == "J")
  if(length(ixJ) > 0){
    mu[ixJ] <- tmp_f2[ixJ] + tmp_f3[ixJ] + tmp_f4[ixJ]
  }
  
  return(mu)
  
}

We now evaluate the regression function on the training and testing data and generate the training observations.

mu_train <- mu_true(train_data)
mu_test <- mu_true(test_data)
sigma <- 0.5
train_data$Y <- mu_train + sigma * rnorm(n = n_train, mean = 0, sd = 1)
test_data <- test_data[,colnames(test_data) != "Y"]

Removing the empty Y column from the test data is not strictly necessary.

We can now fit the model using flexBART().

fit <- flexBART::flexBART(formula = Y ~ bart(.), train_data = train_data)

Internal Representation of Categorical Levels

Internally, flexBART() represents the levels of a categorical predictor with consecutive non-negative integers starting from zero. It first maps the reference level of the factor variable to 0 and then maps the remaining levels to positive integers. Importantly, flexBART() determines the different levels of a categorical predictor by calling the levels() function and not by taking the unique values of the predictor passed through the train_data argument.
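
Conceptually, the mapping behaves like the short sketch below (the actual bookkeeping happens inside flexBART's compiled code); note that it is levels(x), not unique(x), that determines the ordering.

# Illustrative sketch of the level-to-integer mapping: the reference level
# (the first element of levels(x)) is coded 0 and the rest follow in order
x <- factor(c("B", "A", "C"), levels = c("A", "B", "C"))
data.frame(integer_coding = seq_along(levels(x)) - 1L, value = levels(x))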

We can access the internal mapping via the dinfo attribute of flexBART()’s output (i.e., an object of class “flexBART”). This attribute is, essentially, a list recording important information about the training data, such as the number of continuous and categorical predictors (dinfo$p_cont and dinfo$p_cat); the predictor names (dinfo$cont_names, dinfo$cat_names); and the information needed to scale the continuous predictors to the interval $[0,1]$ (dinfo$x_min, dinfo$x_max, dinfo$x_sd). The cat_mapping_list element of dinfo is a list of length equal to the number of categorical predictors. Each element of cat_mapping_list is a data frame whose columns list the levels of the corresponding predictor and their non-negative integer codings. For instance, here is the mapping used for the variable $X_{11}$ in this example:

fit$dinfo$cat_mapping_list[["X11"]]
#>    integer_coding value
#> 1               0     A
#> 2               1     B
#> 3               2     C
#> 4               3     D
#> 5               4     E
#> 6               5     F
#> 7               6     G
#> 8               7     H
#> 9               8     I
#> 10              9     J

While it is unlikely that you will ever need this mapping in practice, predict.flexBART() relies on it to make out-of-sample predictions.

Predictions

We can get posterior samples of the regression function evaluated at the test set observations using predict(). As when training the model, no additional external pre-processing of the categorical features is needed. We then compute simple posterior summaries.

test_preds <- predict(object = fit, newdata = test_data)
test_summary <-
  data.frame(MEAN = apply(test_preds, MARGIN = 2, FUN = mean),
             L95 = apply(test_preds, MARGIN = 2, FUN = quantile, probs = 0.025),
             U95 = apply(test_preds, MARGIN = 2, FUN = quantile, probs = 0.975))

Figure 1 plots the posterior mean estimates against the actual regression function evaluations at the training (left) and testing (right) observations.

oi_colors <- palette.colors(palette = "Okabe-Ito")
train_limits <- range(c(mu_train, fit$yhat.train.mean))

test_limits <- range(c(mu_test, test_summary$MEAN))
par(mar = c(3,3,2,1), mgp = c(1.8, 0.5, 0), mfrow = c(1,2))

plot(mu_train, fit$yhat.train.mean,
     pch = 16, cex = 0.5,
     xlim = train_limits, ylim = train_limits,
     xlab = "Actual", ylab = "Posterior Mean",
     main = "Training Evaluations")
abline(a = 0, b = 1, col = oi_colors[3])

plot(mu_test, test_summary$MEAN,
     pch = 16, cex = 0.5,
     xlim = test_limits, ylim = test_limits,
     xlab = "Actual", ylab = "Posterior Mean",
     main = "Testing Evaluations")
abline(a = 0, b = 1, col = oi_colors[3])
Figure 1: Actual (horizontal) vs. fitted (vertical) values of the regression function evaluated at the training (left) and testing (right) observations.

At least for these data, the pointwise out-of-sample posterior credible intervals display higher-than-nominal coverage.

mean( mu_test >= test_summary$L95 & mu_test <= test_summary$U95)
#> [1] 0.9776