
Welcome to version 2.0 of the flexBART package! flexBART (>= 2.0.0) is a new implementation of BART and VCBART that is designed to fit flexible varying coefficient models using ensembles of binary regression trees. In addition to the flexible priors for categorical decision rules introduced in earlier versions, this new version introduces a formula interface and automates much of the data pre-processing, which (hopefully) makes it easier than ever to fit BART models.

Installation & Basic Usage

It is highly recommended that you install R version 4.0.0 or later. Before installing flexBART, ensure that you have set up an appropriate C++ toolchain for your system:

  • For macOS: we recommend using the macrtools package
  • For Windows: we recommend using Rtools, which can be downloaded here. Please make sure you download the version of Rtools that corresponds to your R version (e.g., RTools45 for R version 4.5.x)
  • For Linux: we recommend following these instructions from the Stan development team.

Once your C++ toolchain is configured, you can install flexBART using devtools::install_github:

devtools::install_github(repo = "skdeshpande91/flexBART")

Basic Usage

Starting in version 2.0.0, flexBART features a formula interface and allows users to pass their data as data.frame or tibble objects. So, given a data frame train_data containing named columns for an outcome (e.g., Y) and predictors, you can fit a simple BART model to predict Y using all the predictors by running

flexBART(formula = Y ~ bart(.), train_data = train_data)
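
For instance, the following sketch simulates a small training set and fits the model; the data-generating process and variable names are purely illustrative.

set.seed(99)
n <- 100
train_data <- data.frame(X1 = runif(n), X2 = runif(n))
train_data$Y <- sin(pi * train_data$X1) + train_data$X2 + rnorm(n, sd = 0.5)
fit <- flexBART(formula = Y ~ bart(.), train_data = train_data)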

flexBART also supports fitting VCBART models of the form

\[ Y = \beta_{0}(X) + \beta_{1}(X)Z_{1} + \cdots + \beta_{R}(X)Z_{R} + \sigma\epsilon; \quad \epsilon \sim N(0,1), \]

where each coefficient function \(\beta_{r}(X)\) is approximated with its own tree ensemble. To fit such a model in flexBART, you can use a formula like Y ~ bart(.) + Z1 * bart(.) + Z2 * bart(.), including a separate bart() for each coefficient function.
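
For example, assuming train_data contains columns Y, Z1, Z2, and additional predictors, a two-modifier VCBART fit looks like

fit <- flexBART(formula = Y ~ bart(.) + Z1 * bart(.) + Z2 * bart(.),
                train_data = train_data)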

The formula interface also provides fine control over the predictor variables used in each ensemble. To allow an ensemble to split on only a few variables (e.g., X1, X2, and X3), you would specify bart(X1 + X2 + X3); to allow an ensemble to split on all variables except X1 and X2, you would specify bart(. - X1 - X2). Note that when it detects multiple ensembles in the formula, flexBART will not include any of the \(Z_{r}\)'s as splitting variables when it expands the . shorthand. So, to include, say, a piecewise linear function \(X_{1}\beta_{1}(X_{1}),\) you would need to specify X1 * bart(X1) in the formula argument.
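
A few illustrative formulas (all predictor names are hypothetical):

Y ~ bart(X1 + X2 + X3)        # split only on X1, X2, and X3
Y ~ bart(. - X1 - X2)         # split on everything except X1 and X2
Y ~ bart(.) + X1 * bart(X1)   # piecewise linear term X1 * beta1(X1)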

By default, flexBART simulates 4 Markov chains with 2,000 iterations each and discards the first 1,000 iterations as “burn-in.” The numbers of chains, burn-in iterations, and post-burn-in iterations can be adjusted using the optional arguments n.chains, burn, and nd.
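
For instance, to run two chains of 4,000 iterations each, keeping the last 2,000 draws from each chain (a minimal sketch using the arguments named above):

fit <- flexBART(formula = Y ~ bart(.), train_data = train_data,
                n.chains = 2, burn = 2000, nd = 2000)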

Pre-processing

Like earlier versions (e.g., 1.2.0 and earlier), the latest version of flexBART assumes that all continuous predictors have been re-scaled to the interval [-1,1] and represents the distinct values of categorical predictors with non-negative integers. But unlike those earlier versions, which required users to perform such re-scaling and conversion themselves, flexBART now automates the pre-processing.

Manually specifying cutpoints for numeric predictors

Internally, flexBART treats all predictors passed as a factor or character as categorical. It then checks whether each numeric predictor is discrete (e.g., age measured in years) or continuous by looking at the pairwise differences between its consecutive values. Decision rules based on numeric predictors take the form \(\{X_{j} < c\}.\)

If flexBART detects that \(X_{j}\) is continuous, it will rescale the supplied values of \(X_{j}\) to the interval [-1,1] and allow regression trees to select the cutpoint \(c\) uniformly from that interval. flexBART adds \(0.1\text{sd}(X_{j})\) to the maximum value of \(X_{j}\) and subtracts \(0.1\text{sd}(X_{j})\) from the minimum value of \(X_{j}\) before re-scaling the predictor. If testing data is provided, flexBART determines the min, max, and standard deviation of \(X_{j}\) using both the training and testing data.
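
The following is a minimal sketch of that transformation; it mimics the behavior described above and is not flexBART's actual internal code.

# Pad the observed range by 0.1 * sd(x), then map the padded range onto [-1, 1]
rescale_continuous <- function(x) {
  lo <- min(x) - 0.1 * sd(x)
  hi <- max(x) + 0.1 * sd(x)
  2 * (x - lo) / (hi - lo) - 1
}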

If, on the other hand, flexBART determines that \(X_{j}\) is discrete, it does not re-scale the predictor and instead forces regression trees to select the cutpoint \(c\) from the unique values of \(X_{j}.\) If testing data is provided, flexBART determines the unique values of \(X_{j}\) using both the training and testing data.
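
In other words (again a sketch rather than the internal implementation), the candidate cutpoints for a discrete predictor are simply

# Unique values pooled across the training and testing data
candidate_cutpoints <- sort(unique(c(x_train, x_test)))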

Priors for categorical predictors

In flexBART ensembles, decision rules based on categorical predictors take the form \(\{X_{j} \in \mathcal{C}\}\) where \(\mathcal{C}\) is a random subset of the discrete values that \(X_{j}\) can assume. This is in stark contrast to most other implementations of BART, which one-hot encode categorical predictors. Please see Deshpande (2024) for arguments against the use of one-hot encoding with BART.

Internally, flexBART determines the set of available values of \(X_{j}\) by looking at the levels() of all predictors saved as factor() variables. As a result, flexBART is able to make predictions at new values of a categorical predictor not present in the training data so long as these values are included as levels of that predictor.
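
So, to enable predictions at categories that never appear in the training data, declare them as levels up front. The column and level names here are hypothetical:

train_data$region <- factor(train_data$region,
                            levels = c("north", "central", "south", "west"))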

flexBART also includes support for network-structured categorical predictors (e.g., spatial areas with known adjacency structure). To force the “cutset” \(\mathcal{C}\) to correspond to connected components of these networks, you should provide the corresponding adjacency matrices via the adjacency_list argument. This argument should be a named list with one element per network-structured predictor. Each element should be a binary or weighted adjacency matrix whose row and column names correspond to the levels of the predictor. flexBART implements four different priors over decision rules for network-structured predictors. See the documentation and Section 3.2 of Deshpande (2024) for details about these priors.
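
As a hypothetical sketch, suppose a network-structured predictor region has three areas forming a line graph (north - central - south):

# Binary adjacency matrix whose row/column names match the factor levels
A <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("north", "central", "south"),
                            c("north", "central", "south")))
fit <- flexBART(formula = Y ~ bart(.), train_data = train_data,
                adjacency_list = list(region = A))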