factors: Character vector with name(s) of factors with rare levels.
data: data.frame containing the variables in the model. Response must be of class factor for classification, numeric for (count) regression, Surv for survival regression. Input variables must be of class numeric, factor or ordered factor. Otherwise, pre will attempt to recode.
sampfrac: numeric value >0 and ≤1. Specifies the fraction of randomly selected training observations used to produce each tree. Values <1 will result in sampling without replacement (i.e., subsampling), a value of 1 will result in sampling with replacement (i.e., bootstrap sampling). Alternatively, a sampling function may be supplied, which should take arguments n (sample size) and weights.
warning: logical. Whether a warning should be printed if observations with rare factor levels are added to the training sample of the current iteration.
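As a rough illustration of the sampling-function interface described under sampfrac, the sketch below defines a function that takes n and weights and returns row indices. The name my_sampler and the 0.75 fraction are assumptions made for this example only; they are not part of pre.

```r
# Hypothetical custom sampler: draws a subsample of 75% of the rows,
# without replacement, with selection probability proportional to weights.
my_sampler <- function(n, weights) {
  sample(seq_len(n), size = round(0.75 * n), replace = FALSE, prob = weights)
}

set.seed(1)
idx <- my_sampler(n = 20, weights = rep(1, times = 20))
length(idx)         # 15 row indices
anyDuplicated(idx)  # 0, because sampling is without replacement
```

A function of this shape could then be passed as the sampfrac argument instead of a numeric value.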
Returns
A sampling function, which generates sub- or bootstrap samples as usual in function pre, but checks whether all levels of the specified factor(s) are present and, if not, adds observations with the missing levels. If warning = TRUE, a warning is issued when observations are added.
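The behaviour described above can be sketched in base R as follows. This is a simplified, hypothetical re-implementation for illustration only, not the code used by rare_level_sampler; the name make_level_sampler is an assumption, and the sketch presumes that n equals the number of rows of data.

```r
# Hypothetical helper, NOT the implementation in package pre:
# wraps plain sub-/bootstrap sampling and tops the sample up with one
# observation for every factor level that the initial draw missed.
make_level_sampler <- function(factors, data, sampfrac = 0.5, warning = FALSE) {
  function(n, weights = rep(1, times = n)) {
    idx <- sample(seq_len(n), size = round(sampfrac * n),
                  replace = sampfrac == 1, prob = weights)
    for (f in factors) {
      x <- droplevels(factor(data[[f]]))
      missing_levels <- setdiff(levels(x), as.character(x[idx]))
      for (lev in missing_levels) {
        candidates <- which(x == lev)
        # Index into candidates so a single candidate is not misread by sample()
        idx <- c(idx, candidates[sample.int(length(candidates), size = 1)])
        if (warning) warning("Added an observation with level ", lev,
                             " of factor ", f)
      }
    }
    idx
  }
}

# Toy data: level "b" occurs only twice in 50 observations
dat <- data.frame(f = factor(rep(c("a", "b"), times = c(48, 2))),
                  y = seq_len(50))
sampler <- make_level_sampler("f", data = dat, sampfrac = 0.5)
set.seed(1)
idx <- sampler(n = nrow(dat))
all(levels(dat$f) %in% dat$f[idx])  # TRUE: every level is represented
```

Because extra observations are appended to the initial draw, the returned sample can be slightly larger than round(sampfrac * n).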
Details
Categorical predictor variables (factors) with rare levels may be problematic in boosting algorithms that employ sampling (the default in function pre).
If a sample in a given boosting iteration does not have any observations with a given (rare) level of a factor, while this level is present in the full training dataset, and the factor is selected for splitting in the tree, then no prediction for that level of the factor can be generated, resulting in an error. Note that boosting methods other than pre that also employ sampling (e.g., gbm or xgboost) may not generate an error in such cases, but also do not document how intermediate predictions are generated in such a case. It is likely that these methods use one-hot-encoding of factors, which from a perspective of model interpretation introduces new problems, especially when the aim is to obtain a sparse set of rules as in pre.
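To get a feeling for how often a sample misses a rare level, the following base R sketch computes that probability for subsampling without replacement. The numbers (105 observations, 5 of them with the rare level, a sampling fraction of 0.51) are illustrative assumptions. The result is only an upper bound on the chance of an error in a given iteration, because the factor must also be selected for splitting.

```r
n_total <- 105                   # training observations (assumed)
n_rare  <- 5                     # observations with the rare level (assumed)
n_draw  <- round(0.51 * n_total) # subsample size: 54

# Exact (hypergeometric) probability that a subsample contains
# none of the observations with the rare level
p_miss <- dhyper(0, m = n_rare, n = n_total - n_rare, k = n_draw)
p_miss  # about 0.024

# Check by simulation
set.seed(1)
is_rare <- rep(c(TRUE, FALSE), times = c(n_rare, n_total - n_rare))
sim <- replicate(10000, !any(is_rare[sample.int(n_total, size = n_draw)]))
mean(sim)

# Probability that at least one of 500 independent samples misses the level
1 - (1 - p_miss)^500  # very close to 1
```

Even a per-sample probability of a few percent thus makes it almost certain that some sample in a long boosting run lacks the rare level.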
With function pre(), the rare-factor-level issue, if encountered, can be dealt with by the user in one of the following ways (in no particular order):
Use a sampling function that guarantees inclusion of rare factor levels in each sample. E.g., use rare_level_sampler, yielding a sampling function which creates training samples guaranteed to include each level of specified factor(s). Advantage: No loss of information, easy to implement, guaranteed to solve the issue. Disadvantage: May result in oversampling of observations with rare factor levels, potentially biasing results. The bias is likely small though, and will be larger for smaller sample sizes and sampling fractions, and for larger numbers of rare levels. The latter will also increase computational demands.
Specify learnrate = 0. This results in a (su)bagging instead of boosting approach. Advantage: Eliminates the rare-factor-level issue completely, because intermediate predictions need not be computed. Disadvantage: Boosting with low learning rate often improves predictive accuracy.
Data pre-processing: Before running function pre(), combine rare factor levels with other levels of the factors. Advantage: Limited loss of information. Disadvantage: Likely, but not guaranteed to solve the issue.
Data pre-processing: Apply one-hot encoding to the predictor matrix before applying function pre(). This can easily be done with function model.matrix. Advantage: Guaranteed to solve the error, easy to implement. Disadvantage: One-hot encoding increases the number of predictor variables, which may reduce interpretability and, probably to a lesser extent, accuracy.
Data pre-processing: Remove observations with rare factor levels from the dataset before running function pre(). Advantage: Guaranteed to solve the error. Disadvantage: Removing observations results in a loss of information, and may bias the results.
Increase the value of sampfrac argument of function pre(). Advantage: Easy to implement. Disadvantage: Larger samples are more likely but not guaranteed to contain all possible factor levels, thus not guaranteed to solve the issue.
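The three data pre-processing options listed above can be sketched in base R as follows. The toy data, the object names and the threshold min_count = 3 for calling a level rare are assumptions made for this illustration; suitable choices depend on the dataset and on the sampling fraction.

```r
# Toy data (hypothetical): levels "c" and "d" each occur only twice
set.seed(1)
dat <- data.frame(
  y = rnorm(16),
  f = factor(rep(c("a", "b", "c", "d"), times = c(6, 6, 2, 2)))
)
min_count <- 3  # arbitrary threshold for calling a level rare
rare <- names(which(table(dat$f) < min_count))

## Combine rare levels into a single level "other"
dat_merged <- dat
levels(dat_merged$f)[levels(dat_merged$f) %in% rare] <- "other"
table(dat_merged$f)  # a: 6, b: 6, other: 4

## One-hot encode the factor with model.matrix
X <- model.matrix(~ f - 1, data = dat)
dat_onehot <- data.frame(y = dat$y, X)
names(dat_onehot)  # "y" "fa" "fb" "fc" "fd"

## Remove observations with rare levels and drop the now-empty levels
dat_removed <- droplevels(dat[!dat$f %in% rare, ])
nlevels(dat_removed$f)  # 2
```

Note that with several factors in the formula, dropping the intercept with - 1 yields a full set of indicator columns only for the first factor, so each factor may need to be encoded separately.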
Examples
## Create dataset with two factors containing rare levels
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])
dat$factor2 <- factor(rep(1:21, times = 5))

## Set up sampling function
samp_func <- rare_level_sampler(c("Species", "factor2"), data = dat,
                                sampfrac = .51, warning = TRUE)

## Illustrate what it does
N <- nrow(dat)
wts <- rep(1, times = nrow(dat))
set.seed(3)
dat[samp_func(n = N, weights = wts), ] # single sample
for (i in 1:500) dat[samp_func(n = N, weights = wts), ]
warnings() # to illustrate warnings that may occur when fitting a full PRE

## Illustrate use with function pre:
## (Note: low ntrees value merely to reduce computation time for the example)
set.seed(42)
# iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20) # would yield error
iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20, sampfrac = samp_func) # should work