This function provides a list of row indices used for k-fold cross-validation (basic, stratified, grouped, or blocked). Repeated fold creation is supported as well. By default, in-sample indices are returned.
create_folds(
  y,
  k = 5L,
  type = c("stratified", "basic", "grouped", "blocked"),
  n_bins = 10L,
  m_rep = 1L,
  use_names = TRUE,
  invert = FALSE,
  shuffle = FALSE,
  seed = NULL
)
Arguments
y: The variable used for stratified or grouped splits (type = "stratified" or type = "grouped"). For other split types, any vector with the same length as the data to be split.
k: Number of folds.
type: Split type. One of "stratified" (default), "basic", "grouped", "blocked".
n_bins: Approximate number of bins for numeric y (only for type = "stratified").
m_rep: How many times should the data be split into k folds? Default is 1, i.e., no repetitions.
use_names: Should folds be named? Default is TRUE.
invert: Set to TRUE in order to receive out-of-sample indices. Default is FALSE, i.e., in-sample indices are returned.
shuffle: Should row indices be randomly shuffled within folds? Default is FALSE.
seed: Integer random seed.
Returns
If invert = FALSE (the default), a list with in-sample row indices. If invert = TRUE, a list with out-of-sample indices.
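The relationship between the two kinds of return value can be illustrated in base R. This is only a sketch with made-up indices: the fold names and the hand-written out_of_sample list are assumptions for illustration, not output of create_folds(). For each fold, the in-sample indices are the rows that are not in that fold's out-of-sample set.

```r
# Hypothetical out-of-sample folds for 6 rows and k = 3 (made up for illustration)
n <- 6L
out_of_sample <- list(
  Fold1 = c(1L, 4L),
  Fold2 = c(2L, 5L),
  Fold3 = c(3L, 6L)
)

# In-sample indices are the complement of each out-of-sample set
in_sample <- lapply(out_of_sample, function(i) setdiff(seq_len(n), i))

in_sample$Fold1  # 2 3 5 6
```

In a cross-validation loop, a model would typically be fitted on the in-sample rows of a fold and evaluated on the remaining rows.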
Details
By default, the function uses stratified splitting, which balances the folds with respect to the distribution of the input vector y. (Numeric input is first binned into approximately n_bins quantile groups.) If type = "grouped", groups specified by y are kept together when splitting. This is relevant for clustered or panel data. In contrast to basic splitting, type = "blocked" does not sample indices at random, but rather keeps them in sequential groups.
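To make the difference between stratified and blocked splitting concrete, here is a minimal base-R sketch. It is not the package's implementation: sketch_stratified_folds() and sketch_blocked_folds() are hypothetical helper names, and the binning and assignment rules are simplifying assumptions. Both helpers return out-of-sample indices per fold.

```r
# Illustrative base-R sketch; NOT the package's implementation.
# Both helper names are hypothetical.
sketch_stratified_folds <- function(y, k = 5L, n_bins = 10L) {
  if (is.numeric(y)) {
    # Bin numeric y into (approximately) n_bins quantile groups
    probs <- seq(0, 1, length.out = n_bins + 1L)
    breaks <- unique(quantile(y, probs = probs, names = FALSE))
    y <- cut(y, breaks = breaks, include.lowest = TRUE)
  }
  fold_id <- integer(length(y))
  # Within each stratum, deal the rows out to the k folds in random order
  for (idx in split(seq_along(y), y)) {
    idx <- idx[sample.int(length(idx))]
    fold_id[idx] <- rep_len(seq_len(k), length(idx))
  }
  split(seq_along(y), factor(fold_id, levels = seq_len(k)))
}

sketch_blocked_folds <- function(y, k = 5L) {
  # No random sampling: rows are cut into k sequential chunks
  fold_id <- ceiling(seq_along(y) * k / length(y))
  split(seq_along(y), factor(fold_id, levels = seq_len(k)))
}

set.seed(1)
y <- rep(letters[1:4], each = 5)

strat <- sketch_stratified_folds(y, k = 5L)
# Each stratified fold contains rows from all four levels of y
lapply(strat, function(i) table(y[i]))

blocked <- sketch_blocked_folds(y, k = 3L)
# Each blocked fold is a run of consecutive row indices
blocked
```

Grouped splitting could be sketched along the same lines by assigning whole groups of y, rather than individual rows, to folds.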
Examples
y <- rep(c(letters[1:4]), each = 5)
create_folds(y)
create_folds(y, k = 2)
create_folds(y, k = 2, m_rep = 2)
create_folds(y, k = 3, type = "blocked")