nestcv.SuperLearner function

Outer cross-validation of SuperLearner model

Provides a single loop of outer cross-validation to evaluate the performance of ensemble models from the SuperLearner package.

nestcv.SuperLearner(
  y,
  x,
  filterFUN = NULL,
  filter_options = NULL,
  weights = NULL,
  balance = NULL,
  balance_options = NULL,
  modifyX = NULL,
  modifyX_useY = FALSE,
  modifyX_options = NULL,
  outer_method = c("cv", "LOOCV"),
  n_outer_folds = 10,
  outer_folds = NULL,
  parallel_mode = NULL,
  cv.cores = 1,
  final = TRUE,
  na.option = "pass",
  verbose = TRUE,
  ...
)

Arguments

  • y: Response vector

  • x: Dataframe or matrix of predictors. Matrix will be coerced to dataframe as this is the default for SuperLearner.

  • filterFUN: Filter function, e.g. ttest_filter or relieff_filter. Any function can be provided; it is passed y and x and should ideally return a numeric vector of indices of the filtered predictors. A custom function may instead return a character vector of names of the filtered predictors, but this will not work with the penalty.factor argument in nestcv.glmnet().

  • filter_options: List of additional arguments passed to the filter function specified by filterFUN.

  • weights: Weights applied to each sample for models which can use weights. Note weights and balance cannot be used at the same time. Weights are not applied in filters.

  • balance: Specifies the method for dealing with imbalanced class data. Current options are "randomsample" or "smote". Not available if outercv is called with a formula. See randomsample() and smote().

  • balance_options: List of additional arguments passed to the balancing function

  • modifyX: Character string specifying the name of a function to modify x. This can be an imputation function for replacing missing values, or a more complex function which alters or even adds columns to x. The required return value of this function depends on the modifyX_useY setting.

  • modifyX_useY: Logical value whether the x modifying function makes use of response training data from y. If FALSE then the modifyX function simply needs to return a modified x object, which will be coerced to a dataframe as required by SuperLearner. If TRUE then the modifyX function must return a model type object on which predict() can be called, so that train and test partitions of x can be modified independently.

  • modifyX_options: List of additional arguments passed to the x modifying function

  • outer_method: String of either "cv" or "LOOCV" specifying whether to do k-fold CV or leave one out CV (LOOCV) for the outer folds

  • n_outer_folds: Number of outer CV folds

  • outer_folds: Optional list containing indices of test folds for outer CV. If supplied, n_outer_folds is ignored.

  • parallel_mode: Either "mclapply" or "snow". This determines which parallel backend to use. The default is parallel::mclapply on unix/mac and snow on windows. snow uses parallelisation via SuperLearner::snowSuperLearner.

  • cv.cores: Number of cores for parallel processing of the outer loops.

  • final: Logical whether to fit final model.

  • na.option: Character value specifying how NAs are dealt with. "omit" is equivalent to na.action = na.omit. "omitcol" removes cases if there are NA in 'y', but columns (predictors) containing NA are removed from 'x' to preserve cases. Any other value means that NA are ignored (a message is given).

  • verbose: Logical whether to print messages and show progress

  • ...: Additional arguments passed to SuperLearner::SuperLearner()
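
As a minimal sketch of how two of the arguments above might be prepared, the base R code below builds an outer_folds list (one vector of test row indices per fold) and defines a simple x modifying function of the kind expected when modifyX_useY = FALSE. The toy data, the helper name median_impute and its signature are assumptions made for illustration; they are not part of nestedcv.

```r
set.seed(1)

# Toy regression data (hypothetical, for illustration only)
n <- 60
x <- matrix(rnorm(n * 4), nrow = n,
            dimnames = list(NULL, paste0("v", 1:4)))
y <- x[, 1] - 0.5 * x[, 2] + rnorm(n)

# outer_folds: a list in which each element holds the row indices
# of the test set for one outer fold
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))
folds <- split(seq_len(n), fold_id)
lengths(folds)

# An x modifying function for modifyX_useY = FALSE: it takes x and
# returns a modified x. Name and signature are assumptions.
median_impute <- function(x, ...) {
  x <- as.data.frame(x)
  for (j in seq_along(x)) {
    miss <- is.na(x[[j]])
    # Replace missing values in column j by the column median
    if (any(miss)) x[[j]][miss] <- median(x[[j]], na.rm = TRUE)
  }
  x
}

# Try the function on a copy of x with one missing value
x_na <- x
x_na[3, 2] <- NA
x_imp <- median_impute(x_na)
anyNA(x_imp)
```

Under these assumptions, the folds would be supplied as outer_folds = folds and the function by name as modifyX = "median_impute".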

Returns

An object with S3 class "nestcv.SuperLearner"

  • call: the matched call

  • output: Predictions on the left-out outer folds

  • outer_result: List object of results from each outer fold containing predictions on left-out outer folds, model result and number of filtered predictors at each fold.

  • dimx: vector of number of observations and number of predictors

  • y: original response vector

  • yfinal: final response vector (post-balancing)

  • outer_folds: List of indices of outer test folds

  • final_fit: Final fitted model on whole data

  • final_vars: Column names of filtered predictors entering final model

  • summary_vars: Summary statistics of filtered predictors

  • roc: ROC AUC for binary classification where available.

  • summary: Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression.
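
The components listed above can be inspected with ordinary list access. The following is a hedged sketch on toy regression data: it assumes the nestedcv and SuperLearner packages are installed, and it uses the SL.mean and SL.glm wrappers from SuperLearner purely as an example library. The family argument is left at SuperLearner's default, and the toy data are hypothetical.

```r
set.seed(42)

# Toy regression data (hypothetical, for illustration only)
n <- 80
x <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
y <- 2 * x$a - x$b + rnorm(n)

fit <- NULL
if (requireNamespace("nestedcv", quietly = TRUE) &&
    requireNamespace("SuperLearner", quietly = TRUE)) {
  # Attach SuperLearner so that its SL.* wrappers can be found by name
  library(SuperLearner)

  fit <- nestedcv::nestcv.SuperLearner(
    y, x,
    n_outer_folds = 5,
    SL.library = c("SL.mean", "SL.glm"),  # passed on to SuperLearner via ...
    cv.cores = 1,
    verbose = FALSE
  )

  print(class(fit))          # S3 class of the returned object
  print(fit$summary)         # overall performance summary
  print(fit$dimx)            # number of observations and predictors
  str(fit$outer_folds)       # indices of the outer test folds
  print(head(fit$output))    # predictions on the left-out outer folds
}
```

If either package is missing, fit simply stays NULL and nothing is fitted.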

Details

This performs an outer CV on SuperLearner package ensemble models to measure performance, allowing balancing of imbalanced datasets as well as filtering of predictors. SuperLearner prefers dataframes as inputs for the predictors. If x is a matrix it will be coerced to a dataframe and variable names adjusted by make.names().
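
The effect of coercing a matrix and passing its column names through make.names() can be seen in base R. This is only an illustration of what such an adjustment does to names that are not syntactically valid; the example column names are made up, and the exact call used inside the package may differ.

```r
# A small matrix whose column names are not all valid R names
m <- matrix(1:6, nrow = 2,
            dimnames = list(NULL, c("gene 1", "2nd-marker", "ok_name")))

# Coerce to a dataframe and adjust the variable names
df <- as.data.frame(m)
names(df) <- make.names(names(df))

names(df)
# "gene.1" "X2nd.marker" "ok_name"
```

Names used later to refer to predictors, for example in final_vars, would then be the adjusted ones rather than the original matrix column names.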

Parallelisation of the outer CV folds is available on linux/mac, but not available on windows. On windows, snowSuperLearner() is called instead, so that parallelisation is performed across each call to SuperLearner.
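
The platform-dependent default described above can be mimicked as follows. This is only an illustrative sketch of the selection logic in base R; it is not taken from the package source, and the variable name backend is hypothetical.

```r
# Choose a parallel backend in the way the documentation describes
# the default: snow on Windows, mclapply elsewhere
backend <- if (.Platform$OS.type == "windows") "snow" else "mclapply"
backend
```

The value could then be passed explicitly as parallel_mode = backend, which makes the choice visible in scripts that are shared across operating systems.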

Note

Care should be taken with some SuperLearner models, e.g. SL.gbm, as some models have multicore enabled by default. Combined with parallel processing of the outer folds, this can lead to huge numbers of processes being spawned.

See Also

SuperLearner::SuperLearner()

  • Maintainer: Myles Lewis
  • License: MIT + file LICENSE
  • Last published: 2025-03-10