In modelling, a baseline is a result against which it is meaningful to compare the results of our models. In regression, we want our model to be better than a model without any predictors. If our model does not perform better than such a simple model, it is unlikely to be useful.
baseline_gaussian() fits the intercept-only model (y ~ 1) on n random subsets of train_data and evaluates each model on test_data. Additionally, it evaluates a model fitted on all rows in train_data.
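To make the comparison concrete, the intercept-only baseline can be written out directly with stats::lm. This is only an illustrative sketch of what such a model predicts, not the code baseline_gaussian() runs internally:

# Illustrative sketch: an intercept-only model predicts the mean of y for every row
library(cvms)                                   # for the participant.scores data
m0 <- lm(score ~ 1, data = participant.scores)  # "y ~ 1"
coef(m0)                                        # the intercept equals mean(participant.scores$score)
unique(predict(m0))                             # a single constant prediction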
dependent_col: Name of dependent variable in the supplied test and training sets.
n: The number of random samplings of train_data to fit baseline models on. (Default is 100)
metrics: list for enabling/disabling metrics.
E.g. list("RMSE" = FALSE) would remove RMSE from the results, and list("TAE" = TRUE) would add the Total Absolute Error metric to the results. Default values (TRUE/FALSE) will be used for the remaining available metrics.
You can enable/disable all metrics at once by including "all" = TRUE/FALSE in the list. This is done prior to enabling/disabling individual metrics, so, for instance, list("all" = FALSE, "RMSE" = TRUE) would return only the RMSE metric. (A combined usage sketch follows this argument list.)
The list can be created with gaussian_metrics().
Also accepts the string "all".
random_effects: Random effects structure for the baseline model. (Character)
E.g. with "(1|ID)", the model becomes "y ~ 1 + (1|ID)".
min_training_rows: Minimum number of rows in the random subsets of train_data.
min_training_rows_left_out: Minimum number of rows left out of the random subsets of train_data.
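As a usage sketch tying these arguments together (it reuses the train_set and test_set objects created in the Examples section below; the argument values are arbitrary illustrations, not recommended settings):

# Keep only the RMSE metric, add a random intercept per participant ("(1|ID)"),
# and constrain the sizes of the random training subsets
baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  metrics = list("all" = FALSE, "RMSE" = TRUE),
  random_effects = "(1|ID)",
  min_training_rows = 10,
  min_training_rows_left_out = 5,
  n = 10
)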
See the additional metrics (disabled by default) at ?gaussian_metrics.
The Measure column indicates the statistical descriptor used on the evaluations. The row where Measure == All_rows contains the evaluation of the baseline model trained on all rows in train_data.
The Training Rows column contains the aggregated number of rows used from train_data when fitting the baseline models.
A nested tibble with the coefficients of the baseline models.
Number of training rows used when fitting the baseline model on the training set.
A nested Process information object with information about the evaluation.
Name of dependent variable.
Name of fixed effect (bias term only).
Random effects structure (if specified).
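A quick way to inspect these parts of the output is sketched below (again reusing train_set and test_set from the Examples section; the element names summarized_metrics and random_evaluations mirror other cvms baseline outputs and are an assumption here):

bsl <- baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  n = 10
)

# Evaluation of the model trained on all rows in train_data
dplyr::filter(bsl$summarized_metrics, Measure == "All_rows")

# Per-model evaluations, including the nested coefficients
bsl$random_evaluations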
Details
Packages used:
Models
stats::lm, lme4::lmer
Results
r2m : MuMIn::r.squaredGLMM
r2c : MuMIn::r.squaredGLMM
AIC : stats::AIC
AICc : MuMIn::AICc
BIC : stats::BIC
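When a random_effects structure is supplied, the baseline models are mixed models fitted with lme4::lmer. The sketch below shows what such an intercept-only mixed model looks like when fitted by hand; it mirrors, but is not, the function's internal code:

# Intercept-only mixed model with a random intercept per session
library(cvms)  # participant.scores
library(lme4)
m0_mixed <- lmer(score ~ 1 + (1 | session), data = participant.scores)
summary(m0_mixed)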
Examples
# Attach packages
library(cvms)
library(groupdata2)  # partition()
library(dplyr)       # %>% arrange()

# Data is part of cvms
data <- participant.scores

# Set seed for reproducibility
set.seed(1)

# Partition data
partitions <- partition(data, p = 0.7, list_out = TRUE)
train_set <- partitions[[1]]
test_set <- partitions[[2]]

# Create baseline evaluations
# Note: usually n = 100 is a good setting
baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  random_effects = "(1|session)",
  n = 2
)

# Parallelize evaluations
# Attach doParallel and register four cores
# Uncomment:
# library(doParallel)
# registerDoParallel(4)

# Make sure to uncomment the parallel argument
baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  random_effects = "(1|session)",
  n = 4
  #, parallel = TRUE  # Uncomment
)
See Also
Other baseline functions: baseline(), baseline_binomial(), baseline_multinomial()