baseline_gaussian function

Create baseline evaluations for regression models

Lifecycle: maturing

Create a baseline evaluation of a test set.

In modelling, a baseline is a result that is meaningful to compare the results from our models to. In regression, we want our model to be better than a model without any predictors. If our model does not perform better than such a simple model, it's unlikely to be useful.

baseline_gaussian() fits the intercept-only model (y ~ 1) on n random subsets of train_data and evaluates each model on test_data. Additionally, it evaluates a model fitted on all rows in train_data.
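When no random effects are given, a single baseline fit roughly amounts to the following (a minimal sketch, assuming train_set and test_set data frames with a numeric score column, as created in the Examples section below):

# Minimal sketch of a single baseline fit (no random effects)
fit <- lm(score ~ 1, data = train_set)           # intercept-only model
preds <- predict(fit, newdata = test_set)        # predicts the training-set mean
sqrt(mean((test_set$score - preds)^2))           # e.g. the RMSE metric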

Usage

baseline_gaussian(
  test_data,
  train_data,
  dependent_col,
  n = 100,
  metrics = list(),
  random_effects = NULL,
  min_training_rows = 5,
  min_training_rows_left_out = 3,
  REML = FALSE,
  parallel = FALSE
)

Arguments

  • test_data: data.frame.

  • train_data: data.frame.

  • dependent_col: Name of dependent variable in the supplied test and training sets.

  • n: The number of random samplings of train_data to fit baseline models on. (Default is 100)

  • metrics: list for enabling/disabling metrics.

    E.g. list("RMSE" = FALSE) would remove RMSE from the results, and list("TAE" = TRUE) would add the Total Absolute Error metric to the results. Default values (TRUE/FALSE) will be used for the remaining available metrics.

    You can enable/disable all metrics at once by including "all" = TRUE/FALSE in the list. This is done prior to enabling/disabling individual metrics, which is why, for instance, list("all" = FALSE, "RMSE" = TRUE) would return only the RMSE metric.

    The list can be created with gaussian_metrics().

    Also accepts the string "all". (See the example call sketched after this argument list.)

  • random_effects: Random effects structure for the baseline model. (Character)

    E.g. with "(1|ID)", the model becomes "y ~ 1 + (1|ID)".

  • min_training_rows: Minimum number of rows in the random subsets of train_data.

  • min_training_rows_left_out: Minimum number of rows left out of the random subsets of train_data.

    I.e. a subset will maximally have the size:

    max_rows_in_subset = nrow(train_data) - min_training_rows_left_out

  • REML: Whether to use Restricted Maximum Likelihood. (Logical)

  • parallel: Whether to run the n evaluations in parallel. (Logical)

    Remember to register a parallel backend first. E.g. with doParallel::registerDoParallel.
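For illustration, here is a sketch of a call that combines several of these arguments (train_set, test_set, and the score and session columns come from the Examples section below; the specific settings are arbitrary):

baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  random_effects = "(1|session)",  # baseline model becomes score ~ 1 + (1|session)
  metrics = list("all" = FALSE, "RMSE" = TRUE),  # return only the RMSE metric
  n = 10,
  min_training_rows = 10,
  min_training_rows_left_out = 5
)
# The metrics list could also be created with gaussian_metrics()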

Returns

list containing:

  1. a tibble with summarized results (called summarized_metrics)
  2. a tibble with random evaluations (random_evaluations)

....................................................................

The Summarized Results tibble contains:

Average RMSE, MAE, NRMSE(IQR), RRSE, RAE, and RMSLE.

See the additional metrics (disabled by default) at ?gaussian_metrics.

The Measure column indicates the statistical descriptor used on the evaluations. The row where Measure == All_rows is the evaluation when the baseline model is trained on all rows in train_data.

The Training Rows column contains the aggregated number of rows used from train_data when fitting the baseline models.
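For instance, assuming the result is stored as bsl (a hypothetical name; train_set and test_set are created in the Examples section below), the All_rows evaluation could be pulled out like this:

# Store the baseline result (hypothetical object name `bsl`)
bsl <- baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  n = 10
)

# The evaluation of the baseline model trained on all rows in train_data
dplyr::filter(bsl$summarized_metrics, Measure == "All_rows")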

....................................................................

The Random Evaluations tibble contains:

  • The non-aggregated metrics.

  • A nested tibble with the predictions and targets.

  • A nested tibble with the coefficients of the baseline models.

  • Number of training rows used when fitting the baseline model on the training set.

  • A nested Process information object with information about the evaluation.

  • Name of dependent variable.

  • Name of fixed effect (bias term only).

  • Random effects structure (if specified).
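Continuing the hypothetical bsl object from above, the non-aggregated evaluations and their nested predictions could be inspected like this (the exact name of the nested predictions column, Predictions, is an assumption):

# One row per random subset evaluation
bsl$random_evaluations

# Targets and predictions from the first random evaluation
# (the column name `Predictions` is an assumption)
bsl$random_evaluations$Predictions[[1]]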

Details

Packages used:

Models

stats::lm, lme4::lmer

Results

r2m : MuMIn::r.squaredGLMM

r2c : MuMIn::r.squaredGLMM

AIC : stats::AIC

AICc : MuMIn::AICc

BIC : stats::BIC
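As a rough sketch of where these model metrics come from (not the package's internal code; it assumes the train_set object with score and session columns from the Examples section below):

# Sketch only - not the package's internal implementation
m <- lme4::lmer(score ~ 1 + (1 | session), data = train_set, REML = FALSE)

MuMIn::r.squaredGLMM(m)  # marginal (r2m) and conditional (r2c) R-squared
stats::AIC(m)            # AIC
MuMIn::AICc(m)           # small-sample corrected AIC
stats::BIC(m)            # BIC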

Examples

# Attach packages
library(cvms)
library(groupdata2) # partition()
library(dplyr) # %>% arrange()

# Data is part of cvms
data <- participant.scores

# Set seed for reproducibility
set.seed(1)

# Partition data
partitions <- partition(data, p = 0.7, list_out = TRUE)
train_set <- partitions[[1]]
test_set <- partitions[[2]]

# Create baseline evaluations
# Note: usually n=100 is a good setting
baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  random_effects = "(1|session)",
  n = 2
)

# Parallelize evaluations
# Attach doParallel and register four cores
# Uncomment:
# library(doParallel)
# registerDoParallel(4)

# Make sure to uncomment the parallel argument
baseline_gaussian(
  test_data = test_set,
  train_data = train_set,
  dependent_col = "score",
  random_effects = "(1|session)",
  n = 4
  #, parallel = TRUE # Uncomment
)

See Also

Other baseline functions: baseline(), baseline_binomial(), baseline_multinomial()

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

  • Maintainer: Ludvig Renbo Olsen
  • License: MIT + file LICENSE
  • Last published: 2025-03-07