Train linear or logistic regression models on a training set and validate them by predicting a test/validation set. Returns the results in a tibble for easy reporting, along with the trained models.
See validate_fn() for use with custom model functions.
train_data: data.frame.
Can contain a grouping factor for identifying partitions - as made with groupdata2::partition(). See partitions_col.
formulas: Model formulas as strings. (Character)
E.g. c("y~x", "y~z").
Can contain random effects.
E.g. c("y~x+(1|r)", "y~z+(1|r)").
family: Name of the family. (Character)
Currently supports ‘"gaussian"’ for linear regression with lm() / lme4::lmer()
and ‘"binomial"’ for binary classification with glm() / lme4::glmer().
See cross_validate_fn() for use with other model functions.
test_data: data.frame. If specifying partitions_col, this can be NULL.
partitions_col: Name of grouping factor for identifying partitions. (Character)
Rows with the value 1 in partitions_col are used as training set and rows with the value 2 are used as test set.
N.B. Only used if ‘test_data’ is ‘NULL’.
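The split rule above can be sketched in base R. The data frame `df` and its values are hypothetical, and this only illustrates the rule; it is not the package's internal code.

```r
# Hypothetical data with a partition column (1 = train, 2 = test)
df <- data.frame(
  score = c(10, 20, 30, 40, 50),
  .partitions = factor(c(1, 1, 1, 2, 2))
)

# Rows with the value 1 form the training set,
# rows with the value 2 form the test set
train_set <- df[as.character(df$.partitions) == "1", , drop = FALSE]
test_set <- df[as.character(df$.partitions) == "2", , drop = FALSE]

nrow(train_set)
nrow(test_set)
```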
control: Construct control structures for mixed model fitting (with lme4::lmer() or lme4::glmer()). See lme4::lmerControl and lme4::glmerControl.
N.B. Ignored if fitting lm() or glm() models.
REML: Restricted Maximum Likelihood. (Logical)
cutoff: Threshold for predicted classes. (Numeric)
N.B. Binomial models only.
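A minimal sketch of how a cutoff turns predicted probabilities into classes. The probabilities and class names are hypothetical, and the use of a strict `>` comparison is an assumption; how the package treats a probability exactly equal to the cutoff is not stated here.

```r
# Hypothetical predicted probabilities of the second class level
probabilities <- c(0.10, 0.45, 0.55, 0.80)
cutoff <- 0.5
class_levels <- c("cat", "dog")

# Assumption: probabilities above the cutoff map to the second level
predicted_class <- ifelse(
  probabilities > cutoff,
  class_levels[2],
  class_levels[1]
)
```

Raising the cutoff makes predictions of the second level rarer, which trades sensitivity for specificity.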
positive: Level from dependent variable to predict. Either as character (preferable) or level index (1 or 2 - alphabetically).
E.g. if we have the levels "cat" and "dog" and we want "dog" to be the positive class, we can either provide "dog" or 2, as alphabetically, "dog" comes after "cat".
Note: For reproducibility, it's preferable to specify the name directly, as different locales may sort the levels differently.
Used when calculating confusion matrix metrics and creating ROC curves.
The Process column in the output can be used to verify this setting.
N.B. Only affects evaluation metrics, not the model training or returned predictions.
N.B. Binomial models only.
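To see which level an index refers to, the levels of the dependent variable can be inspected directly. A small sketch with a hypothetical `target` factor:

```r
# Hypothetical dependent variable with two classes
target <- factor(c("dog", "cat", "dog", "cat"))

# Factor levels are sorted, so "cat" comes before "dog"
target_levels <- levels(target)

# A level index of 2 therefore refers to "dog"
positive <- 2
positive_name <- target_levels[positive]
```

Passing the name ("dog") instead of the index avoids relying on the sort order.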
metrics: list for enabling/disabling metrics.
E.g. list("RMSE" = FALSE) would remove RMSE from the results, and list("Accuracy" = TRUE) would add the regular Accuracy metric to the classification results. Default values (TRUE/FALSE) will be used for the remaining available metrics.
You can enable/disable all metrics at once by including "all" = TRUE/FALSE in the list. This is done prior to enabling/disabling individual metrics, which is why list("all" = FALSE, "RMSE" = TRUE) would return only the RMSE metric.
The list can be created with gaussian_metrics() or binomial_metrics().
Also accepts the string "all".
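The order of operations ("all" first, then individual metrics) can be sketched in base R. The helper `resolve_metrics()` and the `defaults` list are hypothetical and only illustrate the logic; they are not part of the package, and the real default set of metrics is larger.

```r
# Hypothetical defaults for a few metrics (not the package's full list)
defaults <- list("RMSE" = TRUE, "MAE" = TRUE, "AIC" = FALSE)

# Hypothetical helper: apply "all" first, then the individual settings
resolve_metrics <- function(user, defaults) {
  resolved <- defaults
  if (!is.null(user[["all"]])) {
    # Set every metric to the value of "all"
    resolved[] <- user[["all"]]
  }
  # Individual settings override whatever "all" did
  individual <- user[setdiff(names(user), "all")]
  resolved[names(individual)] <- individual
  resolved
}

resolved <- resolve_metrics(list("all" = FALSE, "RMSE" = TRUE), defaults)
enabled <- names(resolved)[unlist(resolved)]
```

Here `enabled` contains only "RMSE", matching the behaviour described above.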
preprocessing: Name of preprocessing to apply.
Available preprocessings are:
Name: Description
"standardize": Centers and scales the numeric predictors.
"range": Normalizes the numeric predictors to the 0 - 1 range. Values outside the min/max range in the test fold are truncated to 0/1.
"scale": Scales the numeric predictors to have a standard deviation of one.
"center": Centers the numeric predictors to have a mean of zero.
The preprocessing parameters (mean, SD, etc.) are extracted from the training folds and applied to both the training folds and the test fold. They are returned in the Preprocess column for inspection.
N.B. The preprocessings should not affect the results to a noticeable degree, although "range" might due to the truncation.
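A base R sketch of the idea: the parameters are estimated on the training data only and then applied to the test data. The vectors are hypothetical, and this illustrates the principle rather than the package's implementation.

```r
# Hypothetical numeric predictor in the training and test data
train_x <- c(2, 4, 6, 8)
test_x <- c(1, 5, 10)

# "standardize": mean and SD come from the training data only
train_mean <- mean(train_x)
train_sd <- sd(train_x)
test_standardized <- (test_x - train_mean) / train_sd

# "range": scale with the training min/max,
# then truncate test values that fall outside the 0 - 1 range
train_min <- min(train_x)
train_max <- max(train_x)
test_ranged <- (test_x - train_min) / (train_max - train_min)
test_ranged <- pmin(pmax(test_ranged, 0), 1)
```

The test values 1 and 10 lie outside the training range of 2 to 8, so they are truncated to 0 and 1. This truncation is why "range" may change the results.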
err_nc: Whether to raise an error if a model does not converge. (Logical)
rm_nc: Remove non-converged models from output. (Logical)
parallel: Whether to validate the list of models in parallel. (Logical)
Remember to register a parallel backend first. E.g. with doParallel::registerDoParallel.
verbose: Whether to message process information like the number of model instances to fit and which model function was applied. (Logical)
link, models, model_verbose: Deprecated.
Returns
tibble with the results and model objects.
Shared across families
A nested tibble with coefficients of the models from all iterations.
Count of convergence warnings. Consider discarding models that did not converge.
Count of other warnings. These are warnings without keywords such as "convergence".
Count of Singular Fit messages. See lme4::isSingular for more information.
Nested tibble with the warnings and messages caught for each model.
Specified family.
Nested model objects.
Name of dependent variable.
Names of fixed effects.
Names of random effects, if any.
Nested tibble with preprocessing parameters, if any.
See the additional metrics (disabled by default) at ?binomial_metrics.
Also includes:
A nested tibble with predictions, predicted classes (depends on cutoff), and the targets. Note that the predictions are not necessarily of the specified positive class, but of the model's positive class (second level of dependent variable, alphabetically).
The pROC::roc ‘ROC’ curve object(s).
A nested tibble with the confusion matrix/matrices. The Pos_ columns tell you whether a row is a True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN), depending on which level is the "positive" class, i.e. the level you wish to predict.
The name of the Positive Class.
Details
Packages used:
Models
Gaussian: stats::lm, lme4::lmer
Binomial: stats::glm, lme4::glmer
Results
Shared
AIC : stats::AIC
AICc : MuMIn::AICc
BIC : stats::BIC
Gaussian
r2m : MuMIn::r.squaredGLMM
r2c : MuMIn::r.squaredGLMM
Binomial
ROC and AUC: pROC::roc
Examples
# Attach packages
library(cvms)
library(groupdata2) # partition()
library(dplyr) # %>% arrange()

# Data is part of cvms
data <- participant.scores

# Set seed for reproducibility
set.seed(7)

# Partition data
# Keep as single data frame
# We could also have fed validate() separate train and test sets.
data_partitioned <- partition(
  data,
  p = 0.7,
  cat_col = "diagnosis",
  id_col = "participant",
  list_out = FALSE
) %>%
  arrange(.partitions)

# Validate a model

# Gaussian
validate(
  data_partitioned,
  formulas = "score~diagnosis",
  partitions_col = ".partitions",
  family = "gaussian",
  REML = FALSE
)

# Binomial
validate(
  data_partitioned,
  formulas = "diagnosis~score",
  partitions_col = ".partitions",
  family = "binomial"
)

## Feed separate train and test sets

# Partition data to list of data frames
# The first data frame will be train (70% of the data)
# The second will be test (30% of the data)
data_partitioned <- partition(
  data,
  p = 0.7,
  cat_col = "diagnosis",
  id_col = "participant",
  list_out = TRUE
)
train_data <- data_partitioned[[1]]
test_data <- data_partitioned[[2]]

# Validate a model

# Gaussian
validate(
  train_data,
  test_data = test_data,
  formulas = "score~diagnosis",
  family = "gaussian",
  REML = FALSE
)
See Also
Other validation functions: cross_validate(), cross_validate_fn(), validate_fn()