evaluate function

Evaluate your model's performance

Lifecycle: maturing

Evaluate your model's predictions on a set of evaluation metrics.

Create ID-aggregated evaluations by multiple methods.

Currently supports regression and classification (binary and multiclass). See the type argument.

evaluate(
  data,
  target_col,
  prediction_cols,
  type,
  id_col = NULL,
  id_method = "mean",
  apply_softmax = FALSE,
  cutoff = 0.5,
  positive = 2,
  metrics = list(),
  include_predictions = TRUE,
  parallel = FALSE,
  models = deprecated()
)

Arguments

  • data: data.frame with predictions, targets and (optionally) an ID column. Can be grouped with group_by.

    Multinomial

    When type is "multinomial", the predictions can be passed in one of two formats.

    Probabilities (Preferable)

    One column per class with the probability of that class. The columns should have the name of their class, as they are named in the target column. E.g.:

    class_1  class_2  class_3  target
      0.269    0.528    0.203  class_2
      0.368    0.322    0.310  class_3
      0.375    0.371    0.254  class_2
        ...      ...      ...  ...

    Classes

    A single column of type character with the predicted classes. E.g.:

    prediction  target
    class_2     class_2
    class_1     class_3
    class_1     class_2
    ...         ...

    Binomial

    When type is "binomial", the predictions can be passed in one of two formats.

    Probabilities (Preferable)

    One column with the probability that the observation is of the second class alphabetically (the probability of 1 if the classes are 0 and 1). A usage sketch follows this argument list. E.g.:

    prediction  target
    0.769       1
    0.368       1
    0.375       0
    ...         ...

    Note: The alphabetical ordering treats the class labels as type character, so e.g. "100" would come before "7".

    Classes

    A single column of type character with the predicted classes. E.g.:

    prediction  target
    class_0     class_1
    class_1     class_1
    class_1     class_0
    ...         ...

    Note: The prediction column will be converted to the probability 0.0 for the first class alphabetically and 1.0 for the second class alphabetically.

    Gaussian

    When type is "gaussian", the predictions should be passed as one column with the predicted values. E.g.:

    prediction  target
    28.9        30.2
    33.2        27.1
    23.4        21.3
    ...         ...
  • target_col: Name of the column with the true classes/values in data.

    When type is "multinomial", this column should contain the class names, not their indices.

  • prediction_cols: Name(s) of column(s) with the predictions.

    Columns can be either numeric or character depending on which format is chosen. See data for the possible formats.

  • type: Type of evaluation to perform:

    "gaussian" for regression (like linear regression).

    "binomial" for binary classification.

    "multinomial" for multiclass classification.

  • id_col: Name of ID column to aggregate predictions by.

    N.B. Current methods assume that the target class/value is constant within the IDs.

    N.B. When aggregating by ID, some metrics may be disabled.

  • id_method: Method to use when aggregating predictions by ID. Either "mean" or "majority". See the ID aggregation sketch after this argument list.

    When type is "gaussian", only the "mean" method is available.

    mean

    The average prediction (value or probability) is calculated per ID and evaluated. This method assumes that the target class/value is constant within the IDs.

    majority

    The most predicted class per ID is found and evaluated. In case of a tie, the winning classes share the probability (e.g. P = 0.5 each when there are two majority classes). This method assumes that the target class/value is constant within the IDs.

  • apply_softmax: Whether to apply the softmax function to the prediction columns when type is "multinomial".

    N.B. Multinomial models only.

  • cutoff: Threshold for predicted classes. (Numeric)

    N.B. Binomial models only.

  • positive: Level from dependent variable to predict. Either as character (preferable) or level index (1 or 2 - alphabetically).

    E.g. if we have the levels "cat" and "dog" and we want "dog" to be the positive class, we can either provide "dog" or 2, as alphabetically, "dog" comes after "cat".

    Note: For reproducibility, it's preferable to specify the name directly, as different locales may sort the levels differently.

    Used when calculating confusion matrix metrics and creating ROC curves.

    The Process column in the output can be used to verify this setting.

    N.B. Only affects the evaluation metrics. Does NOT affect what the probabilities are of (they always concern the second class alphabetically).

    N.B. Binomial models only.

  • metrics: list for enabling/disabling metrics.

    E.g. list("RMSE" = FALSE) would remove RMSE from the regression results, and list("Accuracy" = TRUE) would add the regular Accuracy metric to the classification results. Default values (TRUE/FALSE) will be used for the remaining available metrics.

    You can enable/disable all metrics at once by including "all" = TRUE/FALSE in the list. This is done prior to enabling/disabling individual metrics, so e.g. list("all" = FALSE, "RMSE" = TRUE) would return only the RMSE metric.

    The list can be created with gaussian_metrics(), binomial_metrics(), or multinomial_metrics().

    Also accepts the string "all".

  • include_predictions: Whether to include the predictions in the output as a nested tibble. (Logical)

  • parallel: Whether to run evaluations in parallel, when data is grouped with group_by.

  • models: Deprecated.
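
As a usage sketch of the binomial probability format above, with the cutoff, positive, and metrics arguments: the class names ("cat"/"dog") and probability values are made up for illustration.

library(cvms)
library(tibble)

# Hypothetical predicted probabilities of "dog",
# the second class alphabetically
binom_data <- tibble(
  "prediction" = c(0.9, 0.2, 0.7, 0.4),
  "target" = c("dog", "cat", "dog", "cat")
)

evaluate(
  data = binom_data,
  target_col = "target",
  prediction_cols = "prediction",
  type = "binomial",
  cutoff = 0.5,                     # threshold for the predicted classes
  positive = "dog",                 # name the positive class for reproducibility
  metrics = list("Accuracy" = TRUE) # enable a metric that is disabled by default
)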
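And a sketch of ID aggregation with the "majority" id_method; the IDs and predictions are made up, and the target is constant within each ID, as the methods require.

library(cvms)
library(tibble)

# Hypothetical class predictions with three observations per ID
id_data <- tibble(
  "id" = factor(rep(1:4, each = 3)),
  "prediction" = c(
    "cat", "dog", "cat", # ID 1 -> majority: "cat"
    "dog", "dog", "cat", # ID 2 -> majority: "dog"
    "cat", "cat", "cat", # ID 3 -> majority: "cat"
    "dog", "cat", "dog"  # ID 4 -> majority: "dog"
  ),
  "target" = rep(c("cat", "dog", "cat", "dog"), each = 3)
)

evaluate(
  data = id_data,
  target_col = "target",
  prediction_cols = "prediction",
  type = "binomial",
  id_col = "id",
  id_method = "majority" # ties would share the probability (P = 0.5 each)
)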

Returns


Gaussian Results


tibble containing the following metrics by default:

Average ‘RMSE’, ‘MAE’, ‘NRMSE(IQR)’, ‘RRSE’, ‘RAE’, and ‘RMSLE’.

See the additional metrics (disabled by default) at ?gaussian_metrics.

Also includes:

A nested tibble with the Predictions and targets.

A nested Process information object with information about the evaluation.
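
A minimal sketch of inspecting this output, assuming the metric and nested columns carry the names listed above (e.g. RMSE, Predictions):

library(cvms)

data <- participant.scores
model <- lm(age ~ diagnosis, data = data)
data[["predicted_age"]] <- predict(model, data)

res <- evaluate(
  data = data,
  target_col = "age",
  prediction_cols = "predicted_age",
  type = "gaussian"
)

res$RMSE             # the default metrics are regular columns
res$Predictions[[1]] # unnest the predictions and targets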


Binomial Results


tibble with the following evaluation metrics, based on a confusion matrix and a ROC curve fitted to the predictions:

Confusion Matrix:

‘Balanced Accuracy’, ‘Accuracy’, ‘F1’, ‘Sensitivity’, ‘Specificity’, ‘Positive Predictive Value’, ‘Negative Predictive Value’, ‘Kappa’, ‘Detection Rate’, ‘Detection Prevalence’, ‘Prevalence’, and ‘MCC’ (Matthews correlation coefficient).

ROC:

‘AUC’, ‘Lower CI’, and ‘Upper CI’.

Note that the ROC curve is only computed when AUC is enabled. See the metrics argument.

Also includes:

A nested tibble with the predictions and targets.

A list of ROC curve objects (if computed).

A nested tibble with the confusion matrix. The Pos_ columns tell you whether a row is a True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN), depending on which level is the "positive" class, i.e. the level you wish to predict.

A nested Process information object with information about the evaluation.
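
A sketch of extracting these nested parts, assuming the nested columns are named after the labels above (Confusion Matrix, ROC, Process):

library(cvms)

data <- participant.scores
model <- glm(diagnosis ~ score, data = data, family = "binomial")
data[["prob"]] <- predict(model, data, type = "response")

res_bin <- evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "prob",
  type = "binomial"
)

res_bin$`Confusion Matrix`[[1]] # long-format counts with the Pos_ columns
res_bin$ROC[[1]]                # pROC ROC curve object (when AUC is enabled)
res_bin$Process[[1]]            # e.g. for verifying the positive class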


Multinomial Results


For each class, a one-vs-all binomial evaluation is performed. This creates a Class Level Results tibble containing the same metrics as the binomial results described above (excluding Accuracy, MCC, AUC, Lower CI and Upper CI), along with a count of the class in the target column (‘Support’). These metrics are used to calculate the macro-averaged metrics. The nested class level results tibble is also included in the output tibble, and could be reported along with the macro and overall metrics.

The output tibble contains the macro and overall metrics. The metrics that share their name with the metrics in the nested class level results tibble are averages of those metrics (note: NAs are not removed before averaging). In addition to these, it also includes the ‘Overall Accuracy’ and the multiclass ‘MCC’.

Note: ‘Balanced Accuracy’ is the macro-averaged metric, not the macro sensitivity as sometimes used!

Other available metrics (disabled by default, see metrics): ‘Accuracy’, multiclass ‘AUC’, ‘Weighted Balanced Accuracy’, ‘Weighted Accuracy’, ‘Weighted F1’, ‘Weighted Sensitivity’, ‘Weighted Specificity’, ‘Weighted Pos Pred Value’, ‘Weighted Neg Pred Value’, ‘Weighted Kappa’, ‘Weighted Detection Rate’, ‘Weighted Detection Prevalence’, and ‘Weighted Prevalence’.

Note that the "Weighted" average metrics are weighted by the Support.

When you have a large set of classes, consider keeping AUC disabled.
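
For instance, the multiclass AUC could be enabled through the metrics list (a sketch; the data is simulated as in the Examples below):

library(cvms)

data_mc <- multiclass_probability_tibble(
  num_classes = 3,
  num_observations = 45,
  apply_softmax = TRUE,
  class_name = "class_",
  add_targets = TRUE
)

evaluate(
  data = data_mc,
  target_col = "Target",
  prediction_cols = paste0("class_", 1:3),
  type = "multinomial",
  metrics = list("AUC" = TRUE) # disabled by default; consider leaving it
                               # off when there are many classes
)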

Also includes:

A nested tibble with the Predictions and targets.

A list of ROC curve objects when AUC is enabled.

A nested tibble with the multiclass Confusion Matrix.

A nested Process information object with information about the evaluation.

Class Level Results

Besides the binomial evaluation metrics and the Support, the nested class level results tibble also contains a nested tibble with the Confusion Matrix from the one-vs-all evaluation. The Pos_ columns tell you whether a row is a True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN), depending on which level is the "positive" class. In our case, 1 is the current class and 0 represents all the other classes together.
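
A sketch of unnesting these class level results, with simulated data as in the Examples below and column names assumed to match the labels above:

library(cvms)

data_mc <- multiclass_probability_tibble(
  num_classes = 3,
  num_observations = 45,
  apply_softmax = TRUE,
  class_name = "class_",
  add_targets = TRUE
)

res_mc <- evaluate(
  data = data_mc,
  target_col = "Target",
  prediction_cols = paste0("class_", 1:3),
  type = "multinomial"
)

# One row per class with the one-vs-all metrics and Support
class_level <- res_mc$`Class Level Results`[[1]]

# The one-vs-all confusion matrix for the first class, where 1 is
# the class itself and 0 is all the other classes together
class_level$`Confusion Matrix`[[1]]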

Details

Packages used:

Binomial and Multinomial:

ROC and AUC:

  • Binomial: pROC::roc
  • Multinomial: pROC::multiclass.roc

Examples

# Attach packages
library(cvms)
library(dplyr)

# Load data
data <- participant.scores

# Fit models
gaussian_model <- lm(age ~ diagnosis, data = data)
binomial_model <- glm(diagnosis ~ score, data = data, family = "binomial")

# Add predictions
data[["gaussian_predictions"]] <- predict(gaussian_model, data,
  type = "response",
  allow.new.levels = TRUE
)
data[["binomial_predictions"]] <- predict(binomial_model, data,
  type = "response",
  allow.new.levels = TRUE
)

# Gaussian evaluation
evaluate(
  data = data,
  target_col = "age",
  prediction_cols = "gaussian_predictions",
  type = "gaussian"
)

# Binomial evaluation
evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "binomial_predictions",
  type = "binomial"
)

#
# Multinomial
#

# Create a tibble with predicted probabilities and targets
data_mc <- multiclass_probability_tibble(
  num_classes = 3,
  num_observations = 45,
  apply_softmax = TRUE,
  FUN = runif,
  class_name = "class_",
  add_targets = TRUE
)

class_names <- paste0("class_", 1:3)

# Multinomial evaluation
evaluate(
  data = data_mc,
  target_col = "Target",
  prediction_cols = class_names,
  type = "multinomial"
)

#
# ID evaluation
#

# Gaussian ID evaluation
# Note that 'age' is the same for all observations
# of a participant
evaluate(
  data = data,
  target_col = "age",
  prediction_cols = "gaussian_predictions",
  id_col = "participant",
  type = "gaussian"
)

# Binomial ID evaluation
evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "binomial_predictions",
  id_col = "participant",
  id_method = "mean", # alternatively: "majority"
  type = "binomial"
)

# Multinomial ID evaluation

# Add IDs and new targets (must be constant within IDs)
data_mc[["Target"]] <- NULL
data_mc[["ID"]] <- rep(1:9, each = 5)
id_classes <- tibble::tibble(
  "ID" = 1:9,
  "Target" = sample(x = class_names, size = 9, replace = TRUE)
)
data_mc <- data_mc %>%
  dplyr::left_join(id_classes, by = "ID")

# Perform ID evaluation
evaluate(
  data = data_mc,
  target_col = "Target",
  prediction_cols = class_names,
  id_col = "ID",
  id_method = "mean", # alternatively: "majority"
  type = "multinomial"
)

#
# Training and evaluating a multinomial model with nnet
#

# Only run if `nnet` is installed
if (requireNamespace("nnet", quietly = TRUE)) {

  # Create a data frame with some predictors and a target column
  class_names <- paste0("class_", 1:4)
  data_for_nnet <- multiclass_probability_tibble(
    num_classes = 3, # Here, number of predictors
    num_observations = 30,
    apply_softmax = FALSE,
    FUN = rnorm,
    class_name = "predictor_"
  ) %>%
    dplyr::mutate(Target = sample(
      class_names,
      size = 30,
      replace = TRUE
    ))

  # Train multinomial model using the nnet package
  mn_model <- nnet::multinom(
    "Target ~ predictor_1 + predictor_2 + predictor_3",
    data = data_for_nnet
  )

  # Predict the targets in the dataset
  # (we would usually use a test set instead)
  predictions <- predict(
    mn_model,
    data_for_nnet,
    type = "probs"
  ) %>%
    dplyr::as_tibble()

  # Add the targets
  predictions[["Target"]] <- data_for_nnet[["Target"]]

  # Evaluate predictions
  evaluate(
    data = predictions,
    target_col = "Target",
    prediction_cols = class_names,
    type = "multinomial"
  )
}

See Also

Other evaluation functions: binomial_metrics(), confusion_matrix(), evaluate_residuals(), gaussian_metrics(), multinomial_metrics()

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

  • Maintainer: Ludvig Renbo Olsen
  • License: MIT + file LICENSE
  • Last published: 2025-03-07