evaluate function

Evaluate your model's performance

Lifecycle: maturing

Evaluate your model's predictions on a set of evaluation metrics.

Create ID-aggregated evaluations by multiple methods.

Currently supports regression and classification (binary and multiclass). See the type argument.

evaluate(
  data,
  target_col,
  prediction_cols,
  type,
  id_col = NULL,
  id_method = "mean",
  apply_softmax = FALSE,
  cutoff = 0.5,
  positive = 2,
  metrics = list(),
  include_predictions = TRUE,
  parallel = FALSE,
  models = deprecated()
)

Arguments

  • data: data.frame with predictions, targets and (optionally) an ID column. Can be grouped with group_by.

    Multinomial

    When type is "multinomial", the predictions can be passed in one of two formats.

    Probabilities (Preferable)

    One column per class with the probability of that class. The columns should have the name of their class, as they are named in the target column. E.g.:

    class_1  class_2  class_3  target
      0.269    0.528    0.203  class_2
      0.368    0.322    0.310  class_3
      0.375    0.371    0.254  class_2
        ...      ...      ...  ...

    Classes

    A single column of type character with the predicted classes. E.g.:

    prediction  target
    class_2     class_2
    class_1     class_3
    class_1     class_2
    ...         ...

    Binomial

    When type is "binomial", the predictions can be passed in one of two formats.

    Probabilities (Preferable)

    One column with the probability that the observation is of the second class alphabetically (the probability of 1 if the classes are 0 and 1). A usage sketch follows this argument list. E.g.:

    prediction  target
    0.769       1
    0.368       1
    0.375       0
    ...         ...

    Note: The alphabetical ordering treats the class labels as type character, so e.g. "100" would come before "7".

    Classes

    A single column of type character with the predicted classes. E.g.:

    prediction  target
    class_0     class_1
    class_1     class_1
    class_1     class_0
    ...         ...

    Note: The prediction column will be converted to the probability 0.0 for the first class alphabetically and 1.0 for the second class alphabetically.

    Gaussian

    When type is "gaussian", the predictions should be passed as one column with the predicted values. E.g.:

    prediction  target
    28.9        30.2
    33.2        27.1
    23.4        21.3
    ...         ...
  • target_col: Name of the column with the true classes/values in data.

    When type is "multinomial", this column should contain the class names, not their indices.

  • prediction_cols: Name(s) of column(s) with the predictions.

    Columns can be either numeric or character depending on which format is chosen. See data for the possible formats.

  • type: Type of evaluation to perform:

    "gaussian" for regression (like linear regression).

    "binomial" for binary classification.

    "multinomial" for multiclass classification.

  • id_col: Name of ID column to aggregate predictions by.

    N.B. Current methods assume that the target class/value is constant within the IDs.

    N.B. When aggregating by ID, some metrics may be disabled.

  • id_method: Method to use when aggregating predictions by ID. Either "mean" or "majority". See the ID aggregation sketch after this argument list.

    When type is "gaussian", only the "mean" method is available.

    mean

    The average prediction (value or probability) is calculated per ID and evaluated. This method assumes that the target class/value is constant within the IDs.

    majority

    The most predicted class per ID is found and evaluated. In case of a tie, the winning classes share the probability (e.g. P = 0.5 each when there are two majority classes). This method assumes that the target class/value is constant within the IDs.

  • apply_softmax: Whether to apply the softmax function to the prediction columns when type is "multinomial".

    N.B. Multinomial models only.

  • cutoff: Threshold for predicted classes. (Numeric)

    N.B. Binomial models only.

  • positive: Level from dependent variable to predict. Either as character (preferable) or level index (1 or 2 - alphabetically).

    E.g. if we have the levels "cat" and "dog" and we want "dog" to be the positive class, we can either provide "dog" or 2, as alphabetically, "dog" comes after "cat".

    Note: For reproducibility, it's preferable to specify the name directly, as different locales may sort the levels differently.

    Used when calculating confusion matrix metrics and creating ROC curves.

    The Process column in the output can be used to verify this setting.

    N.B. Only affects the evaluation metrics. Does NOT affect what the probabilities are of (they always concern the second class alphabetically).

    N.B. Binomial models only.

  • metrics: list for enabling/disabling metrics.

    E.g. list("RMSE" = FALSE) would remove RMSE from the regression results, and list("Accuracy" = TRUE) would add the regular Accuracy metric to the classification results. Default values (TRUE/FALSE) will be used for the remaining available metrics.

    You can enable/disable all metrics at once by including "all" = TRUE/FALSE in the list. This is done prior to enabling/disabling individual metrics, so e.g. list("all" = FALSE, "RMSE" = TRUE) would return only the RMSE metric.

    The list can be created with gaussian_metrics(), binomial_metrics(), or multinomial_metrics().

    Also accepts the string "all".

  • include_predictions: Whether to include the predictions in the output as a nested tibble. (Logical)

  • parallel: Whether to run evaluations in parallel, when data is grouped with group_by.

  • models: Deprecated.
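
As a usage sketch of the binomial probability format above, with the cutoff, positive, and metrics arguments: the class names ("cat"/"dog") and probability values are made up for illustration.

library(cvms)
library(tibble)

# Hypothetical predicted probabilities of "dog",
# the second class alphabetically
binom_data <- tibble(
  "prediction" = c(0.9, 0.2, 0.7, 0.4),
  "target" = c("dog", "cat", "dog", "cat")
)

evaluate(
  data = binom_data,
  target_col = "target",
  prediction_cols = "prediction",
  type = "binomial",
  cutoff = 0.5,                     # threshold for the predicted classes
  positive = "dog",                 # name the positive class for reproducibility
  metrics = list("Accuracy" = TRUE) # enable a metric that is disabled by default
)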
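And a sketch of ID aggregation with the "majority" id_method; the IDs and predictions are made up, and the target is constant within each ID, as the methods require.

library(cvms)
library(tibble)

# Hypothetical class predictions with three observations per ID
id_data <- tibble(
  "id" = factor(rep(1:4, each = 3)),
  "prediction" = c(
    "cat", "dog", "cat", # ID 1 -> majority: "cat"
    "dog", "dog", "cat", # ID 2 -> majority: "dog"
    "cat", "cat", "cat", # ID 3 -> majority: "cat"
    "dog", "cat", "dog"  # ID 4 -> majority: "dog"
  ),
  "target" = rep(c("cat", "dog", "cat", "dog"), each = 3)
)

evaluate(
  data = id_data,
  target_col = "target",
  prediction_cols = "prediction",
  type = "binomial",
  id_col = "id",
  id_method = "majority" # ties would share the probability (P = 0.5 each)
)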

Returns


Gaussian Results


tibble containing the following metrics by default:

Average ‘RMSE’, ‘MAE’, ‘NRMSE(IQR)’, ‘RRSE’, ‘RAE’, and ‘RMSLE’.

See the additional metrics (disabled by default) at ?gaussian_metrics.

Also includes:

A nested tibble with the Predictions and targets.

A nested Process information object with information about the evaluation.
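
A minimal sketch of inspecting this output, assuming the metric and nested columns carry the names listed above (e.g. RMSE, Predictions):

library(cvms)

data <- participant.scores
model <- lm(age ~ diagnosis, data = data)
data[["predicted_age"]] <- predict(model, data)

res <- evaluate(
  data = data,
  target_col = "age",
  prediction_cols = "predicted_age",
  type = "gaussian"
)

res$RMSE             # the default metrics are regular columns
res$Predictions[[1]] # unnest the predictions and targets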


Binomial Results


tibble with the following evaluation metrics, based on a confusion matrix and a ROC curve fitted to the predictions:

Confusion Matrix:

‘Balanced Accuracy’, ‘Accuracy’, ‘F1’, ‘Sensitivity’, ‘Specificity’, ‘Positive Predictive Value’, ‘Negative Predictive Value’, ‘Kappa’, ‘Detection Rate’, ‘Detection Prevalence’, ‘Prevalence’, and ‘MCC’ (Matthews correlation coefficient).

ROC:

‘AUC’, ‘Lower CI’, and ‘Upper CI’.

Note that the ROC curve is only computed when AUC is enabled. See the metrics argument.

Also includes:

A nested tibble with the predictions and targets.

A list of ROC curve objects (if computed).

A nested tibble with the confusion matrix. The Pos_ columns tell you whether a row is a True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN), depending on which level is the "positive" class, i.e. the level you wish to predict.

A nested Process information object with information about the evaluation.
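
A sketch of extracting these nested parts, assuming the nested columns are named after the labels above (Confusion Matrix, ROC, Process):

library(cvms)

data <- participant.scores
model <- glm(diagnosis ~ score, data = data, family = "binomial")
data[["prob"]] <- predict(model, data, type = "response")

res_bin <- evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "prob",
  type = "binomial"
)

res_bin$`Confusion Matrix`[[1]] # long-format counts with the Pos_ columns
res_bin$ROC[[1]]                # pROC ROC curve object (when AUC is enabled)
res_bin$Process[[1]]            # e.g. for verifying the positive class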


Multinomial Results


For each class, a one-vs-all binomial evaluation is performed. This creates a Class Level Results tibble containing the same metrics as the binomial results described above (excluding Accuracy, MCC, AUC, Lower CI and Upper CI), along with a count of the class in the target column (‘Support’). These metrics are used to calculate the macro-averaged metrics. The nested class level results tibble is also included in the output tibble, and could be reported along with the macro and overall metrics.

The output tibble contains the macro and overall metrics. The metrics that share their name with the metrics in the nested class level results tibble are averages of those metrics (note: NAs are not removed before averaging). In addition to these, it also includes the ‘Overall Accuracy’ and the multiclass ‘MCC’.

Note: ‘Balanced Accuracy’ is the macro-averaged metric, not the macro sensitivity as sometimes used!

Other available metrics (disabled by default, see metrics): ‘Accuracy’, multiclass ‘AUC’, ‘Weighted Balanced Accuracy’, ‘Weighted Accuracy’, ‘Weighted F1’, ‘Weighted Sensitivity’, ‘Weighted Specificity’, ‘Weighted Pos Pred Value’, ‘Weighted Neg Pred Value’, ‘Weighted Kappa’, ‘Weighted Detection Rate’, ‘Weighted Detection Prevalence’, and ‘Weighted Prevalence’.

Note that the "Weighted" average metrics are weighted by the Support.

When you have a large set of classes, consider keeping AUC disabled.
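
For instance, the multiclass AUC could be enabled through the metrics list (a sketch; the data is simulated as in the Examples below):

library(cvms)

data_mc <- multiclass_probability_tibble(
  num_classes = 3,
  num_observations = 45,
  apply_softmax = TRUE,
  class_name = "class_",
  add_targets = TRUE
)

evaluate(
  data = data_mc,
  target_col = "Target",
  prediction_cols = paste0("class_", 1:3),
  type = "multinomial",
  metrics = list("AUC" = TRUE) # disabled by default; consider leaving it
                               # off when there are many classes
)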

Also includes:

A nested tibble with the Predictions and targets.

A list of ROC curve objects when AUC is enabled.

A nested tibble with the multiclass Confusion Matrix.

A nested Process information object with information about the evaluation.

Class Level Results

Besides the binomial evaluation metrics and the Support, the nested class level results tibble also contains a nested tibble with the Confusion Matrix from the one-vs-all evaluation. The Pos_ columns tell you whether a row is a True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN), depending on which level is the "positive" class. In our case, 1 is the current class and 0 represents all the other classes together.
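
A sketch of unnesting these class level results, with simulated data as in the Examples below and column names assumed to match the labels above:

library(cvms)

data_mc <- multiclass_probability_tibble(
  num_classes = 3,
  num_observations = 45,
  apply_softmax = TRUE,
  class_name = "class_",
  add_targets = TRUE
)

res_mc <- evaluate(
  data = data_mc,
  target_col = "Target",
  prediction_cols = paste0("class_", 1:3),
  type = "multinomial"
)

# One row per class with the one-vs-all metrics and Support
class_level <- res_mc$`Class Level Results`[[1]]

# The one-vs-all confusion matrix for the first class, where 1 is
# the class itself and 0 is all the other classes together
class_level$`Confusion Matrix`[[1]]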

Details

Packages used:

Binomial and Multinomial:

ROC and AUC:

  • Binomial: pROC::roc
  • Multinomial: pROC::multiclass.roc

Examples

# Attach packages
library(cvms)
library(dplyr)

# Load data
data <- participant.scores

# Fit models
gaussian_model <- lm(age ~ diagnosis, data = data)
binomial_model <- glm(diagnosis ~ score, data = data, family = "binomial")

# Add predictions
data[["gaussian_predictions"]] <- predict(gaussian_model, data,
  type = "response",
  allow.new.levels = TRUE
)
data[["binomial_predictions"]] <- predict(binomial_model, data,
  type = "response",
  allow.new.levels = TRUE
)

# Gaussian evaluation
evaluate(
  data = data,
  target_col = "age",
  prediction_cols = "gaussian_predictions",
  type = "gaussian"
)

# Binomial evaluation
evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "binomial_predictions",
  type = "binomial"
)

#
# Multinomial
#

# Create a tibble with predicted probabilities and targets
data_mc <- multiclass_probability_tibble(
  num_classes = 3,
  num_observations = 45,
  apply_softmax = TRUE,
  FUN = runif,
  class_name = "class_",
  add_targets = TRUE
)

class_names <- paste0("class_", 1:3)

# Multinomial evaluation
evaluate(
  data = data_mc,
  target_col = "Target",
  prediction_cols = class_names,
  type = "multinomial"
)

#
# ID evaluation
#

# Gaussian ID evaluation
# Note that 'age' is the same for all observations
# of a participant
evaluate(
  data = data,
  target_col = "age",
  prediction_cols = "gaussian_predictions",
  id_col = "participant",
  type = "gaussian"
)

# Binomial ID evaluation
evaluate(
  data = data,
  target_col = "diagnosis",
  prediction_cols = "binomial_predictions",
  id_col = "participant",
  id_method = "mean", # alternatively: "majority"
  type = "binomial"
)

# Multinomial ID evaluation

# Add IDs and new targets (must be constant within IDs)
data_mc[["Target"]] <- NULL
data_mc[["ID"]] <- rep(1:9, each = 5)
id_classes <- tibble::tibble(
  "ID" = 1:9,
  "Target" = sample(x = class_names, size = 9, replace = TRUE)
)
data_mc <- data_mc %>%
  dplyr::left_join(id_classes, by = "ID")

# Perform ID evaluation
evaluate(
  data = data_mc,
  target_col = "Target",
  prediction_cols = class_names,
  id_col = "ID",
  id_method = "mean", # alternatively: "majority"
  type = "multinomial"
)

#
# Training and evaluating a multinomial model with nnet
#

# Only run if `nnet` is installed
if (requireNamespace("nnet", quietly = TRUE)) {

  # Create a data frame with some predictors and a target column
  class_names <- paste0("class_", 1:4)
  data_for_nnet <- multiclass_probability_tibble(
    num_classes = 3, # Here, number of predictors
    num_observations = 30,
    apply_softmax = FALSE,
    FUN = rnorm,
    class_name = "predictor_"
  ) %>%
    dplyr::mutate(Target = sample(
      class_names,
      size = 30,
      replace = TRUE
    ))

  # Train multinomial model using the nnet package
  mn_model <- nnet::multinom(
    "Target ~ predictor_1 + predictor_2 + predictor_3",
    data = data_for_nnet
  )

  # Predict the targets in the dataset
  # (we would usually use a test set instead)
  predictions <- predict(
    mn_model,
    data_for_nnet,
    type = "probs"
  ) %>%
    dplyr::as_tibble()

  # Add the targets
  predictions[["Target"]] <- data_for_nnet[["Target"]]

  # Evaluate predictions
  evaluate(
    data = predictions,
    target_col = "Target",
    prediction_cols = class_names,
    type = "multinomial"
  )
}

See Also

Other evaluation functions: binomial_metrics(), confusion_matrix(), evaluate_residuals(), gaussian_metrics(), multinomial_metrics()

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

  • Maintainer: Ludvig Renbo Olsen
  • License: MIT + file LICENSE
  • Last published: 2025-03-07