plot_probabilities_ecdf() R function from [cvms]

Plot ECDF for the predicted probabilities

Lifecycle: experimental

Plots the empirical cumulative distribution function (ECDF) for the probabilities of either the target classes or the predicted classes.

Creates a ggplot2 object using ggplot2::stat_ecdf().
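To make the ECDF concrete, here is a minimal base R sketch. It uses stats::ecdf() on made-up probabilities and does not call cvms, so it only illustrates what the plotted curve represents.

```r
# Toy probabilities (made up for illustration)
probs <- c(0.2, 0.3, 0.7, 0.8, 0.8)

# stats::ecdf() returns a step function F, where
# F(x) is the proportion of values less than or equal to x
ecdf_fn <- stats::ecdf(probs)

# Proportion of probabilities at or below 0.5
at_half <- ecdf_fn(0.5)

# Proportion of probabilities at or below the largest value
at_max <- ecdf_fn(0.8)

print(c(at_half = at_half, at_max = at_max))
```

When plotting the probabilities of the target classes, a curve that stays low until close to 1 indicates that most rows assign a high probability to the correct class.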


plot_probabilities_ecdf(
  data,
  target_col,
  probability_cols,
  predicted_class_col = NULL,
  obs_id_col = NULL,
  group_col = NULL,
  probability_of = "target",
  positive = 2,
  theme_fn = ggplot2::theme_minimal,
  color_scale = ggplot2::scale_colour_brewer(palette = "Dark2"),
  apply_facet = length(probability_cols) > 1,
  add_caption = TRUE,
  ecdf_settings = list(),
  facet_settings = list(),
  xlim = c(0, 1)
)

Arguments

data: data.frame with probabilities, target classes and, optionally, predicted classes. Can also include observation identifiers and a grouping variable.

Example for binary classification:

Classifier  Observation  Probability  Target  Prediction
SVM         1            0.3          cl_1    cl_1
SVM         2            0.7          cl_1    cl_2
NB          1            0.2          cl_2    cl_1
NB          2            0.8          cl_2    cl_2
...         ...          ...          ...     ...

Example for multiclass classification:

Classifier  Observation  cl_1  cl_2  cl_3  Target  Prediction
SVM         1            0.2   0.1   0.7   cl_1    cl_3
SVM         2            0.3   0.5   0.2   cl_1    cl_2
NB          1            0.8   0.1   0.1   cl_2    cl_1
NB          2            0.1   0.6   0.3   cl_3    cl_2
...         ...          ...   ...   ...   ...     ...

As created with the various validation functions in cvms, like cross_validate_fn().
target_col: Name of column with target levels.
probability_cols: Name of columns with predicted probabilities.

For binary classification, this should be one column with the probability of the second class (alphabetically).

For multiclass classification, this should be one column per class. These probabilities must sum to 1 row-wise.
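The two requirements above can be checked before plotting. The following is a sketch in base R on toy data mirroring the example tables. The tolerance of 1e-8 and the variable names are assumptions for illustration, not what cvms uses internally.

```r
# Multiclass: one probability column per class (toy data)
multiclass <- data.frame(
  cl_1 = c(0.2, 0.3, 0.8, 0.1),
  cl_2 = c(0.1, 0.5, 0.1, 0.6),
  cl_3 = c(0.7, 0.2, 0.1, 0.3),
  Target = c("cl_1", "cl_1", "cl_2", "cl_3"),
  stringsAsFactors = FALSE
)
probability_cols <- c("cl_1", "cl_2", "cl_3")

# Check that the probabilities sum to 1 in every row
# (compared with a small tolerance to allow for floating point error)
row_sums <- rowSums(multiclass[, probability_cols])
sums_to_one <- all(abs(row_sums - 1) < 1e-8)

# Binary: a single column with the probability of the second class
binary <- data.frame(
  Probability = c(0.3, 0.7, 0.2, 0.8),
  Target = c("cl_1", "cl_1", "cl_2", "cl_2"),
  stringsAsFactors = FALSE
)

# The second class alphabetically is the one 'Probability' refers to
second_class <- sort(unique(binary$Target))[2]

# The probability of the first class is the complement
prob_first_class <- 1 - binary$Probability

print(sums_to_one)
print(second_class)
```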
predicted_class_col: Name of column with predicted classes.

This is required when probability_of = "prediction".
obs_id_col: Name of column with observation identifiers for averaging the predicted probabilities per observation before computing the ECDF (when deemed meaningful). When NULL, each row is an observation.
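As an illustration of averaging per observation, the sketch below uses stats::aggregate() on toy data where each observation was predicted twice (e.g. in two repetitions of cross-validation). How cvms performs the averaging internally is not documented here, so treat this as an assumption about the general idea only.

```r
# Toy predictions: each observation appears twice
preds <- data.frame(
  Classifier = c("SVM", "SVM", "SVM", "SVM"),
  Observation = c(1, 1, 2, 2),
  Probability = c(0.2, 0.4, 0.7, 0.9),
  stringsAsFactors = FALSE
)

# Average the probabilities per classifier and observation,
# leaving one row per observation for the ECDF
avg <- aggregate(
  Probability ~ Classifier + Observation,
  data = preds,
  FUN = mean
)

print(avg)
```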
group_col: Name of column with groups. The plot elements are split by these groups and can be identified by their color.

E.g. the classifier responsible for the prediction.

N.B. With more than 8 groups, the default color_scale might run out of colors.
probability_of: Whether to plot the ECDF for the probabilities of the target classes ("target") or the predicted classes ("prediction").

For each row, we extract the probability of either the target class or the predicted class. Both are useful to plot, as they show the behavior of the classifier in a way a confusion matrix doesn't. One classifier might be very certain in its predictions (whether wrong or right), whereas another might be less certain.
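The per-row extraction described above can be sketched in base R with matrix indexing. The helper pick_probability() is hypothetical and written for this example. It is not part of cvms, and the data are the toy values from the multiclass table.

```r
# Toy multiclass predictions
preds <- data.frame(
  cl_1 = c(0.2, 0.3, 0.8, 0.1),
  cl_2 = c(0.1, 0.5, 0.1, 0.6),
  cl_3 = c(0.7, 0.2, 0.1, 0.3),
  Target = c("cl_1", "cl_1", "cl_2", "cl_3"),
  Prediction = c("cl_3", "cl_2", "cl_1", "cl_2"),
  stringsAsFactors = FALSE
)
probability_cols <- c("cl_1", "cl_2", "cl_3")
prob_matrix <- as.matrix(preds[, probability_cols])

# Hypothetical helper: for each row, pick the probability
# in the column named by the corresponding element of 'classes'
pick_probability <- function(prob_matrix, classes) {
  col_idx <- match(classes, colnames(prob_matrix))
  prob_matrix[cbind(seq_len(nrow(prob_matrix)), col_idx)]
}

# Corresponds to probability_of = "target"
prob_of_target <- pick_probability(prob_matrix, preds$Target)

# Corresponds to probability_of = "prediction"
prob_of_prediction <- pick_probability(prob_matrix, preds$Prediction)

print(prob_of_target)
print(prob_of_prediction)
```

In this toy data, the probabilities of the predicted classes are all at least 0.5, while the probabilities of the target classes are low, i.e. the predictions are fairly certain but mostly wrong.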
positive: TODO
theme_fn: The ggplot2 theme function to apply.
color_scale: ggplot2 color scale object for adding discrete colors to the plot.

E.g. the output of ggplot2::scale_colour_brewer() or ggplot2::scale_colour_viridis_d().

N.B. The number of colors in the object's palette should be at least the same as the number of groups in the group_col column.
apply_facet: Whether to use ggplot2::facet_wrap(). (Logical)

By default, faceting is applied when there is more than one probability column (multiclass).
add_caption: Whether to add a caption explaining the plot. This is dynamically generated and intended as a starting point. (Logical)

You can overwrite the text with ggplot2::labs(caption = "...").
ecdf_settings: Named list of arguments for ggplot2::stat_ecdf().

The mapping argument is set separately.

Any argument not in the list will use the default value set by cvms.

Defaults: list(geom = "smooth", pad = FALSE).

Common changes are to set geom = "step" and/or pad = TRUE.
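The rule that unspecified arguments keep their defaults can be pictured as a list merge. The sketch below uses utils::modifyList(). Whether cvms merges the lists this way internally is an assumption, and the variable names are made up for the example.

```r
# Defaults as documented above
default_settings <- list(geom = "smooth", pad = FALSE)

# The user only overrides 'geom'
user_settings <- list(geom = "step")

# Elements in 'user_settings' replace those in 'default_settings',
# all other defaults are kept
merged <- utils::modifyList(default_settings, user_settings)

str(merged)
```

Here, 'geom' becomes "step" while 'pad' keeps its default FALSE.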
facet_settings: Named list of arguments for ggplot2::facet_wrap().

The facets argument is set separately.

Any argument not in the list will use its default value.

Commonly set arguments are nrow and ncol.
xlim: Limits for the x-scale.



Returns

A ggplot2 object with a faceted line plot. TODO

Details

TODO

Examples


# Attach cvms
library(cvms)
library(ggplot2)
library(dplyr)

#
# Multiclass
#

# TODO: Go through and rewrite comments and code!

# Plot probabilities of target classes
# From repeated cross-validation of three classifiers

# plot_probabilities_ecdf(
#   data = predicted.musicians,
#   target_col = "Target",
#   probability_cols = c("A", "B", "C", "D"),
#   predicted_class_col = "Predicted Class",
#   group_col = "Classifier",
#   probability_of = "target"
# )

# Plot probabilities of predicted classes
# From repeated cross-validation of three classifiers

# plot_probabilities_ecdf(
#   data = predicted.musicians,
#   target_col = "Target",
#   probability_cols = c("A", "B", "C", "D"),
#   predicted_class_col = "Predicted Class",
#   group_col = "Classifier",
#   probability_of = "prediction"
# )

#
# Binary
#

# Filter the predicted.musicians dataset
# binom_data <- predicted.musicians %>%
#   dplyr::filter(
#     Target %in% c("A", "B")
#   ) %>%
#   # "B" is the second class alphabetically
#   dplyr::rename(Probability = B) %>%
#   dplyr::mutate(`Predicted Class` = ifelse(
#     Probability > 0.5, "B", "A")) %>%
#   dplyr::select(-dplyr::all_of(c("A","C","D")))

# Plot probabilities of target classes
# From repeated cross-validation of three classifiers

# plot_probabilities_ecdf(
#   data = binom_data,
#   target_col = "Target",
#   probability_cols = "Probability",
#   predicted_class_col = "Predicted Class",
#   group_col = "Classifier",
#   probability_of = "target"
# )

# Plot probabilities of predicted classes
# In binary classification, these are at least 0.5, so we limit the x-axis

# plot_probabilities_ecdf(
#   data = binom_data,
#   target_col = "Target",
#   probability_cols = "Probability",
#   predicted_class_col = "Predicted Class",
#   group_col = "Classifier",
#   probability_of = "prediction",
#   xlim = c(0.5, 1)
# )

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

Maintainer: Ludvig Renbo Olsen
License: MIT + file LICENSE
Last published: 2025-03-07
