series: A list of series, a numeric matrix or a data frame. Matrices and data frames are coerced to a list row-wise (see tslist()).
types: Clustering types. It must be any combination of (possibly abbreviated): "partitional", "hierarchical", "fuzzy", "tadpole".
configs: The list of data frames with the desired configurations to run. See pdc_configs() and compare_clusterings_configs().
seed: Seed for random reproducibility.
trace: Logical indicating that more output should be printed to screen.
...: Further arguments for tsclust(), score.clus, or pick.clus.
score.clus: A function that gets the list of results (and ...) and scores each one. It may also be a named list of functions, one for each type of clustering. See Scoring section.
pick.clus: A function to pick the best result. See Picking section.
shuffle.configs: Randomly shuffle the order of configs, which can be useful to balance load when using parallel computation.
return.objects: Logical indicating whether the objects returned by tsclust() should be given in the result.
packages: A character vector with the names of any packages needed for any functions used (distance, centroid, preprocessing, etc.). The name "dtwclust" is added automatically. Relevant for parallel computation.
.errorhandling: This will be passed to foreach::foreach(). See Parallel section below.
Returns
A list with:
results: A list of data frames with the flattened configs and the corresponding scores returned by score.clus.
scores: The scores given by score.clus.
pick: The object returned by pick.clus.
proc_time: The measured execution time, using base::proc.time().
seeds: A list of lists with the random seeds computed for each configuration.
The cluster objects are also returned if return.objects=TRUE.
Details
This function calls tsclust() with different configurations and evaluates the results with the provided functions. Parallel support is included. See the examples.
Parameters specified in configs whose values are NA will be ignored automatically.
The scoring and picking functions are provided for convenience; if they are not specified, the scores and pick elements of the result will be NULL.
See repeat_clustering() for when return.objects = FALSE.
Parallel computation
The configurations for each clustering type can be evaluated in parallel (multi-processing) with the foreach package. A parallel backend can be registered, e.g., with doParallel.
If the .errorhandling parameter is changed to "pass" and a custom score.clus function is used, said function should be able to deal with possible error objects.
If it is changed to "remove", it might not be possible to attach the scores to the results data frame, or it may be inconsistent. Additionally, if return.objects is TRUE, the names given to the objects might also be inconsistent.
Parallelization can incur a lot of deep copies of data when returning the cluster objects, since each one will contain a copy of datalist. If you want to avoid this, consider specifying score.clus and setting return.objects to FALSE, and then using repeat_clustering().
Scoring
The clustering results are organized in a list of lists in the following way, where only the applicable types are present and the first-level names are the clustering types:

partitional - list with
    Clustering results from first partitional config
    etc.
hierarchical - list with
    Clustering results from first hierarchical config
    etc.
fuzzy - list with
    Clustering results from first fuzzy config
    etc.
tadpole - list with
    Clustering results from first tadpole config
    etc.
If score.clus is a function, it will be applied to the available partitional, hierarchical, fuzzy and/or tadpole results via:
scores <- lapply(list_of_lists, score.clus, ...)
Otherwise, score.clus should be a list of functions with the same names as the list above, so that score.clus$partitional is used to score list_of_lists$partitional and so on (via base::Map()).
Therefore, the scores returned will always be a list of lists with first-level names as above.
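For example, the named-list dispatch can be mimicked with plain base R. This is a minimal sketch: mock_results and the toy nchar-based scoring functions are stand-ins for real TSClusters objects and CVI computations.

```r
# Mock results: one sublist per clustering type, as described above
# (character placeholders stand in for real TSClusters objects)
mock_results <- list(
    partitional = list("p-config-1", "p-config-2"),
    fuzzy = list("f-config-1")
)

# Named list of toy scoring functions, one per type present
score_fns <- list(
    partitional = function(results, ...) sapply(results, nchar),
    fuzzy = function(results, ...) sapply(results, nchar)
)

# Mirrors the internal dispatch via base::Map():
# each type's function scores that type's list of results
scores <- Map(function(f, res) f(res), score_fns[names(mock_results)], mock_results)
str(scores)
```

The resulting scores keep the same first-level names ("partitional", "fuzzy") as the input, which is what allows them to be attached back to the corresponding results data frames.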
Picking
If return.objects is TRUE, the results' data frames and the list of TSClusters objects are given to pick.clus as the first and second arguments respectively, followed by .... Otherwise, pick.clus will receive only the data frames and the contents of ... (since the objects will not be returned by the preceding step).
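As an illustration, a custom pick.clus for use with return.objects = TRUE could follow this sketch. The VI column name and the which.min criterion are assumptions for a distance-like CVI; the actual score columns depend on what score.clus attaches.

```r
# Hypothetical picker: assumes score.clus added a "VI" column
# (Variation of Information; lower is better) to each results data frame
pick_lowest_vi <- function(results, objects = NULL, ...) {
    # For each clustering type, keep the row with the smallest VI
    lapply(results, function(df) df[which.min(df$VI), , drop = FALSE])
}

# Toy input mimicking the flattened configs with their scores attached
mock_results <- list(
    partitional = data.frame(k = c(19L, 20L), VI = c(0.4, 0.2))
)
pick_lowest_vi(mock_results)
```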
Limitations
Note that the configurations returned by the helper functions assign special names to preprocessing/distance/centroid arguments, and these names are used internally to recognize them.
If some of these arguments are more complex (e.g. matrices) and should not be expanded, consider passing them directly via the ellipsis (...) instead of using pdc_configs(). This assumes that said arguments can be passed to all functions without affecting their results.
The distance matrices (if calculated) are not re-used across configurations. Given the way the configurations are created, this shouldn't matter, because clusterings with arguments that can use the same distance matrix are already grouped together by compare_clusterings_configs()
and pdc_configs().
Examples
# Fuzzy preprocessing: calculate autocorrelation up to 50th lag
acf_fun <- function(series, ...) {
    lapply(series, function(x) {
        as.numeric(acf(x, lag.max = 50, plot = FALSE)$acf)
    })
}

# Define overall configuration
cfgs <- compare_clusterings_configs(
    types = c("p", "h", "f", "t"),
    k = 19L:20L,
    controls = list(
        partitional = partitional_control(
            iter.max = 30L,
            nrep = 1L
        ),
        hierarchical = hierarchical_control(
            method = "all"
        ),
        fuzzy = fuzzy_control(
            # notice the vector
            fuzziness = c(2, 2.5),
            iter.max = 30L
        ),
        tadpole = tadpole_control(
            # notice the vectors
            dc = c(1.5, 2),
            window.size = 19L:20L
        )
    ),
    preprocs = pdc_configs(
        type = "preproc",
        # shared
        none = list(),
        zscore = list(center = c(FALSE)),
        # only for fuzzy
        fuzzy = list(
            acf_fun = list()
        ),
        # only for tadpole
        tadpole = list(
            reinterpolate = list(new.length = 205L)
        ),
        # specify which should consider the shared ones
        share.config = c("p", "h")
    ),
    distances = pdc_configs(
        type = "distance",
        sbd = list(),
        fuzzy = list(
            L2 = list()
        ),
        share.config = c("p", "h")
    ),
    centroids = pdc_configs(
        type = "centroid",
        partitional = list(
            pam = list()
        ),
        # special name 'default'
        hierarchical = list(
            default = list()
        ),
        fuzzy = list(
            fcmdd = list()
        ),
        tadpole = list(
            default = list(),
            shape_extraction = list(znorm = TRUE)
        )
    )
)

# Number of configurations is returned as attribute
num_configs <- sapply(cfgs, attr, which = "num.configs")
cat("\nTotal number of configurations without considering optimizations:",
    sum(num_configs), "\n\n")

# Define evaluation functions based on CVI: Variation of Information (only crisp partition)
vi_evaluators <- cvi_evaluators("VI", ground.truth = CharTrajLabels)
score_fun <- vi_evaluators$score
pick_fun <- vi_evaluators$pick
# ====================================================================================
# Short run with only fuzzy clustering
# ====================================================================================

comparison_short <- compare_clusterings(CharTraj, types = c("f"), configs = cfgs,
                                        seed = 293L, trace = TRUE,
                                        score.clus = score_fun,
                                        pick.clus = pick_fun,
                                        return.objects = TRUE)

## Not run:
# ====================================================================================
# Parallel run with all comparisons
# ====================================================================================

require(doParallel)
registerDoParallel(cl <- makeCluster(detectCores()))

comparison_long <- compare_clusterings(CharTraj, types = c("p", "h", "f", "t"),
                                       configs = cfgs,
                                       seed = 293L, trace = TRUE,
                                       score.clus = score_fun,
                                       pick.clus = pick_fun,
                                       return.objects = TRUE)

# Using all external CVIs and majority vote
external_evaluators <- cvi_evaluators("external", ground.truth = CharTrajLabels)
score_external <- external_evaluators$score
pick_majority <- external_evaluators$pick
comparison_majority <- compare_clusterings(CharTraj, types = c("p", "h", "f", "t"),
                                           configs = cfgs,
                                           seed = 84L, trace = TRUE,
                                           score.clus = score_external,
                                           pick.clus = pick_majority,
                                           return.objects = TRUE)

# best results
plot(comparison_majority$pick$object)
print(comparison_majority$pick$config)

stopCluster(cl); registerDoSEQ()

# ====================================================================================
# A run with only partitional clusterings
# ====================================================================================

p_cfgs <- compare_clusterings_configs(
    types = "p", k = 19L:21L,
    controls = list(
        partitional = partitional_control(
            iter.max = 20L,
            nrep = 8L
        )
    ),
    preprocs = pdc_configs(
        "preproc",
        none = list(),
        zscore = list(center = c(FALSE, TRUE))
    ),
    distances = pdc_configs(
        "distance",
        sbd = list(),
        dtw_basic = list(window.size = 19L:20L,
                         norm = c("L1", "L2")),
        gak = list(window.size = 19L:20L,
                   sigma = 100)
    ),
    centroids = pdc_configs(
        "centroid",
        partitional = list(
            pam = list(),
            shape = list()
        )
    )
)

# Remove redundant (shape centroid always uses zscore preprocessing)
id_redundant <- p_cfgs$partitional$preproc == "none" &
    p_cfgs$partitional$centroid == "shape"
p_cfgs$partitional <- p_cfgs$partitional[!id_redundant, ]

# LONG! 30 minutes or so, sequentially
comparison_partitional <- compare_clusterings(CharTraj, types = "p",
                                              configs = p_cfgs,
                                              seed = 32903L, trace = TRUE,
                                              score.clus = score_fun,
                                              pick.clus = pick_fun,
                                              shuffle.configs = TRUE,
                                              return.objects = TRUE)

## End(Not run)