distantia_cluster_hclust function

Hierarchical Clustering of Dissimilarity Analysis Data Frames

This function combines the dissimilarity scores computed by distantia(), the agglomerative clustering methods provided by stats::hclust(), and the clustering optimization method implemented in utils_cluster_hclust_optimizer() to help group together time series with similar features.

When clusters = NULL, the function calls utils_cluster_hclust_optimizer() internally to run a parallelized grid search for the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()). When method = NULL as well, the grid search also covers all agglomerative methods available in stats::hclust().
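
The grid search described above can be sketched with base R alone. This is a minimal illustration of the idea, not the code of utils_cluster_hclust_optimizer(): the helper mean_silhouette(), the toy data, and the three candidate methods are assumptions made for this example.

```r
# Hypothetical helper: mean silhouette width of a clustering solution.
# d is a "dist" object, labels is an integer vector of cluster labels.
mean_silhouette <- function(d, labels) {
  m <- as.matrix(d)
  widths <- vapply(seq_len(nrow(m)), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE
    # singleton clusters get a silhouette width of 0 by convention
    if (!any(own)) return(0)
    # a: mean distance to the other members of the same cluster
    a <- mean(m[i, own])
    # b: mean distance to the members of the nearest other cluster
    b <- min(vapply(
      setdiff(unique(labels), labels[i]),
      function(g) mean(m[i, labels == g]),
      numeric(1)
    ))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# toy data (made up): six items forming two well separated groups
x <- c(0, 0.1, 0.2, 5, 5.1, 5.2)
names(x) <- paste0("ts_", seq_along(x))
d <- stats::dist(x)

# grid of candidate solutions: number of clusters x agglomerative method
grid <- expand.grid(
  clusters = 2:(length(x) - 1),
  method = c("complete", "average", "single"),
  stringsAsFactors = FALSE
)

# score every combination by its mean silhouette width
grid$silhouette_width <- vapply(seq_len(nrow(grid)), function(i) {
  tree <- stats::hclust(d, method = grid$method[i])
  labels <- stats::cutree(tree, k = grid$clusters[i])
  mean_silhouette(d, labels)
}, numeric(1))

# keep the combination with the highest score
best <- grid[which.max(grid$silhouette_width), ]
best
```

With two clearly separated groups, the two-cluster solutions score highest. On real dissimilarity data the ranking of methods and cluster counts is usually less clear-cut, which is why inspecting the full optimization table is worthwhile.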

This function supports a parallelization setup via future::plan(), and progress bars provided by the package progressr.
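
A possible parallelization and progress bar setup is sketched below. It assumes the packages distantia, future, and progressr are installed; the number of workers is an arbitrary choice for illustration, and the data preparation mirrors the Examples section.

```r
library(distantia)

# run the optimization on two background R sessions
# (workers = 2 is an arbitrary value for this sketch)
future::plan(future::multisession, workers = 2)

# dissimilarity data frame, built as in the Examples section
tsl <- tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)

tsl <- tsl_subset(
  tsl = tsl,
  names = 1:10
)

distantia_df <- distantia(
  tsl = tsl,
  lock_step = TRUE
)

# with_progress() displays the progress signaled inside the call
distantia_clust <- progressr::with_progress(
  distantia_cluster_hclust(
    df = distantia_df,
    clusters = NULL,
    method = NULL
  )
)

# return to sequential execution when done
future::plan(future::sequential)
```

For a data set as small as this one, the overhead of starting parallel workers may outweigh the gain; the setup is more likely to pay off with many time series.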

distantia_cluster_hclust(df = NULL, clusters = NULL, method = "complete")

Arguments

  • df: (required, data frame) Output of distantia(), distantia_ls(), distantia_dtw(), or distantia_time_delay(). Default: NULL
  • clusters: (optional, integer) Number of groups to generate. If NULL (default), utils_cluster_hclust_optimizer() is used to find the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()). Default: NULL
  • method: (optional, character string) Argument of stats::hclust() defining the agglomerative method. One of: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). Unambiguous abbreviations are accepted as well. If NULL, utils_cluster_hclust_optimizer() finds the optimal method. Default: "complete".

Returns

list:

  • cluster_object: hclust object for further analyses and custom plotting.
  • clusters: integer, number of clusters.
  • silhouette_width: mean silhouette width of the clustering solution.
  • df: data frame with time series names, their cluster label, and their individual silhouette width scores.
  • d: psi distance matrix used for clustering.
  • optimization: only if clusters = NULL, data frame with optimization results from utils_cluster_hclust_optimizer().
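
As an illustration of the "further analyses" an hclust object such as cluster_object allows, the sketch below uses only functions from the stats package. The toy data and the object names are made up for this example; with real output, the tree would come from the cluster_object element and the distances from d.

```r
# toy data (made up): six items forming two well separated groups
x <- c(0, 0.1, 0.2, 5, 5.1, 5.2)
names(x) <- paste0("ts_", seq_along(x))

# distance matrix and hierarchical clustering
d <- stats::dist(x)
tree <- stats::hclust(d, method = "complete")

# cut the same tree at different numbers of clusters
labels_k2 <- stats::cutree(tree, k = 2)
labels_k3 <- stats::cutree(tree, k = 3)

# cophenetic correlation: how faithfully the tree
# preserves the original pairwise distances
cophenetic_cor <- stats::cor(d, stats::cophenetic(tree))

# convert to a dendrogram for custom plotting
dend <- stats::as.dendrogram(tree)
```

Cutting the stored tree at a different k does not require re-running the clustering, which makes it cheap to compare alternative groupings.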

Examples

#weekly covid prevalence in California
tsl <- tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)

#subset 10 elements to accelerate example execution
tsl <- tsl_subset(
  tsl = tsl,
  names = 1:10
)

if(interactive()){
  #plotting first three time series
  tsl_plot(
    tsl = tsl[1:3],
    guide_columns = 3
  )
}

#dissimilarity analysis
distantia_df <- distantia(
  tsl = tsl,
  lock_step = TRUE
)

#hierarchical clustering
#automated number of clusters
#automated method selection
distantia_clust <- distantia_cluster_hclust(
  df = distantia_df,
  clusters = NULL,
  method = NULL
)

#names of the output object
names(distantia_clust)

#cluster object
distantia_clust$cluster_object

#distance matrix used for clustering
distantia_clust$d

#number of clusters
distantia_clust$clusters

#clustering data frame
#group label in column "cluster"
#negatives in column "silhouette_width" highlight anomalous cluster assignment
distantia_clust$df

#mean silhouette width of the clustering solution
distantia_clust$silhouette_width

#plot
if(interactive()){

  dev.off()

  clust <- distantia_clust$cluster_object
  k <- distantia_clust$clusters

  #tree plot
  plot(
    x = clust,
    hang = -1
  )

  #highlight groups
  stats::rect.hclust(
    tree = clust,
    k = k,
    cluster = stats::cutree(
      tree = clust,
      k = k
    )
  )

}

See Also

Other distantia_support: distantia_aggregate(), distantia_boxplot(), distantia_cluster_kmeans(), distantia_matrix(), distantia_model_frame(), distantia_spatial(), distantia_stats(), distantia_time_delay(), utils_block_size(), utils_cluster_hclust_optimizer(), utils_cluster_kmeans_optimizer(), utils_cluster_silhouette()

  • Maintainer: Blas M. Benito
  • License: MIT + file LICENSE
  • Last published: 2025-02-01