Hierarchical Clustering of Dissimilarity Analysis Data Frames
This function combines the dissimilarity scores computed by distantia(), the agglomerative clustering methods provided by stats::hclust(), and the clustering optimization method implemented in utils_cluster_hclust_optimizer() to help group together time series with similar features.
When clusters = NULL, the function calls utils_cluster_hclust_optimizer() internally to run a parallelized grid search for the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()). When method = NULL as well, the grid search also covers all agglomeration methods available in stats::hclust().
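The idea behind such a grid search can be illustrated with a small base R sketch. This is not the code of utils_cluster_hclust_optimizer(): the toy data, the helper mean_silhouette(), the column name silhouette_mean, and the reduced set of methods are assumptions made for illustration, and the loop runs serially rather than in parallel.

```r
# Toy data (hypothetical): two well-separated groups of three points each.
x <- rbind(
  matrix(c(0, 0.1, 0.2, 0, 0.1, 0.2), ncol = 2),
  matrix(c(5, 5.1, 5.2, 5, 5.1, 5.2), ncol = 2)
)
d <- stats::dist(x)

# Hypothetical helper: mean silhouette width of a labelling of a dist object.
mean_silhouette <- function(d, labels) {
  m <- as.matrix(d)
  widths <- vapply(seq_len(nrow(m)), function(i) {
    own <- labels == labels[i]
    own[i] <- FALSE
    if (!any(own)) return(0) # singleton clusters get width 0 by convention
    a <- mean(m[i, own])     # mean distance to the case's own cluster
    other <- setdiff(unique(labels), labels[i])
    # mean distance to the nearest other cluster
    b <- min(vapply(other, function(g) mean(m[i, labels == g]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# Grid of candidate solutions: number of clusters x agglomeration method.
grid <- expand.grid(
  clusters = 2:(attr(d, "Size") - 1),
  method = c("complete", "average", "single"),
  stringsAsFactors = FALSE
)

# Score every combination by its mean silhouette width.
grid$silhouette_mean <- mapply(function(k, method) {
  tree <- stats::hclust(d, method = method)
  mean_silhouette(d, stats::cutree(tree, k = k))
}, grid$clusters, grid$method)

# Keep the combination with the highest score.
best <- grid[which.max(grid$silhouette_mean), ]
best
```

On this toy data the two-group solution scores highest, because splitting either tight group lowers the silhouette widths of its members.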
This function supports a parallelization setup via future::plan(), and progress bars provided by the package progressr.
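A typical setup might look like the sketch below. It assumes the packages future and progressr are installed; the number of workers is an arbitrary choice for illustration, and the call to distantia_cluster_hclust() is left as a placeholder.

```r
# Sketch only: runs if the packages future and progressr are available.
if (requireNamespace("future", quietly = TRUE) &&
    requireNamespace("progressr", quietly = TRUE)) {

  # Run parallelized steps on two background R sessions
  # (two workers is an arbitrary choice for this example).
  future::plan(future::multisession, workers = 2)

  # Progress bars can be enabled for the whole session with the line below;
  # it is commented out here because it changes global state.
  # progressr::handlers(global = TRUE)

  # ... call distantia_cluster_hclust() with clusters = NULL here ...

  # Return to sequential execution when done.
  future::plan(future::sequential)
}
```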
df: (required, data frame) Output of distantia(), distantia_ls(), distantia_dtw(), or distantia_time_delay(). Default: NULL
clusters: (optional, integer) Number of groups to generate. If NULL (default), utils_cluster_hclust_optimizer() is used to find the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()). Default: NULL
method: (optional, character string) Argument of stats::hclust() defining the agglomerative method. One of: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). Unambiguous abbreviations are accepted as well. If NULL, utils_cluster_hclust_optimizer() finds the optimal method. Default: "complete".
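The abbreviation rule comes from stats::hclust() itself, as the following base R sketch on toy data shows. The input vector and object names are made up for illustration.

```r
# Toy distance matrix from five arbitrary values.
d <- stats::dist(c(1, 2, 6, 7, 15))

# "ave" matches a single method name, so it is expanded to "average".
tree <- stats::hclust(d, method = "ave")
tree$method

# "w" matches both "ward.D" and "ward.D2", so stats::hclust() rejects it.
ambiguous <- try(stats::hclust(d, method = "w"), silent = TRUE)
inherits(ambiguous, "try-error")
```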
Returns
list:
cluster_object: hclust object for further analyses and custom plotting.
clusters: integer, number of clusters.
silhouette_width: mean silhouette width of the clustering solution.
df: data frame with time series names, their cluster label, and their individual silhouette width scores.
d: psi distance matrix used for clustering.
optimization: only if clusters = NULL, data frame with optimization results from utils_cluster_hclust_optimizer().
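Because cluster_object is a regular hclust object, it can be cut again at a different number of groups with stats::cutree(). The sketch below uses a toy tree as a stand-in for the cluster_object element of the output; the data and the column names of the resulting data frame are assumptions for illustration.

```r
# Stand-in for the cluster_object element, built from toy data.
d <- stats::dist(c(a = 1, b = 2, c = 6, d = 7, e = 15))
cluster_object <- stats::hclust(d, method = "complete")

# Cut the tree into three groups instead of the number stored in clusters.
labels_k3 <- stats::cutree(cluster_object, k = 3)

# Data frame with one row per series and its group label
# (column names chosen for this example).
df_k3 <- data.frame(
  name = names(labels_k3),
  cluster = unname(labels_k3)
)
df_k3
```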
Examples
#weekly covid prevalence in California
tsl <- tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)

#subset 10 elements to accelerate example execution
tsl <- tsl_subset(tsl = tsl, names = 1:10)

if(interactive()){
  #plotting first three time series
  tsl_plot(tsl = tsl[1:3], guide_columns = 3)
}

#dissimilarity analysis
distantia_df <- distantia(tsl = tsl, lock_step = TRUE)

#hierarchical clustering
#automated number of clusters
#automated method selection
distantia_clust <- distantia_cluster_hclust(
  df = distantia_df,
  clusters = NULL,
  method = NULL
)

#names of the output object
names(distantia_clust)

#cluster object
distantia_clust$cluster_object

#distance matrix used for clustering
distantia_clust$d

#number of clusters
distantia_clust$clusters

#clustering data frame
#group label in column "cluster"
#negatives in column "silhouette_width" highlight anomalous cluster assignment
distantia_clust$df

#mean silhouette width of the clustering solution
distantia_clust$silhouette_width

#plot
if(interactive()){
  dev.off()
  clust <- distantia_clust$cluster_object
  k <- distantia_clust$clusters

  #tree plot
  plot(x = clust, hang = -1)

  #highlight groups
  stats::rect.hclust(
    tree = clust,
    k = k,
    cluster = stats::cutree(tree = clust, k = k)
  )
}