K-Means Clustering of Dissimilarity Analysis Data Frames
This function combines the dissimilarity scores computed by distantia(), the K-means clustering method implemented in stats::kmeans(), and the clustering optimization method implemented in utils_cluster_kmeans_optimizer() to group time series with similar features.
When clusters = NULL, utils_cluster_kmeans_optimizer() runs a parallelized grid search to find the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()).
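The idea behind the silhouette-based search can be sketched with base R and the cluster package. This is a minimal illustration on toy data, not the internals of utils_cluster_kmeans_optimizer(): the toy matrix x, the candidate range 2:6, the nstart value, and the object names optimization and best_k are all assumptions made for the example.

```r
# Assumes the "cluster" package is installed (a recommended package
# bundled with most R distributions).
library(cluster)

set.seed(1)

# Toy data (illustrative only): three well-separated groups of 20 points each.
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

# Pairwise distances, needed to compute silhouette widths.
d <- stats::dist(x)

# Candidate numbers of clusters to evaluate (assumed range).
k_candidates <- 2:6

# Mean silhouette width of the K-means solution for each candidate.
silhouette_mean <- vapply(
  k_candidates,
  function(k) {
    km <- stats::kmeans(x, centers = k, nstart = 10)
    sil <- cluster::silhouette(km$cluster, d)
    mean(sil[, "sil_width"])
  },
  numeric(1)
)

# Grid search results and the candidate with the highest mean silhouette width.
optimization <- data.frame(
  clusters = k_candidates,
  silhouette_mean = silhouette_mean
)

best_k <- optimization$clusters[which.max(optimization$silhouette_mean)]
```

Each candidate is scored independently of the others, which is why a search of this kind parallelizes naturally.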
This function supports a parallelization setup via future::plan(), and progress bars provided by the package progressr.
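A parallelization and progress setup might look like the sketch below. It uses only the future and progressr packages; the choice of two workers and of the "txtprogressbar" handler is an assumption, and the call to distantia_cluster_kmeans() is left as a comment because it requires a data frame produced by distantia().

```r
# Assumes the future and progressr packages are installed.
library(future)
library(progressr)

# Evaluate future-aware code on two background R sessions
# (the number of workers is an arbitrary choice for this sketch).
plan(multisession, workers = 2)

# Select a text progress bar for code that signals progress via progressr.
handlers("txtprogressbar")

# The clustering call would then be wrapped in with_progress(), for example:
# with_progress({
#   distantia_kmeans <- distantia_cluster_kmeans(
#     df = distantia_df,
#     clusters = NULL
#   )
# })

# Restore sequential execution once the analysis is done.
plan(sequential)
```

For small inputs, the overhead of starting worker sessions can outweigh any speed-up, so a sequential plan may be the better choice.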
df: (required, data frame) Output of distantia(), distantia_ls(), distantia_dtw(), or distantia_time_delay(). Default: NULL
clusters: (required, integer) Number of groups to generate. If NULL (default), utils_cluster_kmeans_optimizer() is used to find the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()). Default: NULL
seed: (optional, integer) Random seed to be used during the K-means computation. Default: 1
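K-means starts from randomly chosen centers, so a fixed seed is what makes a solution reproducible. The sketch below shows the general pattern with stats::kmeans() on the built-in iris data; the helper kmeans_with_seed() is hypothetical and only assumes that the seed is set immediately before the K-means call, which may differ from how this function applies it internally.

```r
# Hypothetical helper: set the random seed right before running K-means.
kmeans_with_seed <- function(x, centers, seed = 1) {
  set.seed(seed)
  stats::kmeans(x, centers = centers)
}

# Built-in numeric data, used here for illustration only.
x <- as.matrix(datasets::iris[, 1:4])

# Two runs with the same seed start from the same initial centers...
run_1 <- kmeans_with_seed(x, centers = 3, seed = 1)
run_2 <- kmeans_with_seed(x, centers = 3, seed = 1)

# ...and therefore return the same cluster assignments.
same_clusters <- identical(run_1$cluster, run_2$cluster)
```

Runs with different seeds may label or even partition the data differently, so the seed is worth recording alongside the results.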
Returns
list:
cluster_object: kmeans object for further analyses and custom plotting.
clusters: integer, number of clusters.
silhouette_width: mean silhouette width of the clustering solution.
df: data frame with time series names, their cluster label, and their individual silhouette width scores.
d: psi distance matrix used for clustering.
optimization: only if clusters = NULL, data frame with optimization results from utils_cluster_kmeans_optimizer().
Examples
#weekly covid prevalence in California
tsl <- tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)

#subset 10 elements to accelerate example execution
tsl <- tsl_subset(
  tsl = tsl,
  names = 1:10
)

if(interactive()){
  #plotting first three time series
  tsl_plot(
    tsl = tsl[1:3],
    guide_columns = 3
  )
}

#dissimilarity analysis
distantia_df <- distantia(
  tsl = tsl,
  lock_step = TRUE
)

#k-means clustering
#automated number of clusters
distantia_kmeans <- distantia_cluster_kmeans(
  df = distantia_df,
  clusters = NULL
)

#names of the output object
names(distantia_kmeans)

#kmeans object
distantia_kmeans$cluster_object

#distance matrix used for clustering
distantia_kmeans$d

#number of clusters
distantia_kmeans$clusters

#clustering data frame
#group label in column "cluster"
distantia_kmeans$df

#mean silhouette width of the clustering solution
distantia_kmeans$silhouette_width

#kmeans plot
# factoextra::fviz_cluster(
#   object = distantia_kmeans$cluster_object,
#   data = distantia_kmeans$d,
#   repel = TRUE
# )