distantia_cluster_kmeans function

K-Means Clustering of Dissimilarity Analysis Data Frames

This function combines the dissimilarity scores computed by distantia(), the K-means clustering method implemented in stats::kmeans(), and the clustering optimization method implemented in utils_cluster_kmeans_optimizer() to help group together time series with similar features.

When clusters = NULL, the function utils_cluster_kmeans_optimizer() is run internally to perform a parallelized grid search for the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()).
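The grid search described above can be sketched in base R. This is a minimal illustration, not the code of utils_cluster_kmeans_optimizer(): the helper mean_silhouette(), the toy data, the candidate range 2:6, and the choice to run stats::kmeans() directly on the rows of the distance matrix are all assumptions made for the example.

```r
# Hypothetical helper: mean silhouette width of a clustering, in base R.
# d: square distance matrix; labels: integer cluster labels.
mean_silhouette <- function(d, labels) {
  d <- as.matrix(d)
  n <- nrow(d)
  widths <- numeric(n)
  for (i in seq_len(n)) {
    own <- labels == labels[i]
    own[i] <- FALSE
    # singleton clusters keep a silhouette width of 0 by convention
    if (!any(own)) next
    # a: mean distance to the other members of the same cluster
    a <- mean(d[i, own])
    # b: smallest mean distance to the members of any other cluster
    others <- setdiff(unique(labels), labels[i])
    b <- min(vapply(others, function(k) mean(d[i, labels == k]), numeric(1)))
    if (max(a, b) > 0) widths[i] <- (b - a) / max(a, b)
  }
  mean(widths)
}

# Toy data: three well-separated groups of ten points each
set.seed(1)
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)
d <- as.matrix(dist(x))

# Grid search: one k-means run per candidate number of clusters
candidates <- 2:6
scores <- vapply(candidates, function(k) {
  set.seed(1)
  km <- stats::kmeans(d, centers = k, nstart = 10)
  mean_silhouette(d, km$cluster)
}, numeric(1))

# Keep the candidate with the highest mean silhouette width
best_k <- candidates[which.max(scores)]
best_k
```

The package runs this kind of search in parallel; the sequential vapply() loop here only shows the selection criterion.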

This function supports a parallelization setup via future::plan(), and progress bars provided by the package progressr.
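A typical session setup might look like the sketch below. It assumes the future and progressr packages are installed; the multisession backend and the worker count of 2 are arbitrary choices for illustration, not recommendations from the package.

```r
# Run parallelized steps on two background R sessions
# (backend and number of workers are illustrative choices)
future::plan(future::multisession, workers = 2)

# Optional: enable progress reporting for the whole session
progressr::handlers(global = TRUE)

# ... call distantia_cluster_kmeans() with clusters = NULL here ...

# Return to sequential execution when done
future::plan(future::sequential)
```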

distantia_cluster_kmeans(df = NULL, clusters = NULL, seed = 1)

Arguments

  • df: (required, data frame) Output of distantia(), distantia_ls(), distantia_dtw(), or distantia_time_delay(). Default: NULL
  • clusters: (optional, integer) Number of groups to generate. If NULL, utils_cluster_kmeans_optimizer() is used to find the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()). Default: NULL
  • seed: (optional, integer) Random seed to be used during the K-means computation. Default: 1
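The seed matters because K-means starts from randomly chosen centers, so two runs on the same data can assign different labels. The sketch below shows the general idea with stats::kmeans() on toy data; the wrapper run_kmeans() is hypothetical, and how distantia_cluster_kmeans() applies the seed internally is not shown here.

```r
# Toy data: 50 random points in two dimensions
set.seed(42)
x <- matrix(rnorm(100), ncol = 2)

# Hypothetical wrapper: fix the random seed before each k-means run
run_kmeans <- function(x, centers, seed) {
  set.seed(seed)
  stats::kmeans(x, centers = centers)$cluster
}

# Two runs with the same seed yield identical cluster labels
same_a <- run_kmeans(x, centers = 3, seed = 1)
same_b <- run_kmeans(x, centers = 3, seed = 1)
identical(same_a, same_b)
```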

Returns

list:

  • cluster_object: kmeans object for further analyses and custom plotting.
  • clusters: integer, number of clusters.
  • silhouette_width: mean silhouette width of the clustering solution.
  • df: data frame with time series names, their cluster label, and their individual silhouette width scores.
  • d: psi distance matrix used for clustering.
  • optimization: only if clusters = NULL, data frame with optimization results from utils_cluster_kmeans_optimizer().

Examples

#weekly covid prevalence in California
tsl <- tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)

#subset 10 elements to accelerate example execution
tsl <- tsl_subset(
  tsl = tsl,
  names = 1:10
)

if(interactive()){
  #plotting first three time series
  tsl_plot(
    tsl = tsl[1:3],
    guide_columns = 3
  )
}

#dissimilarity analysis
distantia_df <- distantia(
  tsl = tsl,
  lock_step = TRUE
)

#k-means clustering
#automated number of clusters
distantia_kmeans <- distantia_cluster_kmeans(
  df = distantia_df,
  clusters = NULL
)

#names of the output object
names(distantia_kmeans)

#kmeans object
distantia_kmeans$cluster_object

#distance matrix used for clustering
distantia_kmeans$d

#number of clusters
distantia_kmeans$clusters

#clustering data frame
#group label in column "cluster"
distantia_kmeans$df

#mean silhouette width of the clustering solution
distantia_kmeans$silhouette_width

#kmeans plot
# factoextra::fviz_cluster(
#   object = distantia_kmeans$cluster_object,
#   data = distantia_kmeans$d,
#   repel = TRUE
# )

See Also

Other distantia_support: distantia_aggregate(), distantia_boxplot(), distantia_cluster_hclust(), distantia_matrix(), distantia_model_frame(), distantia_spatial(), distantia_stats(), distantia_time_delay(), utils_block_size(), utils_cluster_hclust_optimizer(), utils_cluster_kmeans_optimizer(), utils_cluster_silhouette()

  • Maintainer: Blas M. Benito
  • License: MIT + file LICENSE
  • Last published: 2025-02-01