mt_cluster_k() R function from the mousetrap package

Estimate optimal number of clusters.

Estimates the optimal number of clusters (k) using various methods.


mt_cluster_k(
  data,
  use = "ln_trajectories",
  dimensions = c("xpos", "ypos"),
  kseq = 2:15,
  compute = c("stability", "gap", "jump", "slope"),
  method = "hclust",
  weights = rep(1, length(dimensions)),
  pointwise = TRUE,
  minkowski_p = 2,
  hclust_method = "ward.D",
  kmeans_nstart = 10,
  n_bootstrap = 10,
  model_based = FALSE,
  n_gap = 10,
  na_rm = FALSE,
  verbose = FALSE
)

Arguments

data: a mousetrap data object created using one of the mt_import functions (see mt_example for details). Alternatively, a trajectory array can be provided directly (in this case use will be ignored).
use: a character string specifying which trajectory data should be used.
dimensions: a character vector specifying which trajectory variables should be used. Can be of length 2 or 3, for two-dimensional or three-dimensional trajectories respectively.
kseq: a numeric vector specifying the set of candidate values for k. Defaults to 2:15, implying that all values of k within that range are compared using the metrics specified in compute.
compute: character vector specifying the measures to be computed. Can be any subset of c("stability","gap","jump","slope").
method: character string specifying the type of clustering procedure for the stability-based method. Either hclust or kmeans.
weights: numeric vector specifying the relative importance of the variables specified in dimensions. Defaults to a vector of 1s implying equal importance. Technically, each variable is rescaled so that the standard deviation matches the corresponding value in weights. To use the original variables, set weights = NULL.
pointwise: boolean specifying how dissimilarity between trajectories is measured. If TRUE (the default), mt_distmat measures the dissimilarity separately for each trajectory point and then sums the results. If FALSE, mt_distmat measures dissimilarity once, treating the coordinates of all points as independent dimensions. Only relevant if method is "hclust". See mt_distmat for further details.
minkowski_p: an integer specifying the distance metric for the cluster solution. minkowski_p = 1 computes the city-block distance, minkowski_p = 2 (the default) computes the Euclidean distance, minkowski_p = 3 the cubic distance, etc. Only relevant if method is "hclust". See mt_distmat for further details.
hclust_method: character string specifying the linkage criterion used. Passed on to the method argument of hclust. Default is set to ward.D. Only relevant if method is "hclust".
kmeans_nstart: integer specifying the number of reruns of the kmeans procedure. Larger numbers reduce the risk of ending in a poor local minimum. Passed on to the nstart argument of kmeans. Only relevant if method is "kmeans".
n_bootstrap: an integer specifying the number of bootstrap comparisons used by stability. See cStability.
model_based: boolean specifying whether the model-based or the model-free approach should be used by stability when method is kmeans. See cStability and Haslbeck & Wulff (2020).
n_gap: integer specifying the number of simulated datasets used by gap. See Tibshirani et al. (2001).
na_rm: logical specifying whether trajectory points containing NAs should be removed. Removal is done column-wise. That is, if any trajectory has a missing value at, e.g., the 10th recorded position, the 10th position is removed for all trajectories. This is necessary to compute distance between trajectories.
verbose: logical indicating whether function should report its progress.
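
The effect of na_rm, weights, pointwise, and minkowski_p can be illustrated in base R. The following is a minimal sketch, not the package's implementation: the toy array traj, its layout (trajectories x points x variables), and the helper traj_dist are assumptions made for illustration, and the two distance variants follow the description of the pointwise argument above.

```r
# Toy trajectory array (assumed layout: trajectories x points x variables)
traj <- array(
  NA_real_, dim = c(3, 5, 2),
  dimnames = list(c("t1", "t2", "t3"), NULL, c("xpos", "ypos"))
)
traj[, , "xpos"] <- rbind(c(0, -1, -2, -3, -4),
                          c(0, -2, -3, -4, -4),
                          c(0,  1,  0, -2, -4))
traj[, , "ypos"] <- rbind(c(0, 1, 2, 3, 4),
                          c(0, 1, 3, 4, 4),
                          c(0, 2, 3, 4, 4))
traj["t3", 2, "xpos"] <- NA  # one missing value at position 2

# na_rm: drop every position at which any trajectory has a missing value
has_na <- apply(traj, 2, function(slice) any(is.na(slice)))
traj_clean <- traj[, !has_na, , drop = FALSE]

# weights: rescale each variable so that its standard deviation equals its weight
weights <- c(xpos = 1, ypos = 2)
traj_w <- traj_clean
for (v in names(weights)) {
  traj_w[, , v] <- traj_clean[, , v] / sd(traj_clean[, , v]) * weights[[v]]
}

# Hypothetical helper: Minkowski distance between two points x variables matrices
traj_dist <- function(a, b, p = 2, pointwise = TRUE) {
  d <- abs(a - b)
  if (pointwise) {
    sum(rowSums(d^p)^(1 / p))  # one distance per point, then summed
  } else {
    sum(d^p)^(1 / p)           # one distance over all coordinates
  }
}

d_point  <- traj_dist(traj_clean["t1", , ], traj_clean["t2", , ])
d_vector <- traj_dist(traj_clean["t1", , ], traj_clean["t2", , ],
                      pointwise = FALSE)
```

In this sketch the two variants coincide for p = 1, because the city-block distance is already a plain sum over all coordinates, and they differ for p = 2.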

Returns

A list containing two lists that store the results of the different methods. kopt contains the estimated k for each of the methods specified in compute. paths contains the values for each k in kseq as computed by each of the methods specified in compute. The values in kopt are optima for each of the vectors in paths.

Details

mt_cluster_k estimates the number of clusters (k) using four commonly used k-selection methods (specified via compute): cluster stability (stability), the gap statistic (gap), the jump statistic (jump), and the slope statistic (slope).

Cluster stability methods select k as the number of clusters for which the assignment of objects to clusters is most stable across bootstrap samples. This function implements the model-based and model-free methods described by Haslbeck & Wulff (2020). See references.
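
The idea of selecting k by stability across bootstrap samples can be sketched in base R. This is a deliberately simplified illustration and not the procedure used by mt_cluster_k or cStability: it computes a raw, uncorrected instability on shared objects only and omits the correction proposed by Haslbeck & Wulff (2020). The toy data and the helper names cluster_rows and instability_once are assumptions.

```r
set.seed(2)
# Toy data: two well-separated groups in two dimensions
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))
n <- nrow(x)

# Cluster a subset of rows hierarchically and cut the tree into k clusters
cluster_rows <- function(rows, k) {
  cutree(hclust(dist(x[rows, , drop = FALSE]), method = "ward.D"), k = k)
}

# Share of object pairs whose co-membership differs between the clusterings
# of two bootstrap samples, evaluated on the objects both samples contain
instability_once <- function(k) {
  s1 <- unique(sample(n, replace = TRUE))  # duplicates dropped for simplicity
  s2 <- unique(sample(n, replace = TRUE))
  c1 <- cluster_rows(s1, k)
  c2 <- cluster_rows(s2, k)
  shared <- intersect(s1, s2)
  m1 <- c1[match(shared, s1)]
  m2 <- c2[match(shared, s2)]
  same1 <- outer(m1, m1, "==")
  same2 <- outer(m2, m2, "==")
  mean(same1[upper.tri(same1)] != same2[upper.tri(same2)])
}

kseq <- 2:5
n_bootstrap <- 10
instab <- sapply(kseq, function(k) {
  mean(replicate(n_bootstrap, instability_once(k)))
})

# The most stable k is the one with the lowest average instability
k_stab <- kseq[which.min(instab)]
```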

The remaining three methods select k as the value that optimizes the gap statistic (Tibshirani, Walther, & Hastie, 2001), the jump statistic (Sugar & James, 2003), and the slope statistic (Fujita, Takahashi, & Patriota, 2014), respectively.
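
As an illustration of one of these criteria, the gap statistic can be sketched in base R. This is a simplified version under stated assumptions and not the code behind mt_cluster_k: it clusters with kmeans, draws reference data uniformly over the range of each variable, uses the sample standard deviation, and includes k = 1 as a candidate. The toy data and the helper log_w are hypothetical.

```r
set.seed(1)
# Toy data: two well-separated groups in two dimensions
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))

# Log of the total within-cluster sum of squares of a k-means solution
log_w <- function(data, k) {
  log(kmeans(data, centers = k, nstart = 10)$tot.withinss)
}

kseq  <- 1:5
n_gap <- 10  # number of simulated reference datasets

# Observed dispersion for each candidate k
obs <- sapply(kseq, function(k) log_w(x, k))

# Dispersion of reference datasets; result is an n_gap x length(kseq) matrix
ref <- sapply(kseq, function(k) {
  replicate(n_gap, {
    u <- apply(x, 2, function(col) runif(length(col), min(col), max(col)))
    log_w(u, k)
  })
})

gap <- colMeans(ref) - obs
se  <- apply(ref, 2, sd) * sqrt(1 + 1 / n_gap)

# Smallest k whose gap is at least gap(k + 1) minus its standard error
ok    <- gap[-length(gap)] >= gap[-1] - se[-1]
k_gap <- kseq[which(ok)[1]]
```

The selection rule in the last step is the one proposed by Tibshirani et al. (2001); picking the k with the largest gap is a common simpler alternative.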

For clustering trajectories, it is often useful that the endpoints of all trajectories share the same direction, e.g., that all trajectories end in the top-left corner of the coordinate system (mt_remap_symmetric or mt_align can be used to achieve this). Furthermore, it is recommended to use length-normalized trajectories (see mt_length_normalize; Wulff et al., 2019).
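
What it means for all endpoints to share the same direction can be shown with a small base R sketch. It only illustrates the idea of mirroring trajectories onto one side and is not how mt_remap_symmetric or mt_align are implemented; the toy array and its layout (trajectories x points x variables) are assumptions.

```r
# Toy array (assumed layout: trajectories x points x variables)
traj <- array(
  0, dim = c(2, 3, 2),
  dimnames = list(c("left", "right"), NULL, c("xpos", "ypos"))
)
traj["left", , "xpos"]  <- c(0, -1, -2)
traj["right", , "xpos"] <- c(0, 1, 2)
traj[, , "ypos"] <- rbind(c(0, 1, 2), c(0, 1, 2))

# Mirror every trajectory whose last point lies on the right-hand side,
# so that all trajectories end on the left
n_points   <- dim(traj)[2]
ends_right <- traj[, n_points, "xpos"] > 0
traj[ends_right, , "xpos"] <- -traj[ends_right, , "xpos"]
```

After this step both toy trajectories end in the top-left corner, so a clustering compares their shapes rather than the side on which they ended.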

Examples


## Not run:

# Load the package and length normalize the trajectories
# of the KH2017 example data
library(mousetrap)
KH2017 <- mt_length_normalize(KH2017)

# Find k
results <- mt_cluster_k(KH2017, use="ln_trajectories")

# Retrieve results
results$kopt
results$paths
## End(Not run)

References

Haslbeck, J. M. B., & Wulff, D. U. (2020). Estimating the Number of Clusters via a Corrected Clustering Instability. Computational Statistics, 35, 1879–1894.

Wulff, D. U., Haslbeck, J. M. B., Kieslich, P. J., Henninger, F., & Schulte-Mecklenbeck, M. (2019). Mouse-tracking: Detecting types in movement trajectories. In M. Schulte-Mecklenbeck, A. Kühberger, & J. G. Johnson (Eds.), A Handbook of Process Tracing Methods (pp. 131-145). New York, NY: Routledge.

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.

Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 750-763.

Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.

Author(s)

Dirk U. Wulff

Jonas M. B. Haslbeck

mousetrap package

Maintainer: Pascal J. Kieslich
License: GPL-3
Last published: 2024-01-19
