data: a mousetrap data object created using one of the mt_import functions (see mt_example for details). Alternatively, a trajectory array can be provided directly (in this case use will be ignored).
use: a character string specifying which trajectory data should be used.
dimensions: a character vector specifying which trajectory variables should be used. Can be of length 2 or 3, for two-dimensional or three-dimensional trajectories respectively.
kseq: a numeric vector specifying set of candidates for k. Defaults to 2:15, implying that all values of k within that range are compared using the metrics specified in compute.
compute: character vector specifying the to be computed measures. Can be any subset of c("stability","gap","jump","slope").
method: character string specifying the type of clustering procedure for the stability-based method. Either hclust or kmeans.
weights: numeric vector specifying the relative importance of the variables specified in dimensions. Defaults to a vector of 1s implying equal importance. Technically, each variable is rescaled so that the standard deviation matches the corresponding value in weights. To use the original variables, set weights = NULL.
pointwise: boolean specifying the way in which dissimilarity between the trajectories is measured. If TRUE (the default), mt_distmat measures the average dissimilarity and then sums the results. If FALSE, mt_distmat measures dissimilarity once (by treating the various points as independent dimensions). This is only relevant if method is "hclust". See mt_distmat for further details.
minkowski_p: an integer specifying the distance metric for the cluster solution. minkowski_p = 1 computes the city-block distance, minkowski_p = 2 (the default) computes the Euclidian distance, minkowski_p = 3 the cubic distance, etc. Only relevant if method is "hclust". See mt_distmat for further details.
hclust_method: character string specifying the linkage criterion used. Passed on to the method argument of hclust . Default is set to ward.D. Only relevant if method is "hclust".
kmeans_nstart: integer specifying the number of reruns of the kmeans procedure. Larger numbers minimize the risk of finding local minima. Passed on to the nstart argument of kmeans . Only relevant if method is "kmeans".
n_bootstrap: an integer specifying the number of bootstrap comparisons used by stability. See cStability .
model_based: boolean specifying whether the model-based or the model-free should be used by stability, when method is kmeans. See cStability and Haslbeck & Wulff (2020).
n_gap: integer specifying the number of simulated datasets used by gap. See Tibshirani et al. (2001).
na_rm: logical specifying whether trajectory points containing NAs should be removed. Removal is done column-wise. That is, if any trajectory has a missing value at, e.g., the 10th recorded position, the 10th position is removed for all trajectories. This is necessary to compute distance between trajectories.
verbose: logical indicating whether function should report its progress.
Returns
A list containing two lists that store the results of the different methods. kopt contains the estimated k for each of the methods specified in compute. paths contains the values for each k in kseq as computed by each of the methods specified in compute. The values in kopt are optima for each of the vectors in paths.
Details
mt_cluster_k estimates the number of clusters (k) using four commonly used k-selection methods (specified via compute): cluster stability (stability), the gap statistic (gap), the jump statistic (jump), and the slope statistic (slope).
Cluster stability methods select k as the number of clusters for which the assignment of objects to clusters is most stable across bootstrap samples. This function implements the model-based and model-free methods described by Haslbeck & Wulff (2020). See references.
The remaining three methods select k as the value that optimizes the gap statistic (Tibshirani, Walther, & Hastie, 2001), the jump statistic (Sugar & James, 2013), and the slope statistic (Fujita, Takahashi, & Patriota, 2014), respectively.
For clustering trajectories, it is often useful that the endpoints of all trajectories share the same direction, e.g., that all trajectories end in the top-left corner of the coordinate system (mt_remap_symmetric or mt_align can be used to achieve this). Furthermore, it is recommended to use length normalized trajectories (see mt_length_normalize ; Wulff et al., 2019).
Haslbeck, J. M. B., & Wulff, D. U. (2020). Estimating the Number of Clusters via a Corrected Clustering Instability. Computational Statistics, 35, 1879–1894.
Wulff, D. U., Haslbeck, J. M. B., Kieslich, P. J., Henninger, F., & Schulte-Mecklenbeck, M. (2019). Mouse-tracking: Detecting types in movement trajectories. In M. Schulte-Mecklenbeck, A. Kühberger, & J. G. Johnson (Eds.), A Handbook of Process Tracing Methods (pp. 131-145). New York, NY: Routledge.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
Sugar, C. A., & James, G. M. (2013). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 750-763.
Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.
See Also
mt_distmat for more information about how the distance matrix is computed when the hclust method is used.
mt_cluster for performing trajectory clustering with a specified number of clusters.