kmeansClustering function

K-Means Clustering

Perform k-means clustering on a data matrix.

kmeansClustering(DataOrDistances, ClusterNo, Type = 'LBG', RandomNo = 5000, CategoricalData, PlotIt = FALSE, Verbose = FALSE, ...)

Arguments

  • DataOrDistances: Either a nonsymmetric [1:n,1:d] data matrix of n cases and d numerical features, or a symmetric [1:n,1:n] distance matrix

  • ClusterNo: A number k which defines k different clusters to be built by the algorithm.

  • Type: Choice of k-means algorithm, currently one of "Hartigan" [Hartigan/Wong, 1979], "LBG" [Linde et al., 1980], "Sparse" (sparse k-means proposed in [Witten/Tibshirani, 2010]), "Steinley" (best method of [Steinley/Brusco, 2007], proposed in Steinley 2003), "Lloyd" [Lloyd, 1982], "Forgy" [Forgy, 1965], "MacQueen" [MacQueen, 1967], "kcentroids" [Leisch, 2006], "kprototypes" [Szepannek, 2018], "Pelleg-moore" [Pelleg & Moores,2000], "Elkan" [Elkan, 2003], "kmeans++" [Arthur & Vassilvitskii], "Hamerly" [Hamerly, 2010], "Dualtree" or "Dualtree-covertree" [Curtin, 2017]

  • RandomNo: Only for "Steinley" or in case of a distance matrix: number of random initializations, among which the one with minimal SSE is searched for, see [Steinley/Brusco, 2007]

  • CategoricalData: Only for "kprototypes": [1:n,1:m] matrix of categorical features

  • PlotIt: Default: FALSE. If TRUE, plots the first three dimensions of the dataset as three-dimensional data points colored according to the clustering stored in Cls

  • Verbose: Default: FALSE. If TRUE, prints details

  • ...: Further arguments such as iter.max and nstart. For "kcentroids", please see the kcca function of the flexclust package, or KMeansSparseCluster
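The two input forms accepted by DataOrDistances can be told apart by shape and symmetry. The following is a minimal sketch in base R, assuming a Euclidean distance matrix built with stats::dist; the variable names toy and dmat are illustrative, and how kmeansClustering itself distinguishes the two cases is not stated here.

```r
# Sketch only: toy and dmat are illustrative names, not part of the package.
set.seed(1)
toy <- matrix(rnorm(20), ncol = 2)      # [1:10, 1:2] data matrix: 10 cases, 2 features

# Full square matrix of pairwise Euclidean distances
dmat <- as.matrix(stats::dist(toy))     # [1:10, 1:10]

# A distance matrix is square and symmetric; a data matrix generally is not
isSymmetric(unname(dmat))
isSymmetric(toy)
```

Either toy or dmat would then be a valid shape for the first argument, with RandomNo applying only in the distance-matrix case or for "Steinley".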

Details

Uses either the stats package function 'kmeans', the cclust package implementation, the flexclust package implementation, or own code. In case of a distance matrix, RandomNo should be significantly lower than 5000; otherwise a long computation time is to be expected.
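To see why the number of random initializations drives the computation time, it may help to picture the search for minimal SSE as a loop of complete k-means runs. The sketch below uses base R's stats::kmeans on a data matrix; best_of_random_starts and n_starts are hypothetical names, and this is not the package's own implementation.

```r
# Sketch only: best_of_random_starts and n_starts are hypothetical names,
# not part of kmeansClustering.
best_of_random_starts <- function(data, k, n_starts = 10) {
  best <- NULL
  for (i in seq_len(n_starts)) {
    # Each call starts from k randomly chosen rows as initial centers
    fit <- stats::kmeans(data, centers = k, nstart = 1)
    # Keep the run with the smallest total within-cluster sum of squares (SSE)
    if (is.null(best) || fit$tot.withinss < best$tot.withinss) {
      best <- fit
    }
  }
  best
}

set.seed(1)
# Two well-separated Gaussian blobs as toy data (100 cases, 2 features)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))
fit <- best_of_random_starts(toy, k = 2, n_starts = 10)
fit$tot.withinss
```

Every additional start costs one full clustering run, so the total cost grows linearly with the number of starts.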

Returns

List V of

  • Cls: [1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering.

  • Object: Object of the clustering algorithm used, if existent; otherwise the list contains instead

  • SumDistsToCentroids: Vector of within-cluster sum of squares, one component per cluster

  • Centroids: the final cluster centers.
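The relation between a label vector, the centroids and a per-cluster within-cluster sum of squares can be sketched in base R as follows. The helper within_ss is hypothetical and assumes that row j of the centroid matrix belongs to the j-th smallest label; stats::kmeans stands in for the clustering, so this illustrates the quantity described above rather than the package's code.

```r
# Sketch only: within_ss is a hypothetical helper, not a function of the package.
# Assumes row j of `centroids` corresponds to the j-th smallest label in `cls`.
within_ss <- function(data, cls, centroids) {
  labels <- sort(unique(cls))
  sapply(seq_along(labels), function(j) {
    members <- data[cls == labels[j], , drop = FALSE]
    # Subtract the j-th centroid from every member row, then sum the squares
    sum(sweep(members, 2, centroids[j, ])^2)
  })
}

set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))
fit <- stats::kmeans(toy, centers = 2)

# One component per cluster, comparable to fit$withinss
ss <- within_ss(toy, fit$cluster, fit$centers)
all.equal(ss, fit$withinss)
```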

References

[Hartigan/Wong, 1979] Hartigan, J. A., & Wong, M. A.: Algorithm AS 136: A k-means clustering algorithm, Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 28(1), pp. 100-108. 1979.

[Linde et al., 1980] Linde, Y., Buzo, A., & Gray, R.: An algorithm for vector quantizer design, IEEE Transactions on communications, Vol. 28(1), pp. 84-95. 1980.

[Steinley/Brusco, 2007] Steinley, D., & Brusco, M. J.: Initializing k-means batch clustering: A critical evaluation of several techniques, Journal of Classification, Vol. 24(1), pp. 99-121. 2007.

[Forgy, 1965] Forgy, E. W.: Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics, Vol. 21, pp. 768-769. 1965.

[MacQueen, 1967] MacQueen, J.: Some methods for classification and analysis of multivariate observations, Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, pp. 281-297, Oakland, CA, USA, 1967.

[Pelleg & Moores,2000] Pelleg, D., & Moore, A. W.: X-means: Extending k-means with efficient estimation of the number of clusters, ICML, Vol. 1, 2000.

[Elkan, 2003] Elkan, C.: Using the triangle inequality to accelerate k-means, In Tom Fawcett and Nina Mishra, editors, ICML, Vol. 3, pp. 147-153. AAAI Press, 2003.

[Lloyd, 1982] Lloyd, S.: Least squares quantization in PCM, IEEE transactions on information theory, Vol. 28(2), pp. 129-137. 1982.

[Leisch, 2006] Leisch, F.: A toolbox for k-centroids cluster analysis, Computational Statistics & Data Analysis, Vol. 51(2), pp. 526-544. 2006.

[Arthur & Vassilvitskii] Arthur, D., & Vassilvitskii, S.: k-means++: The advantages of careful seeding, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.

[Witten/Tibshirani, 2010] Witten, D. and Tibshirani, R.: A Framework for Feature Selection in Clustering. Journal of the American Statistical Association, Vol. 105(490), pp. 713-726, 2010.

[Hamerly, 2010] Hamerly, Greg: Making k-means even faster, Proceedings of the 2010 SIAM international conference on data mining, Society for Industrial and Applied Mathematics, pp. 130-140, 2010.

[Szepannek, 2018] Szepannek, G.: clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal, Vol. 10/2, pp. 200-208, doi:10.32614/RJ-2018-048, 2018.

[Curtin, 2017] Curtin, Ryan R: A dual-tree algorithm for fast k-means clustering with large k, Proceedings of the 2017 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2017.

Examples

data('Hepta')
out=kmeansClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE)

data('Leukemia')
# As expected does not perform well
# For non-spherical cluster structures:
out=kmeansClustering(Leukemia$DistanceMatrix,ClusterNo=6,RandomNo =10,PlotIt=TRUE)

data('Hepta')
out=kmeansClustering(Hepta$Data,ClusterNo=7,PlotIt=FALSE,Type="Steinley")

data('Hepta')
out=kmeansClustering(Hepta$Data,ClusterNo = 7, Type = "kprototypes",CategoricalData = as.matrix(Hepta$Cls))

Note

The version using a distance matrix is still in the test phase and not yet verified.

Author(s)

Michael Thrun

  • Maintainer: Michael Thrun
  • License: GPL-3
  • Last published: 2023-10-19