Estimates Number of Clusters using up to 26 Indicators
Calculates up to 26 indicators and derives from each of them a recommendation for the number of clusters in a data set. For a given data set and clusterings of this data set, the indicators listed in the details are calculated, and a recommended number of clusters is given for each indicator.
An alternative estimate of the number of clusters can be obtained by counting the valleys of the topographic map of the generalized U-matrix for a specific projection method, using the ProjectionBasedClustering and GeneralizedUmatrix packages on CRAN; see [Thrun/Ultsch, 2021] for details.
DataOrDistances: Either a [1:n,1:d] matrix of the dataset to be clustered, consisting of n cases of d-dimensional data points, where every case has d attributes, variables or features,
or a symmetric [1:n,1:n] distance matrix
ClsMatrix: [1:n,1:(MaxClusterNo)] matrix of clusterings; each column is defined as:
1:n numerical vector of numbers defining the classification as the main output of the clustering algorithm for the n cases of data. It has k unique numbers representing the arbitrary labels of the clustering.
(see also details (2) and (3)); must be specified if Method = NULL
MaxClusterNo: Highest number of clusters to be checked
Method: Clustering method used to create the clusterings (see details (4) for possible methods); must be specified if ClsMatrix = NULL
ClusterIndex: String or vector of strings with the indicators to be calculated (see details (1)), default = "all"
MinClusterNo: Lowest number of clusters to be checked, default = 2
Silent: If TRUE status messages are output, default = FALSE
PlotIt: If TRUE, plots a fan plot with the proposed cluster numbers
SelectByABC: Only used if PlotIt=TRUE. TRUE: plots group A of the ABCanalysis, i.e., the most important ones (highest overlap in indicators); FALSE: plots all indicators
Colorsequence: Optional, character vector of colors of sufficient length for the fan plot. If the sequence is too long, the first part of the sequence is used.
...: Optional, further arguments passed to the clustering method if Method is set.
Each column of ClsMatrix has to have at least two unique clusters defined. Otherwise the function will stop.
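The requirement above can be checked before calling the function. The following is a minimal sketch in base R, assuming the clusterings come from hclust on the iris data; the variable names n_unique and valid are illustrative and not part of the package.

```r
# Illustrative pre-check, not part of the package:
# every column of ClsMatrix must contain at least two unique cluster labels.
data <- as.matrix(iris[, 1:4])
hc <- hclust(dist(data), method = "complete")

# One column per number of clusters k = 2, ..., 7
ClsMatrix <- sapply(2:7, function(k) cutree(hc, k))

# Number of unique labels in each column
n_unique <- apply(ClsMatrix, 2, function(cls) length(unique(cls)))

# TRUE if no column describes a single-cluster solution
valid <- all(n_unique >= 2)
```

A column with only one unique label would describe a single-cluster solution, for which the indicators cannot be compared across cluster numbers.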
(1) The following 26 indicators can be calculated: "ball", "beale", "calinski", "ccc", "cindex", "db", "duda", "dunn", "frey", "friedman", "hartigan", "kl", "marriot", "mcclain", "pseudot2", "ptbiserial", "ratkowsky", "rubin", "scott", "sdbw", "sdindex", "silhouette", "ssi", "tracew", "trcovw", "xuindex".
These can be specified individually or as a vector via the parameter ClusterIndex. If you enter "all", all indicators are calculated.
(2) The indicators kl, duda, pseudot2, beale, frey and mcclain require a clustering for MaxClusterNo+1 clusters. If these indicators are to be calculated, this clustering must be specified in ClsMatrix.
(3) The indicator kl requires a clustering for MinClusterNo-1 clusters. If this indicator is to be calculated, this clustering must also be specified in ClsMatrix. For the case MinClusterNo = 2, no clustering for 1 cluster has to be given.
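To illustrate how a single indicator leads to a recommended number of clusters, the sketch below computes the Calinski-Harabasz index by hand for several hierarchical clusterings and picks the number of clusters with the largest value. This is a hand-rolled illustration in base R, not the code used by the package; the helper calinski_harabasz and the variables ks, ch and best_k are hypothetical names.

```r
# Illustrative only: hand-rolled Calinski-Harabasz index, not the package's code.
data <- as.matrix(iris[, 1:4])
hc <- hclust(dist(data), method = "complete")

calinski_harabasz <- function(x, cls) {
  n <- nrow(x)
  k <- length(unique(cls))
  # Total sum of squares around the overall centroid
  total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  # Within-cluster sum of squares around each cluster centroid
  within_ss <- sum(vapply(split(as.data.frame(x), cls), function(part) {
    sum(scale(as.matrix(part), center = TRUE, scale = FALSE)^2)
  }, numeric(1)))
  between_ss <- total_ss - within_ss
  (between_ss / (k - 1)) / (within_ss / (n - k))
}

# Evaluate the index for 2 to 7 clusters
ks <- 2:7
ch <- vapply(ks, function(k) calinski_harabasz(data, cutree(hc, k)), numeric(1))
names(ch) <- ks

# For this index, larger values indicate a better clustering
best_k <- ks[which.max(ch)]
```

Other indicators use different decision rules (for example a minimum or a largest difference between consecutive values), which is why recommendations can differ between indicators.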
(4) The following methods can be used to create clusterings:
"kmeans", "DBSclustering", "DivisiveAnalysisClustering", "FannyClustering", "ModelBasedClustering", "SpectralClustering" or all methods found in HierarchicalClustering.
The indicators duda, pseudot2, beale and frey are only intended for use in hierarchical cluster procedures.
If a distance matrix is given, then ProjectionBasedClustering is required to be accessible.
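A symmetric [1:n,1:n] distance matrix of the kind accepted by DataOrDistances can be built in base R as sketched below. The choice of the Euclidean distance and the variable name Distances are assumptions made for illustration only.

```r
# Illustrative only: building a symmetric distance matrix from a data matrix.
data <- as.matrix(iris[, 1:4])

# dist() returns a compact "dist" object; as.matrix() expands it
# into a full [1:n,1:n] matrix with zeros on the diagonal.
Distances <- as.matrix(dist(data, method = "euclidean"))

# Basic sanity checks on shape and symmetry
dim(Distances)
isSymmetric(Distances)
```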
Indicators: A table of the calculated indicators except Duda, Pseudot2 and Beale
ClusterNo: The recommended number of clusters for each calculated indicator
ClsMatrix: [1:n,MinClusterNo:(MaxClusterNo)] Output of the clusterings used for the calculation
HierarchicalIndicators: Either NULL or the values for the indicators Duda, Pseudot2 and Beale in case of hierarchical cluster procedures, if calculated
Charrad, Malika, et al.: "Package 'NbClust'", J. Stat. Soft, Vol. 61, pp. 1-36, 2014.
[Thrun/Ultsch, 2021] Thrun, M. C., and Ultsch, A.: Swarm Intelligence for Self-Organized Clustering, Artificial Intelligence, Vol. 290, pp. 103237, doi:10.1016/j.artint.2020.103237, 2021.
Peter Nahrgang, revised by Michael Thrun (2021)
Code of "calinski", "cindex", "db", "hartigan", "ratkowsky", "scott", "marriot", "ball", "trcovw", "tracew", "friedman", "rubin", "ssi" of package cclust is adapted for the purpose of this function.
Colorsequence works if DataVisualizations 1.1.13 is installed (currently only available on GitHub).
# Reading the iris dataset from the standard R package datasets
data <- as.matrix(iris[, 1:4])
MaxClusterNo <- 7

# Creating the clusterings for the data set
# (here with method complete) for the number of clusters 2 to 8
hc <- hclust(dist(data), method = "complete")
clsm <- matrix(data = 0, nrow = dim(data)[1], ncol = MaxClusterNo)
for (i in 2:(MaxClusterNo + 1)) {
  clsm[, i - 1] <- cutree(hc, i)
}

# Calculation of all indicators and recommendations for the number of clusters
indicatorsList <- ClusterNoEstimation(DataOrDistances = data, ClsMatrix = clsm,
                                      MaxClusterNo = MaxClusterNo)

# Alternatively, the same calculation as above can be executed with the following call.
# In this variant, the function ClusterNoEstimation also takes over the clustering.
ClusterNoEstimation(DataOrDistances = data, MaxClusterNo = 7, Method = "CompleteL")