purity function

Clustering Purity

Calculate the purity of the clustering results. For example, see Schaeffer et al. (2016).

purity(classes, clusters)

Arguments

  • classes: a vector with labels of true classes.
  • clusters: a vector with labels of assigned clusters for which purity is to be tested. Should be of the same length as classes.

Returns

A list with two elements:

  • pur: purity value.
  • out: table with $\min(K,J)$ = min(length(unique(classes)), length(unique(clusters))) rows and the following columns: ClassLabels, ClusterLabels, and ClusterSize.

Details

Following Manning et al. (2008), each cluster is assigned to the class which is most frequent in the cluster, then

$$Purity(\Omega, C) = \frac{1}{N}\sum_{k}\max_{j}|\omega_k \cap c_j|,$$

where $\Omega=\{\omega_1,\ldots,\omega_K\}$ is the set of identified clusters and $C=\{c_1,\ldots,c_J\}$ is the set of classes. That is, within each class $j=1,\ldots,J$ find the size of the most populous cluster from the $K-j$ unassigned clusters. Then, sum together the $\min(K,J)$ sizes found and divide by $N$, where $N$ = length(classes) = length(clusters).
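As a point of comparison, the displayed formula can be evaluated literally in base R. This is a minimal sketch, not the package's implementation: it takes the maximum over classes for every cluster, without the restriction to $\min(K,J)$ matched pairs described above. The label vectors are entered by hand (an assumption made for illustration) so that they reproduce the contingency table shown in Examples without relying on the random number generator.

```r
# Hand-entered labels matching the contingency table in Examples
# (the order of observations within each class is arbitrary).
classes  <- rep(c("A", "B", "C"), each = 5)
clusters <- c("b", "b", "b", "c", "e",   # class A
              "a", "d", "d", "e", "e",   # class B
              "a", "b", "b", "d", "d")   # class C

# Rows are clusters (omega_k), columns are classes (c_j).
tab <- table(clusters, classes)

# Literal formula: for each cluster take the largest overlap with any class,
# sum over all clusters, and divide by N.
purity_formula <- sum(apply(tab, 1, max)) / length(classes)
purity_formula
```

On these labels the literal formula gives 9/15 = 0.6, which is larger than the 0.4666667 reported in Examples, because here every one of the five clusters contributes rather than only the $\min(K,J)$ = 3 matched ones.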

If $\max_{j}|\omega_k\cap c_j|$ is not unique for some $j$, the cluster is assigned to the class for which the second maximum is the smallest, to maximize the purity (see Examples).
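The effect of the tie rule can be illustrated with a naive greedy matching in base R. This is a sketch under stated assumptions, not the package's code: the helper `naive_greedy` is hypothetical, processes classes in table order, and breaks ties by taking the first maximum. The labels are hand-entered to reproduce the contingency table from Examples.

```r
# Hand-entered labels matching the contingency table in Examples.
classes  <- rep(c("A", "B", "C"), each = 5)
clusters <- c("b", "b", "b", "c", "e",   # class A
              "a", "d", "d", "e", "e",   # class B
              "a", "b", "b", "d", "d")   # class C
tab <- table(classes, clusters)          # rows: classes, columns: clusters

# Hypothetical helper: match each class to its most populous unassigned
# cluster; on ties, which.max() simply keeps the first maximum.
naive_greedy <- function(tab) {
  free  <- colnames(tab)                 # clusters not yet matched
  total <- 0
  for (cl in rownames(tab)) {
    if (length(free) == 0) break
    counts <- tab[cl, free]
    best   <- free[which.max(counts)]
    total  <- total + max(counts)
    free   <- setdiff(free, best)
  }
  total / sum(tab)
}

naive <- naive_greedy(tab)               # B takes 'd', leaving C with 'a'

# Assignment from Examples: A-b, B-e, C-d (tie for class B resolved to 'e').
resolved <- (tab["A", "b"] + tab["B", "e"] + tab["C", "d"]) / sum(tab)
c(naive = naive, resolved = resolved)
```

With the first-maximum rule, class B takes cluster 'd' and class C is left with cluster 'a', giving 6/15 = 0.4. Resolving the tie in favor of 'e' keeps 'd' available for class C and gives 7/15, the value shown in Examples.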

The number of unique elements in classes and clusters may differ.

Examples

# Fix seed for reproducible simulations:
# RNGkind(sample.kind = "Rounding") #run this line to have same seed across R versions > R 3.6.0
set.seed(1)

##### Example 1
#Create some classes and cluster labels:
classes <- rep(LETTERS[1:3], each = 5)
clusters <- sample(letters[1:5], length(classes), replace = TRUE)

#From the table below:
# - cluster 'b' corresponds to class A;
# - either of the clusters 'd' and 'e' can correspond to class B,
#   however, 'e' should be chosen, because cluster 'd' also highly
#   intersects with Class C. Thus,
# - cluster 'd' corresponds to class C.
table(classes, clusters)
##       clusters
##classes a b c d e
##      A 0 3 1 0 1
##      B 1 0 0 2 2
##      C 1 2 0 2 0

#The function does this choice automatically:
purity(classes, clusters)

#Sample output:
##$pur
##[1] 0.4666667
##
##$out
##  ClassLabels ClusterLabels ClusterSize
##1           A             b           3
##2           B             e           2
##3           C             d           2

##### Example 2
#The labels can be also numeric:
classes <- rep(1:5, each = 3)
clusters <- sample(1:3, length(classes), replace = TRUE)
purity(classes, clusters)

References

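Manning CD, Raghavan P, Schütze H (2008). Introduction to Information Retrieval. Cambridge University Press.

Schaeffer ED, Testa JM, Gel YR, Lyubchich V (2016). On information criteria for dynamic spatio-temporal clustering. In: The 6th International Workshop on Climate Informatics: CI2016.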

Author(s)

Vyacheslav Lyubchich

  • Maintainer: Vyacheslav Lyubchich
  • License: GPL (>= 2)
  • Last published: 2023-03-21
