purity() R function from [funtimes]

Clustering Purity

Calculate the purity of the clustering results. For an example, see Schaeffer et al. (2016).


Usage

purity(classes, clusters)

Arguments

classes: a vector with labels of true classes.
clusters: a vector with labels of assigned clusters for which purity is to be tested. Should be of the same length as classes.

Returns

A list with two elements:

pur: purity value.
out: table with $\min(K,J)$ = min(length(unique(classes)), length(unique(clusters))) rows and the following columns: ClassLabels, ClusterLabels, and ClusterSize.

Details

Following Manning et al. (2008), each cluster is assigned to the class which is most frequent in the cluster, then

$$Purity(\Omega,C) = \frac{1}{N}\sum_{k}\max_{j}|\omega_k\cap c_j|,$$

where $\Omega=\{\omega_1,\ldots,\omega_K\}$ is the set of identified clusters and $C=\{c_1,\ldots,c_J\}$ is the set of classes. That is, within each class $j=1,\ldots,J$, find the size of the most populous cluster among the clusters not yet assigned to a class. Then sum the $\min(K,J)$ sizes found and divide by $N$, where $N$ = length(classes) = length(clusters).
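As a point of comparison, the displayed formula can be evaluated literally in base R. The sketch below is not the package code: `purity_formula` is a hypothetical helper, and the label vectors are reconstructed by hand from the contingency table printed in Example 1. Every cluster contributes its largest class overlap, so the sum has $K$ terms rather than $\min(K,J)$.

```r
# Hypothetical helper (not part of funtimes): purity exactly as in the
# displayed formula, where every cluster contributes its largest class overlap.
purity_formula <- function(classes, clusters) {
  stopifnot(length(classes) == length(clusters))
  tab <- table(clusters, classes)      # rows: clusters, columns: classes
  sum(apply(tab, 1, max)) / length(classes)
}

# Labels reconstructed from the contingency table printed in Example 1
classes  <- rep(c("A", "B", "C"), each = 5)
clusters <- c("b", "b", "b", "c", "e",   # members of class A
              "a", "d", "d", "e", "e",   # members of class B
              "a", "b", "b", "d", "d")   # members of class C

purity_formula(classes, clusters)
# 0.6, i.e. (1 + 3 + 1 + 2 + 2) / 15
```

The sample output in Examples reports 0.4666667 = (3 + 2 + 2)/15 for the same table, which is consistent with the function summing only the $\min(K,J)$ matched cluster sizes listed in `out`.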

If $\max_{j}|\omega_k\cap c_j|$ is not unique for some $j$, that is, several clusters tie for the same class, the class is matched with the cluster whose second maximum is the smallest, to maximize the $Purity$ (see Examples).
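One way to see why the tie-breaking matters is to treat the matching as a search for the assignment of distinct clusters to classes with the largest total overlap. The sketch below does this by exhaustive recursion, which is practical only for small $K$ and $J$. It is an illustration under that assumption, not the funtimes implementation, and `best_matching` is a hypothetical name. The table is copied from Example 1.

```r
# Hypothetical helper (not the funtimes implementation): exhaustive search for
# the one-to-one class-to-cluster matching with the largest total overlap.
best_matching <- function(tab) {
  # tab: contingency table with classes in rows and clusters in columns
  search <- function(i, free) {
    # i: index of the class being matched; free: cluster columns still unused
    if (i > nrow(tab) || length(free) == 0) return(0)
    # Option 1: leave class i unmatched (only useful when J > K)
    best <- search(i + 1, free)
    # Option 2: match class i with each unused cluster in turn
    for (k in free) {
      best <- max(best, tab[i, k] + search(i + 1, setdiff(free, k)))
    }
    best
  }
  search(1, seq_len(ncol(tab)))
}

# Contingency table from Example 1 (classes in rows, clusters in columns)
tab <- matrix(c(0, 3, 1, 0, 1,
                1, 0, 0, 2, 2,
                1, 2, 0, 2, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(classes = c("A", "B", "C"),
                              clusters = letters[1:5]))

best_matching(tab) / sum(tab)
# 0.4666667, i.e. (3 + 2 + 2) / 15
```

Matching class B with cluster 'd' instead of 'e' would leave class C with at most one member in an unused cluster, giving (3 + 2 + 1)/15 = 0.4, which is why the tie is resolved in favor of 'e'.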

The number of unique elements in classes and clusters may differ.

Examples


# Fix seed for reproducible simulations:
# RNGkind(sample.kind = "Rounding") #run this line to have same seed across R versions > R 3.6.0
set.seed(1)

##### Example 1
#Create some classes and cluster labels:
classes <- rep(LETTERS[1:3], each = 5)
clusters <- sample(letters[1:5], length(classes), replace = TRUE)

#From the table below:
# - cluster 'b' corresponds to class A;
# - clusters 'd' and 'e' both contain two members of class B, but 'e'
#   should be chosen because cluster 'd' also intersects heavily
#   with class C; thus,
# - cluster 'd' corresponds to class C.
table(classes, clusters)
##       clusters
##classes a b c d e
##      A 0 3 1 0 1
##      B 1 0 0 2 2
##      C 1 2 0 2 0

#The function does this choice automatically:
purity(classes, clusters)

#Sample output:
##$pur
##[1] 0.4666667
##
##$out
##  ClassLabels ClusterLabels ClusterSize
##1           A             b           3
##2           B             e           2
##3           C             d           2

##### Example 2
#The labels can also be numeric:
classes <- rep(1:5, each = 3)
clusters <- sample(1:3, length(classes), replace = TRUE)
purity(classes, clusters)

References

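Manning CD, Raghavan P, Schütze H (2008). Introduction to Information Retrieval. Cambridge University Press.

Schaeffer ED, Testa JM, Gel YR, Lyubchich V (2016). On information criteria for dynamic spatio-temporal clustering. In: The 6th International Workshop on Climate Informatics: CI2016.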

Author(s)

Vyacheslav Lyubchich

funtimes package

Maintainer: Vyacheslav Lyubchich
License: GPL (>= 2)
Last published: 2023-03-21
