Calculate the purity of the clustering results. For example, see Schaeffer et al. (2016).
purity(classes, clusters)
Arguments
classes: a vector with labels of true classes.
clusters: a vector with labels of assigned clusters for which purity is to be tested. Should be of the same length as classes.
Returns
A list with two elements:
- pur: purity value.
- out: table with min(K,J) = min(length(unique(classes)), length(unique(clusters))) rows and the following columns: ClassLabels, ClusterLabels, and ClusterSize.
Details
Following Manning et al. (2008), each cluster is assigned to the class which is most frequent in the cluster, then
$$\mathrm{Purity}(\Omega, C) = \frac{1}{N}\sum_{k}\max_{j}|\omega_k \cap c_j|,$$
where $\Omega=\{\omega_1,\ldots,\omega_K\}$ is the set of identified clusters and $C=\{c_1,\ldots,c_J\}$ is the set of classes. That is, within each class $j=1,\ldots,J$, find the size of the most populous cluster among the $K-j+1$ clusters not yet assigned to a class. Then sum the $\min(K,J)$ sizes found and divide by $N$, where N = length(classes) = length(clusters).
If $\max_j|\omega_k \cap c_j|$ is not unique for some $j$, it is assigned to the class for which the second maximum is the smallest, to maximize the Purity (see 'Examples').
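As a worked instance of this tie-breaking rule, take the contingency table from Example 1 (N = 15). Class B overlaps equally (size 2) with clusters d and e. Comparing the two candidate assignments shows why e is preferred; the first line matches the sample output in the Examples.

$$
\begin{aligned}
A \to b,\; B \to e,\; C \to d &: \quad \frac{3+2+2}{15} = \frac{7}{15} \approx 0.467,\\
A \to b,\; B \to d,\; C \to a &: \quad \frac{3+2+1}{15} = \frac{6}{15} = 0.4.
\end{aligned}
$$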
The number of unique elements in classes and clusters may differ.
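The one-to-one matching described above can be illustrated with a brute-force search in base R. This is only a sketch under stated assumptions, not the implementation used by purity(): the helper name best_one_to_one is hypothetical, the contingency table is typed in by hand from Example 1, and the sketch assumes there are at least as many clusters as classes.

```r
# Contingency table copied from Example 1 (rows: classes, columns: clusters)
tab <- matrix(c(0, 3, 1, 0, 1,
                1, 0, 0, 2, 2,
                1, 2, 0, 2, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(classes = c("A", "B", "C"),
                              clusters = c("a", "b", "c", "d", "e")))

# Hypothetical helper (not part of funtimes): try every one-to-one
# assignment of clusters to classes and keep the one with the largest overlap.
best_one_to_one <- function(tab) {
  J <- nrow(tab)  # number of classes
  K <- ncol(tab)  # number of clusters
  stopifnot(J <= K)  # assumption made for this sketch only
  # All ways to pick one cluster index per class
  cand <- as.matrix(expand.grid(rep(list(seq_len(K)), J)))
  # Keep only assignments that use each cluster at most once
  cand <- cand[apply(cand, 1, function(x) !anyDuplicated(x)), , drop = FALSE]
  # Total overlap (sum of matched cell counts) for each assignment
  score <- apply(cand, 1, function(x) sum(tab[cbind(seq_len(J), x)]))
  best <- cand[which.max(score), ]
  list(pur = max(score) / sum(tab),
       clusters = colnames(tab)[best])
}

res <- best_one_to_one(tab)
res$pur       # 7/15
res$clusters  # clusters matched to classes A, B, C, in that order
```

For this table the search matches A, B, C with b, e, d, which agrees with the sample output in the Examples. Exhaustive search grows quickly with the number of labels, so it is useful only for checking small cases by hand.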
Examples
# Fix seed for reproducible simulations:
# RNGkind(sample.kind = "Rounding") #run this line to have same seed across R versions > R 3.6.0
set.seed(1)

##### Example 1
#Create some classes and cluster labels:
classes <- rep(LETTERS[1:3], each = 5)
clusters <- sample(letters[1:5], length(classes), replace = TRUE)

#From the table below:
# - cluster 'b' corresponds to class A;
# - either of the clusters 'd' and 'e' can correspond to class B,
#   however, 'e' should be chosen, because cluster 'd' also highly
#   intersects with Class C. Thus,
# - cluster 'd' corresponds to class C.
table(classes, clusters)
##       clusters
##classes a b c d e
##      A 0 3 1 0 1
##      B 1 0 0 2 2
##      C 1 2 0 2 0

#The function does this choice automatically:
purity(classes, clusters)

#Sample output:
##$pur
##[1] 0.4666667
##
##$out
##  ClassLabels ClusterLabels ClusterSize
##1           A             b           3
##2           B             e           2
##3           C             d           2

##### Example 2
#The labels can be also numeric:
classes <- rep(1:5, each = 3)
clusters <- sample(1:3, length(classes), replace = TRUE)
purity(classes, clusters)