discdd.misclass() R function from [dad]

Misclassification ratio in functional discriminant analysis of discrete probability distributions.

Computes the leave-one-out misclassification ratio of the rule that assigns $T$ groups of individuals, one group after another, to one of $K$ classes of groups. Each group is assigned to the class that minimizes the distance or divergence between the probability distribution associated with the group and the $K$ probability distributions associated with the $K$ classes.


discdd.misclass(xf, class.var, distance =  c("l1", "l2", "chisqsym", "hellinger",
           "jeffreys", "jensen", "lp"), crit = 1, p)

Arguments

xf: object of class folderh with two data frames or list of arrays (or tables).
- If it is a folderh:
  - The first data.frame has at least two columns. One column contains the names of the $T$ groups (all the names must be different). Another column is a factor with $K$ levels partitioning the $T$ groups into $K$ classes.
  - The second one has $(q+1)$ columns. The first $q$ columns are factors (otherwise, they are coerced into factors). The last column is a factor with $T$ levels defining $T$ groups. Each group, say $t$ , consists of $n_t$ individuals.
- If it is a list of arrays or tables, the $t^{th}$ element ( $t = 1, \ldots, T$ ) is the table of the joint distribution (absolute or relative frequencies) of the $t^{th}$ group. All the arrays (or tables) xf[[i]] must have:
  - the same dimension(s). If $q = 1$ (univariate), dim(xf[[i]]) is a single integer. If $q > 1$ (multivariate), dim(xf[[i]]) is an integer vector of length $q$.
  - the same dimension names dimnames(xf[[i]]), which must be non-NULL. These dimnames are the names of the variables.
class.var: string (if xf is an object of class "folderh") or data.frame with at least two columns (if xf is a list of arrays).
- If xf is of class "folderh", class.var is the name of the class variable.
- If xf is a list of arrays or a list of tables, class.var is a data.frame with at least two columns named "group" and "class". The "group" column contains the names of the $T$ groups (all the names must be different). The "class" column is a factor with $K$ levels partitioning the $T$ groups into $K$ classes.
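
As a minimal sketch of the list-of-tables input form, the following base R code builds three toy groups described by one categorical variable, together with a matching class.var data frame. The group names (g1, g2, g3), the variable name colour and the class labels are invented for illustration, and naming the list elements after the groups is an assumption rather than a documented requirement.

```r
# Toy data: three groups, one categorical variable (q = 1).
# All names below are hypothetical.
lev <- c("red", "green", "blue")
obs <- list(
  g1 = c("red", "red", "green", "blue"),
  g2 = c("green", "green", "blue"),
  g3 = c("blue", "blue", "blue", "red", "green")
)

# One table of absolute frequencies per group. Sharing the factor levels
# gives every table the same dimension and the same dimnames.
xf <- lapply(obs, function(v) table(colour = factor(v, levels = lev)))

# class.var for the list form: columns "group" and "class".
classes <- data.frame(
  group = names(xf),
  class = factor(c("A", "A", "B"))
)

sapply(xf, dim)            # same dimension for every table
names(dimnames(xf[[1]]))   # name of the variable
```

Fixing the levels before tabulating is what keeps the tables aligned: a level absent from a group then appears with a zero count instead of being dropped.
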
distance: The distance or dissimilarity used to compute the distance matrix between the densities. It can be:
- "l1" (default) the $L^p$ distance with $p = 1$
- "l2" the $L^p$ distance with $p = 2$
- "chisqsym" the symmetric Chi-squared distance
- "hellinger" the Hellinger metric (Matusita distance)
- "jeffreys" Jeffreys distance (symmetrised Kullback-Leibler divergence)
- "jensen" the Jensen-Shannon distance
- "lp" the $L^p$ distance with $p$ given by the argument p of the function.
crit: 1 or 2. Selects how the probability distributions associated with the classes are estimated. See Details.
p: integer. Optional. When distance = "lp" ( $L^p$ distance with $p>2$ ), p is the parameter of the distance.
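
To make some of these options concrete, here is a small base R sketch that evaluates a few of the distances on two toy distributions. The formulas are the usual textbook forms and are an assumption: the package may use different constants or scalings. The helper names lp_dist, hellinger and jeffreys are hypothetical and are not functions of the package. The symmetric Chi-squared and Jensen-Shannon distances are left out because their conventions vary between references.

```r
# Two discrete distributions on the same three cells (relative frequencies).
f <- c(0.5, 0.3, 0.2)
g <- c(0.25, 0.25, 0.5)

# Textbook forms (assumed, not taken from the package source).
lp_dist   <- function(f, g, p) sum(abs(f - g)^p)^(1 / p)
hellinger <- function(f, g) sqrt(sum((sqrt(f) - sqrt(g))^2))
jeffreys  <- function(f, g) sum((f - g) * log(f / g))  # needs f, g > 0

lp_dist(f, g, 1)   # L^p distance with p = 1
lp_dist(f, g, 2)   # L^p distance with p = 2
lp_dist(f, g, 3)   # L^p distance with p = 3
hellinger(f, g)
jeffreys(f, g)
```
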

Details

If xf is an object of class "folderh" containing the data:

The $T$ probability distributions $f_t$ corresponding to the $T$ groups of individuals are estimated by frequency distributions within each group.

The class $k$, consisting of $T_k$ groups, is associated with the probability distribution $g_k$. With the leave-one-out method, the group being assigned is excluded from its class $k$ when $g_k$ is estimated. The crit argument selects the estimation method for the $g_k$'s.
- crit=1
  
  The probability distribution $g_k$ is estimated using all the data of this class, that is, the rows of the second data frame of xf corresponding to the $T_k$ groups of the class $k$.
  
  The estimation of the $g_k$ 's uses the same method as the estimation of the $f_t$ 's.
- crit=2
  
  The $T_k$ probability distributions $f_t$ are estimated using the corresponding data from xf. They are then averaged to obtain an estimate of $g_k$, that is $g_k = (1/T_k)\sum f_t$ .
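
The difference between the two criteria shows up when the groups of a class have unequal sizes. The sketch below uses a hypothetical class made of two groups and one binary variable; the data frame x and its column names are invented for illustration, and the leave-one-out step is omitted to keep the comparison short.

```r
# Toy class k made of two groups of unequal size (hypothetical data).
lev <- c("a", "b")
x <- data.frame(
  v     = factor(c("a", "a", "a", "b", "b", "b"), levels = lev),
  group = factor(c("g1", "g1", "g1", "g1", "g2", "g2"))
)

# f_t: frequency distribution within each group.
f <- lapply(split(x$v, x$group), function(v) prop.table(table(v)))

# crit = 1: pool all rows of the class, then estimate as for the f_t.
g_pooled <- prop.table(table(x$v))

# crit = 2: average the per-group distributions, g_k = (1/T_k) sum f_t.
g_avg <- Reduce(`+`, f) / length(f)

g_pooled   # a: 0.5,   b: 0.5
g_avg      # a: 0.375, b: 0.625
```

Pooling weights each group by its number of individuals, whereas averaging gives every group the same weight.
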

If xf is a list of arrays (or list of tables):

The $t^{th}$ array is the joint frequency distribution of the $t^{th}$ group. The frequencies can be absolute or relative.

The class $k$, consisting of $T_k$ groups, is associated with the probability distribution $g_k$. With the leave-one-out method, the group being assigned is excluded from its class $k$ when $g_k$ is estimated. The crit argument selects the estimation method for the $g_k$'s.
- crit=1
  
  $g_k = (1/\sum n_t) \sum n_t f_t$ , where $n_t$ is the total of xf[[t]].
  
  Note that when every xf[[t]] contains relative frequencies, each total $n_t$ is 1, so crit=1 is equivalent to crit=2.
- crit=2
  
  $g_k = (1/T_k)\sum f_t$ .
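
The whole procedure for the list form can be sketched in a few lines of base R. This is an illustration of the rule described above under stated assumptions, not the package's implementation: the frequency vectors, the class labels and the helper names class_dist and l1 are hypothetical, and the $L^1$ distance is taken as the plain sum of absolute differences.

```r
# Hypothetical absolute-frequency tables for four groups in two classes.
xf <- list(
  g1 = c(a = 8, b = 2),
  g2 = c(a = 7, b = 3),
  g3 = c(a = 1, b = 9),
  g4 = c(a = 4, b = 16)
)
cls <- c(g1 = "A", g2 = "A", g3 = "B", g4 = "B")

f <- lapply(xf, function(tab) tab / sum(tab))   # f_t
n <- sapply(xf, sum)                            # n_t

# g_k estimated without the group being assigned (leave-one-out).
# crit = 1 weights each f_t by n_t, crit = 2 uses equal weights.
class_dist <- function(k, left_out, crit) {
  members <- setdiff(names(cls)[cls == k], left_out)
  w <- if (crit == 1) n[members] else rep(1, length(members))
  Reduce(`+`, Map(`*`, f[members], w)) / sum(w)
}

l1 <- function(p, q) sum(abs(p - q))

# Assign each group to the class at minimum distance.
alloc <- sapply(names(xf), function(grp) {
  d <- sapply(unique(cls), function(k) l1(f[[grp]], class_dist(k, grp, crit = 1)))
  names(d)[which.min(d)]
})

mean(alloc != cls)   # misclassification ratio
```

In this toy case every class keeps at least one group after the left-out group is removed; a class reduced to a single group would need separate handling.
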

Returns

Returns an object of class discdd.misclass, that is a list including:

classification: data frame with 4 columns:
- factor giving the group name. The column name is the same as that of column ($q+1$) of the second data frame of xf,
- the prior class of the group if it is available, or NA if not,
- `alloc`: the class allocation computed by the discriminant analysis method,
- `misclassed`: boolean. `TRUE` if the group is misclassified, `FALSE` if it is correctly classified, `NA` if the prior class of the group is unknown.

confusion.mat: confusion matrix,
misalloc.per.class: the misclassification ratio per class,
misclassed: the misclassification ratio,
distances: matrix with $T$ rows and $K$ columns, of the distances ( $d_{tk}$ ): $d_{tk}$ is the distance between the group $t$ and the class $k$ ,
proximities: matrix of the proximity indices (in percent) between the groups and the classes. The proximity between the group $t$ and the class $k$ is: $(1/d_{tk})/\sum_{l=1}^{K}(1/d_{tl})$ .
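
As a worked illustration of the proximity formula, the sketch below uses hypothetical distances from one group to $K = 3$ classes. Multiplying by 100 is an assumption based on the indices being described as percentages.

```r
# Hypothetical distances d_tk from one group t to K = 3 classes.
d <- c(A = 0.2, B = 0.5, C = 1.0)

# Proximity: (1/d_tk) / sum_l (1/d_tl), expressed here in percent.
prox <- 100 * (1 / d) / sum(1 / d)
prox                     # A: 62.5, B: 25.0, C: 12.5

names(d)[which.min(d)]   # class at minimum distance: "A"
```

The class with the smallest distance always has the largest proximity, and the proximities of a group sum to 100 across the classes.
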

References

Rudrauf, J.M., Boumaza, R. (2001). Contribution à l'étude de l'architecture médiévale: les caractéristiques des pierres à bossage des châteaux forts alsaciens, Centre de Recherches Archéologiques médiévales de Saverne, 5, 5-38.

Author(s)

Rachid Boumaza, Pierre Santagostini, Smail Yousfi, Gilles Hunault, Sabine Demotes-Mainard

Examples


# Example 1 with a folderh obtained by converting numeric variables
data("castles.dated")
stones <- castles.dated$stones
periods <- castles.dated$periods
stones$height <- cut(stones$height, breaks = c(19, 27, 40, 71), include.lowest = TRUE)
stones$width <- cut(stones$width, breaks = c(24, 45, 62, 144), include.lowest = TRUE)
stones$edging <- cut(stones$edging, breaks = c(0, 3, 4, 8), include.lowest = TRUE)
stones$boss <- cut(stones$boss, breaks = c(0, 6, 9, 20), include.lowest = TRUE )

castlefh <- folderh(periods, "castle", stones)

# Default: distance = "l1", crit = 1
discdd.misclass(castlefh, "period")

# Hellinger distance, crit=2
discdd.misclass(castlefh, "period", distance = "hellinger", crit = 2)

# Example 2 with a list of 96 arrays
data("dspgd2015")
data("departments")
classes <- departments[, c("coded", "namer")]
names(classes) <- c("group", "class")

# Default: distance = "l1", crit = 1
discdd.misclass(dspgd2015, classes)

# Hellinger distance, crit=2
discdd.misclass(dspgd2015, classes, distance = "hellinger", crit = 2)


Maintainer: Pierre Santagostini
License: GPL (>= 2)
Last published: 2024-11-22