discdd.predict function

Predicting the class of a group of individuals with discriminant analysis of probability distributions.

Assigns several groups of individuals, one group after another, to the class of groups (among $K$ classes of groups) which achieves the minimum of the distances or divergences between the probability distribution associated with the group to assign and the $K$ probability distributions associated with the $K$ classes.

discdd.predict(xf, class.var, distance = c("l1", "l2", "chisqsym", "hellinger", "jeffreys", "jensen", "lp"), crit = 1, misclass.ratio = FALSE, p)

Arguments

  • xf: object of class folderh with two data frames or list of arrays (or tables).

    • If it is a folderh:

      • The first data.frame has at least two columns. One column contains the names of the $T$ groups (all the names must be different). Another column is a factor with $K$ levels partitioning the $T$ groups into $K$ classes.
      • The second one has $(q+1)$ columns. The first $q$ columns are factors (otherwise, they are coerced into factors). The last column is a factor with $T$ levels defining $T$ groups. Each group, say $t$, consists of $n_t$ individuals.
    • If it is a list of arrays or tables, the $t^{th}$ element ($t = 1, \ldots, T$) is the table of the joint distribution (absolute or relative frequencies) of the $t^{th}$ group. These arrays have the same shape:

      Each array (or table) xf[[i]] has:

      • the same dimension(s). If $q = 1$ (univariate), dim(xf[[i]]) is an integer. If $q > 1$ (multivariate), dim(xf[[i]]) is an integer vector of length $q$.
      • the same dimension names dimnames(xf[[i]]), which must be non-NULL. These dimnames are the names of the variables.
  • class.var: string (if xf is an object of class "folderh") or data.frame with two columns (if xf is a list of arrays).

    • If xf is of class "folderh", class.var is the name of the class variable.
    • If xf is a list of arrays or a list of tables, class.var is a data.frame with at least two columns named "group" and "class". The "group" column contains the names of the $T$ groups (all the names must be different). The "class" column is a factor with $K$ levels partitioning the $T$ groups into $K$ classes.
  • distance: The distance or dissimilarity used to compute the distance matrix between the densities. It can be:

    • "l1" (default) the $L^p$ distance with $p = 1$
    • "l2" the $L^p$ distance with $p = 2$
    • "chisqsym" the symmetric Chi-squared distance
    • "hellinger" the Hellinger metric (Matusita distance)
    • "jeffreys" Jeffreys distance (symmetrised Kullback-Leibler divergence)
    • "jensen" the Jensen-Shannon distance
    • "lp" the $L^p$ distance with $p$ given by the argument p of the function.
  • crit: 1 or 2. Selects the method used to estimate the densities associated with the classes. See Details.

  • misclass.ratio: logical (default FALSE). If TRUE, the confusion matrix and misclassification ratio are computed on the groups whose prior class is known. To compute the misclassification ratio by the leave-one-out method, use the discdd.misclass function.

  • p: integer. Optional. When distance = "lp" ($L^p$ distance with $p > 2$), p is the parameter of the distance.
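
As an illustration of the list-of-tables form of xf and the matching class.var data frame described in the arguments above, here is a minimal base R sketch. The toy variables (`size`, `shape`), the group names, and the helper `make_group` are hypothetical, and using NA for a group whose class is unknown is an assumption based on the description of `class.known` in the returned value.

```r
# Hypothetical toy data: 4 groups, 2 categorical variables (q = 2).
set.seed(1)
make_group <- function(n) {
  data.frame(
    size  = factor(sample(c("small", "large"), n, replace = TRUE),
                   levels = c("small", "large")),
    shape = factor(sample(c("round", "square"), n, replace = TRUE),
                   levels = c("round", "square"))
  )
}
group_sizes <- c(g1 = 30, g2 = 25, g3 = 40, g4 = 20)

# One joint frequency table per group. Fixing the factor levels
# guarantees identical dim() and dimnames() across all groups.
xf <- lapply(group_sizes, function(n) table(make_group(n)))

# class.var: one row per group, with columns "group" and "class".
# Assumption: NA marks a group whose class is unknown.
class.var <- data.frame(
  group = names(xf),
  class = factor(c("A", "A", "B", NA))
)

# Checks mirroring the requirements stated in the argument list.
stopifnot(all(sapply(xf, function(a) identical(dim(a), dim(xf[[1]])))))
stopifnot(all(sapply(xf, function(a) identical(dimnames(a), dimnames(xf[[1]])))))
stopifnot(!anyDuplicated(class.var$group))
```

With the package attached, these two objects would presumably be passed as `discdd.predict(xf, class.var)`.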

Details

  • If xf is an object of class "folderh" containing the data:

    The $T$ probability distributions $f_t$ corresponding to the $T$ groups of individuals are estimated by frequency distributions within each group.

    To the class $k$ consisting of $T_k$ groups is associated the probability distribution $g_k$. The crit argument selects the estimation method of the $g_k$'s.

    • crit=1

      The probability distribution $g_k$ is estimated using the whole data of this class, that is, the rows of xf corresponding to the $T_k$ groups of the class $k$.

      The estimation of the $g_k$'s uses the same method as the estimation of the $f_t$'s.

    • crit=2

      The $T_k$ probability distributions $f_t$ are estimated using the corresponding data from xf. Then they are averaged to obtain an estimation of the density $g_k$, that is $g_k = (1/T_k) \sum f_t$.

  • If xf is a list of arrays (or list of tables):

    The $t^{th}$ array is the joint frequency distribution of the $t^{th}$ group. The frequencies can be absolute or relative.

    To the class $k$ consisting of $T_k$ groups is associated the probability distribution $g_k$. The crit argument selects the estimation method of the $g_k$'s.

    • crit=1

      $g_k = (1/\sum n_t) \sum n_t f_t$, where $n_t$ is the total of xf[[t]].

      Notice that when xf[[t]] contains relative frequencies, its total is 1. In that case, crit=1 is equivalent to crit=2.

    • crit=2

      $g_k = (1/T_k) \sum f_t$.
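
The two estimators of $g_k$ described above can be compared on a small worked case. The following base R sketch uses made-up counts for one class made of two univariate groups, stored as plain named vectors for brevity. It follows the formulas given in this section and is not the package's internal code.

```r
# Hypothetical class k made of T_k = 2 groups (absolute frequencies).
groups_k <- list(
  c(a = 8,  b = 2),    # n_1 = 10
  c(a = 10, b = 30)    # n_2 = 40
)

# Relative frequency distribution f_t of each group.
f <- lapply(groups_k, function(tab) tab / sum(tab))

# crit = 1: average of the f_t weighted by n_t,
# i.e. pooled counts divided by the total count.
n <- sapply(groups_k, sum)
g_crit1 <- Reduce(`+`, groups_k) / sum(n)

# crit = 2: unweighted average of the f_t.
g_crit2 <- Reduce(`+`, f) / length(f)

# With relative frequencies as input, every total n_t equals 1,
# so the crit = 1 formula reduces to the crit = 2 formula.
g_rel <- Reduce(`+`, f) / sum(sapply(f, sum))

print(g_crit1)
print(g_crit2)
print(all.equal(g_rel, g_crit2))
```

Here the larger second group dominates the crit=1 estimate, while crit=2 gives both groups the same weight.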

Returns

Returns an object of class discdd.predict, that is a list including:

  • prediction: data frame with 3 columns:

    • a factor giving the group name. The column name is the same as that of the column ($q+1$) of the second data frame of `xf`,
    • `class.known`: the prior class of the group if it is available, or NA if not,
    • `class.predict`: the class allocation predicted by the discriminant analysis method. If `misclass.ratio = TRUE`, the class allocations are computed for all groups. Otherwise (default), they are computed only for the groups whose class is unknown.
  • distances: matrix with $T$ rows and $K$ columns, of the distances ($d_{tk}$): $d_{tk}$ is the distance between the group $t$ and the class $k$, computed with the measure given by the argument distance,

  • proximities: matrix of the proximities (in percent). The proximity of a group $t$ to the class $k$ is computed as $(1/d_{tk})/\sum_{l=1}^{l=K}(1/d_{tl})$.

  • confusion.mat: the confusion matrix (if misclass.ratio = TRUE)

  • misclassed: the misclassification ratio (if misclass.ratio = TRUE)
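
To make the relation between the distances, the proximities and the predicted class concrete, here is a minimal base R sketch on a made-up distance matrix. The values and the group and class names are hypothetical. Ties are resolved here in favour of the first class, which is an assumption of this sketch and not documented behaviour of the function.

```r
# Hypothetical distance matrix d[t, k]: 3 groups (rows), 2 classes (columns).
d <- matrix(c(0.2, 0.8,
              0.5, 0.5,
              0.9, 0.1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("g1", "g2", "g3"), c("A", "B")))

# Proximity of group t to class k, in percent:
# (1 / d_tk) / sum over l of (1 / d_tl).
inv <- 1 / d
proximities <- 100 * inv / rowSums(inv)

# Predicted class: the class at minimum distance from the group.
# which.min() returns the first minimum, so ties go to the first class.
class.predict <- colnames(d)[apply(d, 1, which.min)]

print(round(proximities, 1))
print(class.predict)
```

Each row of the proximities sums to 100, and the class with the smallest distance is also the one with the largest proximity. A distance of exactly zero would make this formula yield a non-finite value, so a sketch like this one would need a special case for it.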

References

Rudrauf, J.M., Boumaza, R. (2001). Contribution à l'étude de l'architecture médiévale: les caractéristiques des pierres à bossage des châteaux forts alsaciens, Centre de Recherches Archéologiques médiévales de Saverne, 5, 5-38.

Author(s)

Rachid Boumaza, Pierre Santagostini, Smail Yousfi, Gilles Hunault, Sabine Demotes-Mainard

Examples

    library(dad)

    data(castles.dated)
    data(castles.nondated)

    # Stack the dated and non-dated castles
    stones <- rbind(castles.dated$stones, castles.nondated$stones)
    periods <- rbind(castles.dated$periods, castles.nondated$periods)

    # Discretize the numeric variables into factors
    stones$height <- cut(stones$height, breaks = c(19, 27, 40, 71), include.lowest = TRUE)
    stones$width <- cut(stones$width, breaks = c(24, 45, 62, 144), include.lowest = TRUE)
    stones$edging <- cut(stones$edging, breaks = c(0, 3, 4, 8), include.lowest = TRUE)
    stones$boss <- cut(stones$boss, breaks = c(0, 6, 9, 20), include.lowest = TRUE)

    castlesfh <- folderh(periods, "castle", stones)

    # Default: distance = "l1", crit = 1
    discdd.predict(castlesfh, "period")

    # With the calculation of the confusion matrix and misclassification ratio
    discdd.predict(castlesfh, "period", misclass.ratio = TRUE)

    # Hellinger distance
    discdd.predict(castlesfh, "period", distance = "hellinger")

    # crit = 2
    discdd.predict(castlesfh, "period", crit = 2)

  • Maintainer: Pierre Santagostini
  • License: GPL (>= 2)
  • Last published: 2024-11-22