Categorical function

Categorical distribution

Probability mass function, distribution function, quantile function and random generation for the categorical distribution.

Usage

dcat(x, prob, log = FALSE)
pcat(q, prob, lower.tail = TRUE, log.p = FALSE)
qcat(p, prob, lower.tail = TRUE, log.p = FALSE, labels)
rcat(n, prob, labels)
rcatlp(n, log_prob, labels)

Arguments

  • x, q: vector of quantiles.

  • prob, log_prob: vector of length m, or m-column matrix of non-negative weights (or their logarithms in log_prob).

  • log, log.p: logical; if TRUE, probabilities p are given as log(p).

  • lower.tail: logical; if TRUE (default), probabilities are $P[X \le x]$; otherwise, $P[X > x]$.

  • p: vector of probabilities.

  • labels: if provided, a labeled factor vector is returned. The number of labels must equal the number of categories (the number of columns in prob).

  • n: number of observations. If length(n) > 1, the length is taken to be the number required.

Details

Probability mass function

$$\Pr(X = k) = \frac{w_k}{\sum_{j=1}^m w_j}$$
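As a minimal sketch of the formula above in base R, the hypothetical helper `dcat_sketch` normalizes a weight vector and looks up the mass at each `k`. It is an illustration only, not the package's `dcat`; returning zero for values outside 1, ..., m or for non-integer values is an assumption made here.

```r
# Hypothetical helper illustrating Pr(X = k) = w[k] / sum(w).
# Not the package's dcat(); the handling of invalid k is an assumption.
dcat_sketch <- function(k, w) {
  stopifnot(is.numeric(w), all(w >= 0), sum(w) > 0)
  p <- w / sum(w)                      # normalize weights to probabilities
  m <- length(p)
  valid <- !is.na(k) & k >= 1 & k <= m & k == floor(k)
  out <- numeric(length(k))            # zero mass outside the support
  out[valid] <- p[k[valid]]
  out
}

w <- c(2, 4, 3, 1)                     # unnormalized weights
dcat_sketch(1:4, w)                    # masses proportional to w
```

Because the weights are divided by their sum, `c(2, 4, 3, 1)` and `c(0.2, 0.4, 0.3, 0.1)` describe the same distribution in this sketch.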

Cumulative distribution function

$$\Pr(X \le k) = \frac{\sum_{i=1}^k w_i}{\sum_{j=1}^m w_j}$$
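The cumulative distribution function above can be sketched in base R with `cumsum`. The helper name `pcat_sketch` is hypothetical and this is not the package's `pcat`; treating the CDF as a step function that is 0 below 1 and 1 at or above m is an assumption consistent with the formula.

```r
# Hypothetical helper illustrating Pr(X <= k) = sum(w[1:k]) / sum(w).
# Not the package's pcat(); the behavior outside 1..m is an assumption.
pcat_sketch <- function(q, w, lower.tail = TRUE) {
  stopifnot(is.numeric(w), all(w >= 0), sum(w) > 0)
  cdf <- cumsum(w) / sum(w)            # cdf[k] = Pr(X <= k)
  m <- length(w)
  k <- pmin(floor(q), m)               # step function: constant between integers
  out <- ifelse(k < 1, 0, cdf[pmax(k, 1)])
  if (lower.tail) out else 1 - out     # upper tail is Pr(X > q)
}

w <- c(2, 4, 3, 1)
pcat_sketch(c(0, 1, 2.5, 4, 10), w)
pcat_sketch(2, w, lower.tail = FALSE)
```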

It is possible to sample from a categorical distribution parametrized by a vector of unnormalized log-probabilities $\alpha_1, \dots, \alpha_m$ without leaving the log space by employing the Gumbel-max trick (Maddison, Tarlow and Minka, 2014). If $g_1, \dots, g_m$ are samples from the Gumbel distribution with cumulative distribution function $F(g) = \exp(-\exp(-g))$, then $k = \arg\max_i (g_i + \alpha_i)$ is a draw from the categorical distribution parametrized by the vector of probabilities $p_1, \dots, p_m$, such that $p_i = \exp(\alpha_i) / \sum_{j=1}^m \exp(\alpha_j)$. This is implemented in the rcatlp function, which is parametrized by a vector of log-probabilities log_prob.
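The Gumbel-max trick described above can be sketched in base R as follows. The helper name `rcatlp_sketch` is hypothetical and this is not the package's `rcatlp`; it assumes a single vector of log-weights (no matrix input, no labels) and draws Gumbel noise by inverting the CDF $F(g) = \exp(-\exp(-g))$.

```r
# Hypothetical helper: Gumbel-max sampling from unnormalized log-probabilities.
# Not the package's rcatlp(); vector input only, for illustration.
rcatlp_sketch <- function(n, log_prob) {
  m <- length(log_prob)
  # Standard Gumbel draws via inverse CDF:
  # F(g) = exp(-exp(-g))  =>  g = -log(-log(u)), u ~ Uniform(0, 1)
  g <- matrix(-log(-log(runif(n * m))), nrow = n, ncol = m)
  # Add alpha[i] to column i, then take the row-wise argmax
  scores <- sweep(g, 2, log_prob, `+`)
  max.col(scores, ties.method = "first")
}

set.seed(1)
alpha <- log(c(2, 4, 3, 1))            # unnormalized log-weights
x <- rcatlp_sketch(1e5, alpha)
prop.table(table(x))                   # should be close to 0.2, 0.4, 0.3, 0.1
```

The weights are never exponentiated or normalized in this sketch, which is the point of staying in log space: very small log-probabilities do not underflow to zero before sampling.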

Examples

# Generating 10 random draws from categorical distribution
# with k=3 categories occurring with equal probabilities
# parametrized using a vector

rcat(10, c(1/3, 1/3, 1/3))

# or with k=5 categories parametrized using a matrix of probabilities
# (generated from Dirichlet distribution)

p <- rdirichlet(10, c(1, 1, 1, 1, 1))
rcat(10, p)

x <- rcat(1e5, c(0.2, 0.4, 0.3, 0.1))
plot(prop.table(table(x)), type = "h")
lines(0:5, dcat(0:5, c(0.2, 0.4, 0.3, 0.1)), col = "red")

p <- rdirichlet(1, rep(1, 20))
x <- rcat(1e5, matrix(rep(p, 2), nrow = 2, byrow = TRUE))
xx <- 0:21
plot(prop.table(table(x)))
lines(xx, dcat(xx, p), col = "red")

xx <- seq(0, 21, by = 0.01)
plot(ecdf(x))
lines(xx, pcat(xx, p), col = "red", lwd = 2)

pp <- seq(0, 1, by = 0.001)
plot(ecdf(x))
lines(qcat(pp, p), pp, col = "red", lwd = 2)

References

Maddison, C. J., Tarlow, D., & Minka, T. (2014). A* sampling. In: Advances in Neural Information Processing Systems (pp. 3086-3094). https://arxiv.org/abs/1411.0030

  • Maintainer: Tymoteusz Wolodzko
  • License: GPL-2
  • Last published: 2023-11-30