x: matrix or dataframe with categorical (character/factor/logical) or metric (numeric) predictors.
y: class vector (character/factor/logical).
formula: an object of class "formula" (or one that can be coerced to "formula") of the form: class ~ predictors (class has to be a factor/character/logical).
data: matrix or dataframe with categorical (character/factor/logical) or metric (numeric) predictors.
prior: vector with prior probabilities of the classes. If unspecified, the class proportions for the training set are used. If present, the probabilities should be specified in the order of the factor levels; the sketch at the end of this argument list illustrates this.
laplace: value used for Laplace smoothing (additive smoothing). Defaults to 0 (no Laplace smoothing).
usekernel: logical; if TRUE, density is used to estimate the class-conditional densities of metric predictors. This applies to vectors with class "numeric". For further details on the interaction between the usekernel and usepoisson parameters, please see the Note below.
usepoisson: logical; if TRUE, Poisson distribution is used to estimate the class conditional PMFs of integer predictors (vectors with class "integer").
subset: an optional vector specifying a subset of observations to be used in the fitting process.
na.action: a function which indicates what should happen when the data contain NAs. By default (na.pass), missing values are not removed from the data but are omitted when the tables are constructed. Alternatively, na.omit can be used to exclude rows with at least one missing value before the tables are constructed.
...: other parameters passed to density when usekernel = TRUE (na.rm defaults to TRUE), for instance adjust, kernel or bw.
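As a quick illustration of the prior and laplace arguments, here is a minimal sketch using the built-in iris data (the prior values are illustrative, not recommendations):

library(naivebayes)

# Priors must follow the order of the factor levels of the class:
levels(iris$Species)
nb_prior <- naive_bayes(Species ~ ., data = iris,
                        prior = c(0.5, 0.25, 0.25), # overrides class proportions
                        laplace = 1)                # additive smoothing for tables
nb_prior$prior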
Returns
naive_bayes returns an object of class "naive_bayes", which is a list with the following components:
data: list with two components: x (dataframe with predictors) and y (class variable).
levels: character vector with values of the class variable.
laplace: amount of Laplace smoothing (additive smoothing).
tables: list of tables: for each categorical predictor, a table with class-conditional probabilities; for each integer predictor (if usepoisson = TRUE), a table with the Poisson mean; and for each metric predictor, a table with a mean and standard deviation, or density objects for each class (if usekernel = TRUE). The tables object also carries an attribute "cond_dist": a character vector with the names of the conditional distributions assigned to each feature.
prior: numeric vector with prior probabilities.
usekernel: logical; TRUE if kernel density estimation was used to estimate the class-conditional densities of numeric variables.
usepoisson: logical; TRUE if the Poisson distribution was used to estimate the class-conditional PMFs of non-negative integer variables.
call: the call that produced this object.
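To make the structure of the returned list concrete, the following minimal sketch (using the built-in iris data; the feature name Sepal.Length is specific to iris) inspects the main components:

library(naivebayes)
nb_iris <- naive_bayes(Species ~ ., data = iris)
nb_iris$prior                     # prior probabilities
nb_iris$levels                    # class labels
nb_iris$tables$Sepal.Length       # per-class Gaussian mean and standard deviation
attr(nb_iris$tables, "cond_dist") # conditional distribution assigned to each feature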
Details
Numeric (metric) predictors are handled by assuming that, given the class label, they follow a Gaussian distribution. Alternatively, kernel density estimation can be used (usekernel = TRUE) to estimate their class-conditional distributions, and non-negative integer predictors (variables representing counts) can be modelled with a Poisson distribution (usepoisson = TRUE); for further details please see the Note below. Missing values are not included when constructing the tables. Logical variables are treated as categorical (binary) variables.
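The Gaussian case amounts to storing a per-class mean and standard deviation and evaluating the normal density at prediction time. The following minimal sketch (simulated data; an illustration, not package internals) reproduces the resulting posterior for a single numeric predictor by hand:

set.seed(1)
df <- data.frame(class = factor(rep(c("A", "B"), each = 50)),
                 x = c(rnorm(50, mean = 0), rnorm(50, mean = 2)))

# Per-class Gaussian parameters, analogous to the model's tables
mu    <- tapply(df$x, df$class, mean)
sigma <- tapply(df$x, df$class, sd)
prior <- prop.table(table(df$class))

# Unnormalised posterior for a new observation x0: prior * likelihood
x0 <- 1
post <- prior * dnorm(x0, mean = mu, sd = sigma)
post / sum(post) # normalised posterior probabilities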
Note
The class "numeric" contains "double" (double precision floating point numbers) and "integer". Depending on the parameters usekernel and usepoisson different class conditional distributions are applied to columns in the dataset with the class "numeric":
If usekernel=FALSE and poisson=FALSE then Gaussian distribution is applied to each "numeric" variable ("numeric"&"integer" or "numeric"&"double")
If usekernel=TRUE and poisson=FALSE then kernel density estimation (KDE) is applied to each "numeric" variable ("numeric"&"integer" or "numeric"&"double")
If usekernel=FALSE and poisson=TRUE then Gaussian distribution is applied to each "double" vector and Poisson to each "integer" vector. (Gaussian: "numeric" & "double"; Poisson: "numeric" & "integer")
If usekernel=TRUE and poisson=TRUE then kernel density estimation (KDE) is applied to each "double" vector and Poisson to each "integer" vector. (KDE: "numeric" & "double"; Poisson: "numeric" & "integer")
By default usekernel=FALSE and poisson=FALSE, thus Gaussian is applied to each numeric variable.
On the other hand, "character", "factor" and "logical" variables are modelled with the Categorical distribution, of which the Bernoulli distribution (two levels) is a special case.
Prior to model fitting, the classes of the columns in the data.frame data can be easily checked via:
sapply(data, class)
sapply(data, is.numeric)
sapply(data, is.double)
sapply(data, is.integer)
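The sketch below (simulated data, illustrative names) shows how the column class drives the assigned distribution once usepoisson = TRUE: rnorm produces "double" columns while rpois produces "integer" columns:

library(naivebayes)
set.seed(1)
d <- data.frame(class = factor(sample(c("A", "B"), 100, TRUE)),
                dbl = rnorm(100),     # type "double"  -> Gaussian
                int = rpois(100, 5))  # type "integer" -> Poisson
sapply(d, class)
m <- naive_bayes(class ~ ., data = d, usepoisson = TRUE)
get_cond_dist(m) # e.g. dbl: "Gaussian", int: "Poisson"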
Examples
### Simulate example data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE, FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5, 15)))
train <- data[1:95, ]
test <- data[96:100, -1]

### 1) General usage via formula interface
nb <- naive_bayes(class ~ ., train)
summary(nb)

# Classification
predict(nb, test, type = "class")
nb %class% test

# Posterior probabilities
predict(nb, test, type = "prob")
nb %prob% test

# Helper functions
tables(nb, 1)
get_cond_dist(nb)

# Note: all "numeric" (integer, double) variables are modelled
# with Gaussian distribution by default.

### 2) General usage via matrix/data.frame and class vector
X <- train[-1]
class <- train$class
nb2 <- naive_bayes(x = X, y = class)
nb2 %prob% test

### 3) Model continuous variables non-parametrically
### via kernel density estimation (KDE)
nb_kde <- naive_bayes(class ~ ., train, usekernel = TRUE)
summary(nb_kde)
get_cond_dist(nb_kde)
nb_kde %prob% test

# Visualize class conditional densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")
plot(nb_kde, "count", arg.num = list(legend.cex = 0.9), prob = "conditional")

### ?density and ?bw.nrd for further documentation

# 3.1) Change Gaussian kernel to biweight kernel
nb_kde_biweight <- naive_bayes(class ~ ., train, usekernel = TRUE,
                               kernel = "biweight")
nb_kde_biweight %prob% test
plot(nb_kde_biweight, c("norm", "count"),
     arg.num = list(legend.cex = 0.9), prob = "conditional")

# 3.2) Change "nrd0" (Silverman's rule of thumb) bandwidth selector
nb_kde_SJ <- naive_bayes(class ~ ., train, usekernel = TRUE, bw = "SJ")
nb_kde_SJ %prob% test
plot(nb_kde_SJ, c("norm", "count"),
     arg.num = list(legend.cex = 0.9), prob = "conditional")

# 3.3) Adjust bandwidth
nb_kde_adjust <- naive_bayes(class ~ ., train, usekernel = TRUE, adjust = 1.5)
nb_kde_adjust %prob% test
plot(nb_kde_adjust, c("norm", "count"),
     arg.num = list(legend.cex = 0.9), prob = "conditional")

### 4) Model non-negative integers with Poisson distribution
nb_pois <- naive_bayes(class ~ ., train, usekernel = TRUE, usepoisson = TRUE)
summary(nb_pois)
get_cond_dist(nb_pois)

# Posterior probabilities
nb_pois %prob% test

# Class conditional distributions
plot(nb_pois, "count", prob = "conditional")

# Marginal distributions
plot(nb_pois, "count", prob = "marginal")

## Not run:
vars <- 10
rows <- 1000000
y <- sample(c("a", "b"), rows, TRUE)

# Only categorical variables
X1 <- as.data.frame(matrix(sample(letters[5:9], vars * rows, TRUE),
                           ncol = vars))
nb_cat <- naive_bayes(x = X1, y = y)
nb_cat
system.time(pred2 <- predict(nb_cat, X1))
## End(Not run)