sum_by performs an efficient, optionally weighted by-group summation using linear algebra and the capabilities of the Matrix package. The summation is computed as the matrix cross-product of the y parameter (coerced to a matrix if needed) with a (very) sparse matrix built from the by and (optional) w parameters.
Compared to base R, dplyr or data.table alternatives, this implementation aims to be easier to use in a matrix-oriented context and can yield efficiency gains when the number of columns is large.
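The cross-product idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the package's actual code: the helper name sum_by_sketch is hypothetical, it handles dense numeric input only, and it ignores the na_rm and keep_sparse options.

```r
library(Matrix)

# Hypothetical helper, for illustration only (not the sum_by source code).
sum_by_sketch <- function(y, by, w = NULL) {
  y <- as.matrix(y)
  storage.mode(y) <- "double"
  by <- as.factor(by)
  if (is.null(w)) w <- rep(1, nrow(y))
  # n x H sparse matrix: row i holds a single non-zero value, w[i],
  # in the column corresponding to the group of observation i.
  x <- sparseMatrix(
    i = seq_len(nrow(y)),
    j = as.integer(by),
    x = w,
    dims = c(nrow(y), nlevels(by)),
    dimnames = list(NULL, levels(by))
  )
  # t(x) %*% y is an H x p matrix of (weighted) group sums.
  as.matrix(crossprod(x, y))
}

y <- matrix(1:6, ncol = 2, dimnames = list(NULL, c("v1", "v2")))
by <- c("a", "b", "a")
res <- sum_by_sketch(y, by, w = c(1, 1, 2))
```

Because each row of the sparse matrix has exactly one non-zero entry, its storage grows with the number of rows rather than with rows times groups, which is what makes the cross-product cheap.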
sum_by(y, by, w = NULL, na_rm = TRUE, keep_sparse = FALSE)
Arguments
y: A (sparse) vector, a (sparse) matrix or a data.frame. The object to perform by-group summation on.
by: The factor variable defining the by-groups. Character variables are coerced to factors.
w: The optional row weights to be used in the summation.
na_rm: Should NA values in y be removed (i.e. treated as 0 in the summation)? Similar to the na.rm argument of sum, but TRUE by default. If FALSE, NA values in y produce NA values in the result.
keep_sparse: When y is a sparse vector or a sparse matrix, should the result also be sparse? FALSE by default. As sparseVector-class does not have a names attribute, when y is a sparseVector the result does not have any names (and a warning is issued).
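The effect of na_rm described above can be illustrated with a base R analogue. The snippet below uses rowsum rather than sum_by, so it only sketches the documented semantics (NA treated as 0 versus NA propagated); the object names are chosen for the example.

```r
y <- matrix(c(1, NA, 3, 4, 5, 6), ncol = 2)
by <- c("a", "a", "b")

# Analogue of na_rm = TRUE: NA values are replaced by 0 before summing.
y0 <- y
y0[is.na(y0)] <- 0
removed <- rowsum(y0, by)

# Analogue of na_rm = FALSE: an NA in a group makes that group's sum NA.
kept <- rowsum(y, by)
```

Only the cell for group "a" in the first column differs between the two results, since that is the only group and column containing an NA.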
Returns
A vector, a matrix or a data.frame depending on the type of y. If y is sparse and keep_sparse = TRUE, then the result is also sparse (without names when it is a sparse vector, see keep_sparse argument for details).
Examples
# Data generation
set.seed(1)
n <- 100
p <- 10
H <- 3
y <- matrix(rnorm(n * p), ncol = p, dimnames = list(NULL, paste0("var", 1:10)))
y[1, 1] <- NA
by <- letters[sample.int(H, n, replace = TRUE)]
w <- rep(1, n)
w[by == "a"] <- 2

# Standard use
sum_by(y, by)

# Keeping the NAs
sum_by(y, by, na_rm = FALSE)

# With a weight
sum_by(y, by, w = w)