Fast (Grouped, Weighted) Sum for Matrix-Like Objects
fsum is a generic function that computes the (column-wise) sum of all values in x, (optionally) grouped by g and/or weighted by w (e.g. to calculate survey totals). The TRA argument can further be used to transform x using its (grouped, weighted) sum.
fsum(x, ...)

## Default S3 method:
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'matrix'
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'data.frame'
fsum(x, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, fill = FALSE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'grouped_df'
fsum(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
     fill = FALSE, nthreads = .op[["nthreads"]], ...)
Arguments
x: a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').
g: a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.
w: a numeric vector of (non-negative) weights, may contain missing values.
na.rm: logical. Skip missing values in x. Defaults to TRUE and is implemented at very little computational cost. If na.rm = FALSE, NA is returned when a missing value is encountered.
use.g.names: logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.
fill: logical. Initialize result with 0 instead of NA when na.rm = TRUE e.g. fsum(NA, fill = TRUE) returns 0 instead of NA.
nthreads: integer. The number of threads to utilize. See Details.
drop: matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.
keep.group_vars: grouped_df method: Logical. FALSE removes grouping variables after computation.
keep.w: grouped_df method: Logical. Retain summed weighting variable after computation (if contained in grouped_df).
stub: character. If keep.w = TRUE and stub = TRUE (default), the summed weights column is prefixed by "sum.". Users can specify a different prefix through this argument, or set it to FALSE to avoid prefixing.
...: arguments to be passed to or from other methods. If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly.
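A few of the arguments above are easiest to understand by trying them. The following is a minimal sketch, assuming the collapse package is installed (the guard skips the calls otherwise); the variable names are illustrative only.

```r
# Sketch only: the collapse calls run if the package is installed.
if (requireNamespace("collapse", quietly = TRUE)) {
  library(collapse)

  # fill: start the result at 0 instead of NA when all values are missing
  no_fill   <- fsum(NA_real_)
  with_fill <- fsum(NA_real_, fill = TRUE)
  print(c(no_fill = no_fill, with_fill = with_fill))

  # drop: without groups, the data frame method returns an atomic vector
  # unless drop = FALSE is requested
  dropped <- fsum(mtcars)
  kept    <- fsum(mtcars, drop = FALSE)
  print(is.atomic(dropped))
  print(is.data.frame(kept))

  # use.g.names: group labels become the names of the result (default method)
  print(names(fsum(mtcars$mpg, mtcars$cyl)))
  print(names(fsum(mtcars$mpg, mtcars$cyl, use.g.names = FALSE)))
}
```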
Details
The weighted sum (e.g. survey total) is computed as sum(x * w), but in a single pass and about twice as efficiently. If na.rm = TRUE, missing values are removed from both x and w, i.e. only x[complete.cases(x,w)] and w[complete.cases(x,w)] are used.
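The missing-value rule for weighted sums can be checked against a base R reference. This is a small sketch with made-up data; the comparison with fsum only runs if collapse is installed, and it prints the result of the check rather than asserting it.

```r
# Toy data with a missing value in x and another in w
x <- c(2, NA, 4, 6)
w <- c(1, 3, NA, 0.5)

# Base R reference: keep only observations where both x and w are present
cc  <- complete.cases(x, w)   # TRUE for observations 1 and 4 only
ref <- sum(x[cc] * w[cc])     # 2 * 1 + 6 * 0.5
print(ref)

# Compare with fsum, which should follow the same rule when na.rm = TRUE
if (requireNamespace("collapse", quietly = TRUE)) {
  print(all.equal(collapse::fsum(x, w = w, na.rm = TRUE), ref))
}
```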
This all seamlessly generalizes to grouped computations, which are performed in a single pass (without splitting the data) and are therefore extremely fast. See Benchmark and Examples below.
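For orientation, a grouped sum corresponds to what base R computes with tapply() after splitting the data by group. The sketch below builds that reference and, if collapse is installed, prints whether fsum agrees with it; it makes no claim about timings.

```r
mpg <- mtcars$mpg
cyl <- mtcars$cyl

# Base R reference: split mpg by cyl, then sum each piece
by_base <- tapply(mpg, cyl, sum)
print(by_base)

# fsum computes the same group totals without splitting the data
if (requireNamespace("collapse", quietly = TRUE)) {
  by_fsum <- collapse::fsum(mpg, cyl)
  print(all.equal(as.numeric(by_fsum), as.numeric(by_base)))
}
```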
When applied to data frames with groups or drop = FALSE, fsum preserves all column attributes. The attributes of the data frame itself are also preserved.
Since v1.6.0, fsum explicitly supports integers. Integers are summed using the long long type in C, which is bounded at ±9,223,372,036,854,775,807 (about 4.3 billion times larger in magnitude than the R integer bounds of ±2,147,483,647). If the sum falls outside ±2,147,483,647, a double containing the result is returned; otherwise an integer is returned. With groups, an integer result vector is initialized, and an integer overflow error is raised if the sum in any group falls outside ±2,147,483,647. In such cases the data needs to be coerced to double beforehand.
Multithreading, added in v1.8.0, applies at the column level unless g = NULL and nthreads > NCOL(x). Parallelism over groups is not available because sums are computed simultaneously within each group. nthreads = 1L uses a serial version of the code, not parallel code running on one thread. This serial code is always used with fewer than 100,000 observations (length(x) < 100000 for vectors and matrices), because parallel execution itself carries some overhead.
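The integer behaviour can be illustrated with a two-element vector whose total just exceeds the R integer range. This is a sketch: the base R part is standard behaviour, while the fsum calls (run only if collapse is installed) simply print what is returned so it can be compared with the description above.

```r
# Two integers whose total is one above the largest R integer
x <- c(.Machine$integer.max, 1L)

# Base R: integer sum overflows to NA (with a warning)
base_int <- suppressWarnings(sum(x))
# Coercing to double first gives the exact total
base_dbl <- sum(as.double(x))
print(base_int)
print(base_dbl)

if (requireNamespace("collapse", quietly = TRUE)) {
  # Ungrouped: inspect the storage type of the result
  s <- collapse::fsum(x)
  print(typeof(s))
  print(s)

  # Grouped: coerce to double beforehand to avoid the overflow error
  g <- c(1L, 1L)
  print(collapse::fsum(as.double(x), g))
}
```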
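As a rough illustration of the nthreads argument, the sketch below sums a matrix large enough to exceed the 100,000-observation threshold, once serially and once with two threads, and prints whether both agree with colSums(). The data are simulated and the thread count of 2 is an arbitrary choice; the collapse calls only run if the package is installed.

```r
set.seed(1)
m <- matrix(rnorm(2e5 * 4), ncol = 4)   # 200,000 rows, 4 columns

# Base R reference for the column sums
ref <- colSums(m)

if (requireNamespace("collapse", quietly = TRUE)) {
  serial   <- collapse::fsum(m, nthreads = 1L)   # serial code path
  parallel <- collapse::fsum(m, nthreads = 2L)   # multithreaded request
  print(all.equal(as.numeric(serial), ref))
  print(all.equal(as.numeric(serial), as.numeric(parallel)))
}
```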
Returns
The (w weighted) sum of x, grouped by g, or (if TRA is used) x transformed by its (grouped, weighted) sum.
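To make the transformed return value concrete, TRA = "%" expresses each value as a percentage of its (group) sum. The sketch below computes base R references for the ungrouped and grouped cases and, if collapse is installed, prints whether fsum matches them.

```r
mpg <- mtcars$mpg
cyl <- mtcars$cyl

# Base R references: each value as a percentage of the overall / group total
pct_all <- mpg / sum(mpg) * 100
pct_grp <- mpg / ave(mpg, cyl, FUN = sum) * 100

if (requireNamespace("collapse", quietly = TRUE)) {
  print(all.equal(as.numeric(collapse::fsum(mpg, TRA = "%")), pct_all))
  print(all.equal(as.numeric(collapse::fsum(mpg, cyl, TRA = "%")), pct_grp))
}
```

The result has the same length as x, in contrast to the one-value-per-group output obtained without TRA.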
See Also
fprod, fmean, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
mpg <- mtcars$mpg
fsum(mpg)                          # Simple sum
fsum(mpg, w = mtcars$hp)           # Weighted sum (total): Weighted by hp
fsum(mpg, TRA = "%")               # Simple transformation: obtain percentages of mpg
fsum(mpg, mtcars$cyl)              # Grouped sum
fsum(mpg, mtcars$cyl, mtcars$hp)   # Weighted grouped sum (total)
fsum(mpg, mtcars[c(2,8:9)])        # More groups..
g <- GRP(mtcars, ~ cyl + vs + am)  # Precomputing groups gives more speed !
fsum(mpg, g)
fmean(mpg, g) == fsum(mpg, g) / fnobs(mpg, g)
fsum(mpg, g, TRA = "%")            # Percentages by group

## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%")
fsum(mtcars, g)
fsum(mtcars, g, TRA = "%")

## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, TRA = "%")
fsum(m, g)
fsum(m, g, TRA = "%")

## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fsum(hp)  # Weighted grouped sum (total)
mtcars |> fgroup_by(cyl,vs,am) |> fsum(TRA = "%")
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg) |> fsum()

## This compares fsum with data.table and base::rowsum
# Starting with small data
library(data.table)
opts <- set_collapse(nthreads = getDTthreads())
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)
library(microbenchmark)
microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
               rowsum(mtcDT, f, reorder = FALSE),
               fsum(mtcDT, f, na.rm = FALSE), unit = "relative")

# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100.000 obs
f <- qF(sample.int(1e4, 1e5, TRUE))                        # A factor with 10.000 groups
microbenchmark(tdata[, lapply(.SD, sum), by = f],
               rowsum(tdata, f, reorder = FALSE),
               fsum(tdata, f, na.rm = FALSE), unit = "relative")

# Reset options
set_collapse(opts)