Fast (Grouped, Weighted) Variance and Standard Deviation for Matrix-Like Objects
Fast (Grouped, Weighted) Variance and Standard Deviation for Matrix-Like Objects
fvar and fsd are generic functions that compute the (column-wise) variance and standard deviation of x, (optionally) grouped by g and/or frequency-weighted by w. The TRA argument can further be used to transform x using its (grouped, weighted) variance/sd.
fvar(x,...)fsd(x,...)## Default S3 method:fvar(x, g =NULL, w =NULL, TRA =NULL, na.rm = .op[["na.rm"]], use.g.names =TRUE, stable.algo = .op[["stable.algo"]],...)## Default S3 method:fsd(x, g =NULL, w =NULL, TRA =NULL, na.rm = .op[["na.rm"]], use.g.names =TRUE, stable.algo = .op[["stable.algo"]],...)## S3 method for class 'matrix'fvar(x, g =NULL, w =NULL, TRA =NULL, na.rm = .op[["na.rm"]], use.g.names =TRUE, drop =TRUE, stable.algo = .op[["stable.algo"]],...)## S3 method for class 'matrix'fsd(x, g =NULL, w =NULL, TRA =NULL, na.rm = .op[["na.rm"]], use.g.names =TRUE, drop =TRUE, stable.algo = .op[["stable.algo"]],...)## S3 method for class 'data.frame'fvar(x, g =NULL, w =NULL, TRA =NULL, na.rm = .op[["na.rm"]], use.g.names =TRUE, drop =TRUE, stable.algo = .op[["stable.algo"]],...)## S3 method for class 'data.frame'fsd(x, g =NULL, w =NULL, TRA =NULL, na.rm = .op[["na.rm"]], use.g.names =TRUE, drop =TRUE, stable.algo = .op[["stable.algo"]],...)## S3 method for class 'grouped_df'fvar(x, w =NULL, TRA =NULL, na.rm = .op[["na.rm"]], use.g.names =FALSE, keep.group_vars =TRUE, keep.w =TRUE, stub = .op[["stub"]], stable.algo = .op[["stable.algo"]],...)## S3 method for class 'grouped_df'fsd(x, w =NULL, TRA =NULL, na.rm = .op[["na.rm"]], use.g.names =FALSE, keep.group_vars =TRUE, keep.w =TRUE, stub = .op[["stub"]], stable.algo = .op[["stable.algo"]],...)
Arguments
x: a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').
g: a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.
w: a numeric vector of (non-negative) weights, may contain missing values.
na.rm: logical. Skip missing values in x. Defaults to TRUE and implemented at very little computational cost. If na.rm = FALSE a NA is returned when encountered.
use.g.names: logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's.
drop: matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.
keep.group_vars: grouped_df method: Logical. FALSE removes grouping variables after computation.
keep.w: grouped_df method: Logical. Retain summed weighting variable after computation (if contained in grouped_df).
stub: character. If keep.w = TRUE and stub = TRUE (default), the summed weights column is prefixed by "sum.". Users can specify a different prefix through this argument, or set it to FALSE to avoid prefixing.
stable.algo: logical. TRUE (default) use Welford's numerically stable online algorithm. FALSE implements a faster but numerically unstable one-pass method. See Details.
...: arguments to be passed to or from other methods. If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly.
Details
Welford's online algorithm used by default to compute the variance is well described here (the section Weighted incremental algorithm also shows how the weighted variance is obtained by this algorithm).
If stable.algo = FALSE, the variance is computed in one-pass as (sum(x^2)-n*mean(x)^2)/(n-1), where sum(x^2) is the sum of squares from which the expected sum of squares n*mean(x)^2 is subtracted, normalized by n-1 (Bessel's correction). This is numerically unstable if sum(x^2) and n*mean(x)^2 are large numbers very close together, which will be the case for large n, large x-values and small variances (catastrophic cancellation occurs, leading to a loss of numeric precision). Numeric precision is however still maximized through the internal use of long doubles in C++, and the fast algorithm can be up to 4-times faster compared to Welford's method.
The weighted variance is computed with frequency weights as (sum(x^2*w)-sum(w)*weighted.mean(x,w)^2)/(sum(w)-1). If na.rm = TRUE, missing values will be removed from both x and w i.e. utilizing only x[complete.cases(x,w)] and w[complete.cases(x,w)].
For further computational detail see fsum.
Returns
fvar returns the (w weighted) variance of x, grouped by g, or (if TRA is used) x transformed by its (grouped, weighted) variance. fsd computes the standard deviation of x in like manor.
References
Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics. 4 (3): 419-420. doi:10.2307/1266577.
See Also
Fast Statistical Functions , Collapse Overview
Examples
## default vector methodfvar(mtcars$mpg)# Simple variance (all examples also hold for fvar!)fsd(mtcars$mpg)# Simple standard deviationfsd(mtcars$mpg, w = mtcars$hp)# Weighted sd: Weighted by hpfsd(mtcars$mpg, TRA ="/")# Simple transformation: scaling (See also ?fscale)fsd(mtcars$mpg, mtcars$cyl)# Grouped sdfsd(mtcars$mpg, mtcars$cyl, mtcars$hp)# Grouped weighted sdfsd(mtcars$mpg, mtcars$cyl, TRA ="/")# Scaling by groupfsd(mtcars$mpg, mtcars$cyl, mtcars$hp,"/")# Group-scaling using weighted group sds## data.frame methodfsd(iris)# This works, although 'Species' is a factor variablefsd(mtcars, drop =FALSE)# This works, all columns are numeric variablesfsd(iris[-5], iris[5])# By Species: iris[5] is still a list, and thus passed to GRP()fsd(iris[-5], iris[[5]])# Same thing much faster: fsd recognizes 'Species' is a factorhead(fsd(iris[-5], iris[[5]], TRA ="/"))# Data scaled by species (see also fscale)## matrix methodm <- qM(mtcars)fsd(m)fsd(m, mtcars$cyl)# etc..## method for grouped data frames - created with dplyr::group_by or fgroup_bymtcars |> fgroup_by(cyl,vs,am)|> fsd()mtcars |> fgroup_by(cyl,vs,am)|> fsd(keep.group_vars =FALSE)# Remove grouping columnsmtcars |> fgroup_by(cyl,vs,am)|> fsd(hp)# Weighted by hpmtcars |> fgroup_by(cyl,vs,am)|> fsd(hp,"/")# Weighted scaling transformation