BY function

Split-Apply-Combine Computing

BY is an S3 generic that efficiently applies functions over vectors, or over the columns of matrices and data frames, by groups. Like dapply, it seeks to retain the structure and attributes of the data, but it can also output to various standard formats. Simple parallelism is also available.

BY(x, ...)

## Default S3 method:
BY(x, g, FUN, ..., use.g.names = TRUE, sort = .op[["sort"]], reorder = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "vector", "list"))

## S3 method for class 'matrix'
BY(x, g, FUN, ..., use.g.names = TRUE, sort = .op[["sort"]], reorder = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))

## S3 method for class 'data.frame'
BY(x, g, FUN, ..., use.g.names = TRUE, sort = .op[["sort"]], reorder = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))

## S3 method for class 'grouped_df'
BY(x, FUN, ..., reorder = TRUE, keep.group_vars = TRUE, use.g.names = FALSE)

Arguments

  • x: a vector, matrix, data frame or similar object.
  • g: a GRP object, or a factor / atomic vector / list of atomic vectors (internally converted to a GRP object) used to group x.
  • FUN: a function, which can be scalar- or vector-valued. For vector-valued functions see also reorder and expand.wide.
  • ...: further arguments to FUN, or to BY.data.frame for the 'grouped_df' method. Since v1.9.0, data-length arguments are also split by groups.
  • use.g.names: logical. Create group names and add them to the result as names (default method) or row names (matrix and data frame methods). For vector-valued functions, (row) names are only generated if the function itself names its output, e.g. quantile() adds names, whereas range() and log() do not. No row names are generated for data.tables.
  • sort: logical. Sort the groups? Internally passed to GRP, and only effective if g is not already a factor or GRP object.
  • reorder: logical. If a vector-valued function is passed that preserves the data length, TRUE will reorder the result such that the elements/rows match the original data. FALSE just combines the data in order of the groups (i.e. all elements of the first group in first-appearance order, followed by all elements of the second group, etc.). Note that if reorder = FALSE, grouping variables, names or row names are only retained if the grouping is on sorted data, see GRP.
  • expand.wide: logical. If FUN is a vector-valued function returning a vector of fixed length > 1 (such as the quantile function), expand.wide can be used to return the result in a wider format (instead of stacking the resulting vectors of fixed length above each other in each output column).
  • parallel: logical. TRUE implements simple parallel execution by internally calling mclapply instead of lapply. Parallelism is across columns, except for the default method.
  • mc.cores: integer. Argument to mclapply indicating the number of cores to use for parallel execution. Use detectCores() to select all available cores.
  • return: an integer or string indicating the type of object to return. The default 1 - "same" returns the same object type (i.e. class and other attributes are retained if the underlying data type is the same, just the names for the dimensions are adjusted). 2 - "matrix" always returns the output as matrix, 3 - "data.frame" always returns a data frame and 4 - "list" returns the raw (uncombined) output. Note: 4 - "list" works together with expand.wide to return a list of matrices.
  • keep.group_vars: grouped_df method: Logical. FALSE removes grouping variables after computation. See also the Note.
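The effect of reorder is easiest to see on data that is not sorted by group. The following is a minimal sketch, assuming the collapse package is installed; the data and the helper demean are made up for illustration and are not part of the package.

```r
library(collapse)

x <- c(1, 2, 3, 4, 5, 6)
g <- c("b", "a", "b", "a", "b", "a")   # grouping vector, deliberately unsorted

# Hypothetical length-preserving (vector-valued) function: subtract the group mean
demean <- function(z) z - mean(z)

# reorder = TRUE: elements of the result line up with the elements of x
aligned <- BY(x, g, demean, sort = TRUE, reorder = TRUE)

# reorder = FALSE: results are combined group by group
# (all of group "a" first, then all of group "b")
stacked <- BY(x, g, demean, sort = TRUE, reorder = FALSE)

aligned
stacked
```

With reorder = TRUE the output can be assigned straight back as a new column next to x; with reorder = FALSE it cannot, unless the data were already sorted by group.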

Details

BY is a re-implementation of the Split-Apply-Combine computing paradigm. It is faster than tapply, by, aggregate and (d)plyr, and preserves data attributes just like dapply.

It is principally a wrapper around lapply(gsplit(x, g), FUN, ...) that uses gsplit for optimized splitting and whose internal code is also heavily optimized compared to base R functions. For more details, see the documentation for dapply, which works very similarly (apart from the splitting performed in BY). The function is intended for simple cases involving flexible computation of statistics across groups using a single function, e.g. iris |> gby(Species) |> BY(IQR) is simpler than iris |> gby(Species) |> smr(acr(.fns = IQR)).
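The wrapper relationship described above can be sketched by hand for the default (vector) method. This is only an illustration of the idea, assuming the collapse package is installed; it does not reproduce the attribute handling or internal optimizations of BY.

```r
library(collapse)

v <- iris$Sepal.Length
g <- GRP(iris$Species)

# Manual split-apply-combine: split v by group, apply sum, combine into a vector
manual <- unlist(lapply(gsplit(v, g, use.g.names = TRUE), sum))

# The same computation through BY
by_res <- BY(v, g, sum)

# The group sums should agree
all.equal(as.numeric(manual), as.numeric(by_res))
```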

Returns

x, with FUN applied to every column split by g.
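As a quick check of what "same" means for the return value, the sketch below (assuming the collapse package is installed) applies sum by species to a data frame and to a matrix and inspects the type of each result.

```r
library(collapse)

g <- GRP(iris$Species)

df_res <- BY(num_vars(iris), g, sum)       # data frame in, data frame out
m_res  <- BY(qM(num_vars(iris)), g, sum)   # matrix in, matrix out

is.data.frame(df_res)
is.matrix(m_res)
dim(m_res)   # one row per group, one column per numeric variable
```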

See Also

dapply, collap, Fast Statistical Functions, Data Transformations, Collapse Overview

Examples

library(collapse)

v <- iris$Sepal.Length   # A numeric vector
g <- GRP(iris$Species)   # A grouping

## default vector method
BY(v, g, sum)                                # Sum by species
head(BY(v, g, scale))                        # Scale by species (please use fscale instead)
BY(v, g, fquantile)                          # Species quantiles: by default stacked
BY(v, g, fquantile, expand.wide = TRUE)      # Wide format

## matrix method
m <- qM(num_vars(iris))
BY(m, g, sum)                                # Also returns a matrix
BY(m, g, sum, return = "data.frame")         # Return as data.frame; also works for computations below
head(BY(m, g, scale))
BY(m, g, fquantile)
BY(m, g, fquantile, expand.wide = TRUE)
ml <- BY(m, g, fquantile, expand.wide = TRUE, # Return as list of matrices
         return = "list")
ml
# Unlisting to a data frame
unlist2d(ml, idcols = "Variable", row.names = "Species")

## data.frame method
BY(num_vars(iris), g, sum)                   # Also returns a data.frame
BY(num_vars(iris), g, sum, return = 2)       # Return as matrix; also works for computations below
head(BY(num_vars(iris), g, scale))
BY(num_vars(iris), g, fquantile)
BY(num_vars(iris), g, fquantile, expand.wide = TRUE)
BY(num_vars(iris), g, fquantile,             # Return as list of matrices
   expand.wide = TRUE, return = "list")

## grouped data frame method
giris <- fgroup_by(iris, Species)
giris |> BY(sum)                             # Compute sum
giris |> BY(sum, use.g.names = TRUE,         # Use row.names and
            keep.group_vars = FALSE)         # remove 'Species' and groups attribute
giris |> BY(sum, return = "matrix")          # Return matrix
giris |> BY(sum, return = "matrix",          # Matrix with row.names
            use.g.names = TRUE)
giris |> BY(.quantile)                       # Compute quantiles (output is stacked)
giris |> BY(.quantile, names = TRUE,         # Wide output
            expand.wide = TRUE)

  • Maintainer: Sebastian Krantz
  • License: GPL (>= 2) | file LICENSE
  • Last published: 2025-03-10