fndistinct function

Fast (Grouped) Distinct Value Count for Matrix-Like Objects

fndistinct is a generic function that (column-wise) computes the number of distinct values in x, (optionally) grouped by g. It is significantly faster than length(unique(x)). The TRA argument can further be used to transform x using its (grouped) distinct value count.
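As a rough illustration of the description above, the following sketch compares fndistinct with the base R idiom on the built-in airquality data. It assumes the collapse package is installed. The variable names (n_fast, n_base and so on) are only for illustration, and na.rm = TRUE is passed explicitly so the result does not depend on global options.

```r
library(collapse)

x <- airquality$Solar.R   # numeric vector containing some NA values
g <- airquality$Month     # grouping vector

# Ungrouped count. With na.rm = TRUE missing values are skipped,
# so the base R equivalent has to drop them as well.
n_fast <- fndistinct(x, na.rm = TRUE)
n_base <- length(unique(x[!is.na(x)]))

# Grouped count: one value per group, named by group by default.
n_by_month      <- fndistinct(x, g, na.rm = TRUE)
n_by_month_base <- tapply(x, g, function(v) length(unique(v[!is.na(v)])))
```

Both approaches should agree on the counts; the difference is in speed and in the built-in support for grouping.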

fndistinct(x, ...)

## Default S3 method:
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'matrix'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'data.frame'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'grouped_df'
fndistinct(x, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = FALSE, keep.group_vars = TRUE,
           nthreads = .op[["nthreads"]], ...)

Arguments

  • x: a vector, matrix, data frame or grouped data frame (class 'grouped_df').
  • g: a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.
  • TRA: an integer or quoted operator indicating the transformation to perform: 0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.
  • na.rm: logical. TRUE: Skip missing values in x (faster computation). FALSE: Also consider 'NA' as one distinct value.
  • use.g.names: logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.
  • nthreads: integer. The number of threads to utilize. Parallelism is across groups for grouped computations and at the column-level otherwise.
  • drop: matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.
  • keep.group_vars: grouped_df method: Logical. FALSE removes grouping variables after computation.
  • ...: arguments to be passed to or from other methods. If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly.
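The arguments above can be sketched on a small toy vector. The example assumes the collapse package is installed. The data and the names x_replace and x_fill are made up for illustration, and the TRA values are taken from the list in the TRA argument description.

```r
library(collapse)

x <- c(1, 1, 2, NA, 3, 4, 5, 5)
g <- c("a", "a", "a", "a", "b", "b", "b", "b")

# Plain grouped count: one element per group, named by group
# (group "a" holds the distinct values 1 and 2, group "b" holds 3, 4 and 5).
n_by_group <- fndistinct(x, g, na.rm = TRUE)

# TRA = "replace": each non-missing value is replaced by its group's count,
# while missing values stay missing.
x_replace <- fndistinct(x, g, TRA = "replace", na.rm = TRUE)

# TRA = "fill": like "replace", but missing values are overwritten as well.
x_fill <- fndistinct(x, g, TRA = "fill", na.rm = TRUE)

# drop = FALSE keeps the data frame structure instead of
# returning an atomic vector.
n_df <- fndistinct(airquality, drop = FALSE)
```

With TRA, the result has the same length as x, which makes it convenient for adding a per-group distinct count as a new column.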

Details

fndistinct implements a fast C-level hashing algorithm, inspired by the kit package, to find the number of distinct values.
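To make the idea of hash-based counting concrete, here is a minimal base R sketch. It is not the package's C implementation: the function count_distinct_hash is hypothetical, it uses an R environment as the hash set, and it keys values by their character representation, so doubles that differ only beyond 15 significant digits would be merged.

```r
# Illustrative only: count distinct values by inserting each one into a hash set.
count_distinct_hash <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]          # is.na() is TRUE for NA and NaN alike
  keys <- as.character(x)               # NaN becomes "NaN", NA stays NA
  # Prefix real values so they cannot collide with the marker used for NA
  keys <- ifelse(is.na(keys), "na", paste0("v:", keys))
  seen <- new.env(hash = TRUE)          # environment used as a hash set
  n <- 0L
  for (k in keys) {
    if (!exists(k, envir = seen, inherits = FALSE)) {
      assign(k, TRUE, envir = seen)     # first time this value is seen
      n <- n + 1L
    }
  }
  n
}

count_distinct_hash(c("a", "b", "a", ""))
```

Each value is looked up once and inserted at most once, so the work grows roughly linearly with the length of x rather than requiring a sort.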

If na.rm = TRUE (the default), missing values are skipped, which yields substantial performance gains in data with many missing values. If na.rm = FALSE, missing values are treated like any other value and read into the hash-map. Thus, for the numeric vector c(1.25,NaN,3.56,NA), na.rm = TRUE gives a distinct value count of 2, whereas na.rm = FALSE gives a distinct value count of 4, because NaN and NA are counted as two separate values.
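The na.rm behaviour described above can be checked directly. This sketch assumes the collapse package is installed; the counts in the comments are the ones stated in the paragraph above.

```r
library(collapse)

x <- c(1.25, NaN, 3.56, NA)

n_skip <- fndistinct(x, na.rm = TRUE)    # NaN and NA skipped: 2 per the text above
n_keep <- fndistinct(x, na.rm = FALSE)   # NaN and NA each counted: 4 per the text above

# Base R's unique() also keeps NA and NaN apart, so it matches na.rm = FALSE.
n_base <- length(unique(x))
```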

fndistinct preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (e.g. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.

Returns

Integer. The number of distinct values in x, grouped by g, or (if TRA is used) x transformed by its distinct value count, grouped by g.

See Also

fnunique, fnobs, Fast Statistical Functions, Collapse Overview

Examples

## default vector method
fndistinct(airquality$Solar.R)                   # Simple distinct value count
fndistinct(airquality$Solar.R, airquality$Month) # Grouped distinct value count

## data.frame method
fndistinct(airquality)
fndistinct(airquality, airquality$Month)
fndistinct(wlddev)                               # Works with data of all types!
head(fndistinct(wlddev, wlddev$iso3c))

## matrix method
aqm <- qM(airquality)
fndistinct(aqm)                                  # Also works for character or logical matrices
fndistinct(aqm, airquality$Month)

## method for grouped data frames - created with dplyr::group_by or fgroup_by
airquality |> fgroup_by(Month) |> fndistinct()
wlddev |> fgroup_by(country) |>
  fselect(PCGDP,LIFEEX,GINI,ODA) |> fndistinct()

  • Maintainer: Sebastian Krantz
  • License: GPL (>= 2) | file LICENSE
  • Last published: 2025-03-10