Fast (Grouped) Distinct Value Count for Matrix-Like Objects
fndistinct is a generic function that (column-wise) computes the number of distinct values in x, (optionally) grouped by g. It is significantly faster than length(unique(x)). The TRA argument can further be used to transform x using its (grouped) distinct value count.
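As a rough illustration of what is being computed, the sketch below compares fndistinct against a base R reference built from length(unique(x)). It assumes the collapse package is installed and uses the built-in airquality data; the variable names (ref_total, fast_grouped, and so on) are made up for this example.

```r
library(collapse)

x <- airquality$Solar.R
g <- airquality$Month

# Base R reference: drop missing values, then count the unique values
ref_total   <- length(unique(x[!is.na(x)]))
ref_grouped <- tapply(x, g, function(v) length(unique(v[!is.na(v)])))

# Same quantities via fndistinct (na.rm = TRUE set explicitly for clarity)
fast_total   <- fndistinct(x, na.rm = TRUE)
fast_grouped <- fndistinct(x, g, na.rm = TRUE)

# Both approaches should agree on the counts
same_total   <- fast_total == ref_total
same_grouped <- all(as.integer(fast_grouped) == as.integer(ref_grouped))
```

The base R version loops over groups in R and allocates the unique values for each one, which is where a single-pass hashing approach can save time.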
Usage
fndistinct(x, ...)

## Default S3 method:
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'matrix'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'data.frame'
fndistinct(x, g = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = TRUE, drop = TRUE, nthreads = .op[["nthreads"]], ...)

## S3 method for class 'grouped_df'
fndistinct(x, TRA = NULL, na.rm = .op[["na.rm"]],
           use.g.names = FALSE, keep.group_vars = TRUE,
           nthreads = .op[["nthreads"]], ...)
Arguments
x: a vector, matrix, data frame or grouped data frame (class 'grouped_df').
g: a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.
na.rm: logical. TRUE: Skip missing values in x (faster computation). FALSE: Also consider 'NA' as one distinct value.
use.g.names: logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.
nthreads: integer. The number of threads to utilize. Parallelism is across groups for grouped computations and at the column-level otherwise.
drop: matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.
keep.group_vars: grouped_df method: Logical. FALSE removes grouping variables after computation.
...: arguments to be passed to or from other methods. If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly.
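The following sketch shows how g and TRA might be combined in practice. It assumes the collapse package is installed, uses the built-in airquality data, and relies on "replace_fill" being an accepted TRA operation that overwrites every element of x with its group statistic; the object names are illustrative only.

```r
library(collapse)

x <- airquality$Ozone
g <- airquality$Month            # atomic vector, converted to a grouping internally

# One distinct value count per month, named by group
counts <- fndistinct(x, g)

# The same grouping supplied as a precomputed GRP object
grp        <- GRP(g)
counts_grp <- fndistinct(x, grp)

# TRA: replace each element of x by the distinct value count of its group
filled <- fndistinct(x, g, TRA = "replace_fill")

# The transformed vector has the length of x, not the number of groups
lengths_ok <- length(filled) == length(x)
```

Precomputing a GRP object is mainly worthwhile when the same grouping is reused across several calls.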
Details
fndistinct implements a fast C-level hashing algorithm, inspired by the kit package, to find the number of distinct values.
If na.rm = TRUE (the default), missing values are skipped, which yields substantial performance gains in data with many missing values. If na.rm = FALSE, missing values are treated like any other value and read into the hash map. Thus, with na.rm = TRUE, the numeric vector c(1.25,NaN,3.56,NA) has a distinct value count of 2, whereas with na.rm = FALSE it has a distinct value count of 4.
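The example vector from this paragraph can be checked directly. This is a minimal sketch assuming the collapse package is installed; n_skip and n_keep are names chosen for this illustration.

```r
library(collapse)

x <- c(1.25, NaN, 3.56, NA)

# Missing values skipped: only 1.25 and 3.56 are counted
n_skip <- fndistinct(x, na.rm = TRUE)

# Missing values kept: NaN and NA enter the hash map like any other value,
# so the count is higher (4 according to the paragraph above)
n_keep <- fndistinct(x, na.rm = FALSE)
```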
fndistinct preserves all attributes of non-classed vectors / columns, and only the 'label' attribute (if available) of classed vectors / columns (i.e. dates or factors). When applied to data frames and matrices, the row-names are adjusted as necessary.
Returns
Integer. The number of distinct values in x, grouped by g, or (if TRA is used) x transformed by its distinct value count, grouped by g.
See Also
fnunique, fnobs, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
fndistinct(airquality$Solar.R)                    # Simple distinct value count
fndistinct(airquality$Solar.R, airquality$Month)  # Grouped distinct value count

## data.frame method
fndistinct(airquality)
fndistinct(airquality, airquality$Month)
fndistinct(wlddev)                                # Works with data of all types!
head(fndistinct(wlddev, wlddev$iso3c))

## matrix method
aqm <- qM(airquality)
fndistinct(aqm)                                   # Also works for character or logical matrices
fndistinct(aqm, airquality$Month)

## method for grouped data frames - created with dplyr::group_by or fgroup_by
airquality |> fgroup_by(Month) |> fndistinct()
wlddev |> fgroup_by(country) |>
  fselect(PCGDP, LIFEEX, GINI, ODA) |> fndistinct()