Fast (Grouped, Weighted) N'th Element/Quantile for Matrix-Like Objects
fnth (column-wise) returns the n'th smallest element from a set of unsorted elements x corresponding to an integer index (n), or to a probability between 0 and 1. If n is passed as a probability, ties can be resolved using the lower, upper, or average of the possible elements, or (default) continuous quantile estimation. For n > 1, the lower element is always returned (as in sort(x, partial = n)[n]). See Details.
fmedian is a simple wrapper around fnth, which fixes n = 0.5 and (default) ties = "mean", i.e., it averages eligible elements. See Details.
fnth(x, n = 0.5, ...)
fmedian(x, ...)

## Default S3 method:
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, ties = "q7", nthreads = .op[["nthreads"]],
     o = NULL, check.o = is.null(attr(o, "sorted")), ...)
## Default S3 method:
fmedian(x, ..., ties = "mean")

## S3 method for class 'matrix'
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'matrix'
fmedian(x, ..., ties = "mean")

## S3 method for class 'data.frame'
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'data.frame'
fmedian(x, ..., ties = "mean")

## S3 method for class 'grouped_df'
fnth(x, n = 0.5, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
     ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'grouped_df'
fmedian(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
        use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
        ties = "mean", nthreads = .op[["nthreads"]], ...)
Arguments
x: a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').
n: the element to return using a single integer index such that 1 < n < NROW(x), or a probability 0 < n < 1. See Details.
g: a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.
w: a numeric vector of (non-negative) weights, may contain missing values only where x is also missing.
na.rm: logical. Skip missing values in x. Defaults to TRUE and implemented at very little computational cost. If na.rm = FALSE, NA is returned when a missing value is encountered.
use.g.names: logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.tables.
ties: an integer or character string specifying the method to resolve ties between adjacent qualifying elements:
  Int.   String    Description
  1      "mean"    take the arithmetic mean of all qualifying elements.
  2      "min"     take the smallest of the elements.
  3      "max"     take the largest of the elements.
  4-9    "qn"      continuous quantile types 4-9, see fquantile.
nthreads: integer. The number of threads to utilize. Parallelism is across groups for grouped computations on vectors and data frames, and at the column-level otherwise. See Details.
o: integer. A valid ordering of x, e.g. radixorder(x). With groups, the grouping needs to be accounted e.g. radixorder(g, x).
check.o: logical. TRUE checks that each element of o is within [1, length(x)]. The default uses the fact that orderings from radixorder have a "sorted" attribute, which lets fnth infer that the ordering is valid. The length and data type of o are always checked, regardless of check.o.
drop: matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.
keep.group_vars: grouped_df method: Logical. FALSE removes grouping variables after computation.
keep.w: grouped_df method: Logical. Retain sum of weighting variable after computation (if contained in grouped_df).
stub: character. If keep.w = TRUE and stub = TRUE (default), the summed weights column is prefixed by "sum.". Users can specify a different prefix through this argument, or set it to FALSE to avoid prefixing.
...: for fmedian: further arguments passed to fnth (apart from n). If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly.
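The tie-resolution options listed under ties can be illustrated with a minimal base-R sketch for an unweighted median of an even-length vector, where exactly two elements qualify. This is not the package's implementation: the names x, lower, upper, tie_mean and q7 are illustrative, and the sketch assumes that the "q7" option corresponds to type 7 of base R's quantile().

```r
# Base-R sketch of the tie-resolution options (illustrative names, not collapse code)
x  <- c(7, 1, 5, 3)                  # even length: two elements qualify at p = 0.5
xs <- sort(x)
k  <- length(xs) / 2                 # position of the lower middle element

lower    <- xs[k]                    # what ties = "min" is described as returning
upper    <- xs[k + 1]                # what ties = "max" is described as returning
tie_mean <- mean(c(lower, upper))    # what ties = "mean" is described as returning

# Continuous estimate, assuming "q7" matches quantile type 7 in base R
q7 <- unname(quantile(x, 0.5, type = 7))

print(c(min = lower, max = upper, mean = tie_mean, q7 = q7))
```

At p = 0.5 the simple mean and the type 7 estimate coincide here; for other probabilities the continuous types interpolate between the two adjacent elements rather than averaging them.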
Details
fnth uses a combination of quickselect, quicksort, and radixsort algorithms, combined with several (weighted) quantile estimation methods and, where possible, OpenMP multithreading:
without weights, quickselect is used to determine a (lower) order statistic. If ties %!in% c("min", "max") a second order statistic is found by taking the max of the upper part of the partitioned array, and the two statistics are averaged using a simple mean (ties = "mean"), or weighted average according to a quantile method (ties = "q4"-"q9"). For n = 0.5, all supported quantile methods give the sample median. With matrices, multithreading is always across columns; for vectors and data frames it is across groups, except for data frames without groups (is.null(g)), which are processed across columns.
with weights and no groups (is.null(g)), radixorder is called internally (on each column of x). The ordering is used to sum the weights in order of x and determine weighted order statistics or quantiles. See details below. Multithreading is disabled as radixorder cannot be called concurrently on the same memory stack.
with weights and groups (!is.null(g)), R's quicksort algorithm is used to sort the data in each group and return an index which can be used to sum the weights in order and proceed as before. This is multithreaded across columns for matrices, and across groups otherwise.
in fnth.default, an ordering of x can be supplied to 'o' e.g. fnth(x, 0.75, o = radixorder(x)). This dramatically speeds up the estimation both with and without weights, and is useful if fnth is to be invoked repeatedly on the same data. With groups, o needs to also account for the grouping e.g. fnth(x, 0.75, g, o = radixorder(g, x)). Multithreading is possible across groups. See Examples.
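The benefit of a precomputed ordering can be sketched in base R: once the ordering is known, each order statistic is an index lookup rather than a new selection pass. The sketch below uses order() as a stand-in for radixorder(), and nth_from_order is a hypothetical helper, not part of the package.

```r
# Base-R sketch: reuse one ordering for several order statistics
x <- c(21.0, 22.8, 18.7, 14.3, 24.4, 19.2, 10.4, 30.4)
o <- order(x)                                  # computed once; stand-in for radixorder(x)

nth_from_order <- function(x, o, n) x[o[n]]    # hypothetical helper: n'th smallest via lookup
q <- vapply(c(2L, 4L, 6L), function(n) nth_from_order(x, o, n), numeric(1))

# Grouped case: the ordering must sort by group first, then by value within group
g  <- c("a", "a", "b", "b", "a", "b", "a", "b")
og <- order(g, x)                              # stand-in for radixorder(g, x)
starts    <- match(unique(g[og]), g[og])       # first position of each group within og
group_min <- x[og[starts]]                     # smallest element of each group

print(q)
print(group_min)
```

The grouped ordering places each group's elements in a contiguous, sorted block, which is why the grouping has to be part of the ordering when g is supplied.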
If n > 1, the result is equivalent to (column-wise) sort(x, partial = n)[n]. Internally, n is converted to a probability using p = (n-1)/(NROW(x)-1), and that probability is applied to the set of non-missing elements to find the as.integer(p*(fnobs(x)-1))+1L'th element (which corresponds to option ties = "min").
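The index arithmetic above can be traced with a small base-R sketch. It is only an illustration of the stated formulas, not the package's C code; nth_index is a hypothetical helper and sum(!is.na(x)) stands in for fnobs(x). The inputs are chosen so that p is exactly representable, since as.integer() truncates and this naive version may be sensitive to floating-point rounding for other inputs.

```r
# Base-R sketch of converting an integer n to a position among non-missing values
nth_index <- function(x, n) {                  # hypothetical helper
  p    <- (n - 1) / (NROW(x) - 1)              # integer index -> probability
  nobs <- sum(!is.na(x))                       # stand-in for fnobs(x)
  as.integer(p * (nobs - 1)) + 1L              # position among the non-missing elements
}

x      <- c(9, 2, 7, 4, 1, 8, 3, 6, 5)         # 9 values, none missing
i_full <- nth_index(x, 3)                      # p = 0.25, position 3
val_full <- sort(x)[i_full]

x_na   <- c(9, NA, 7, NA, 1, NA, 3, NA, 5)     # same length, 5 non-missing values
i_na   <- nth_index(x_na, 3)                   # p = 0.25, position 2 of 5
val_na <- sort(x_na)[i_na]                     # sort() drops NA by default

print(c(i_full = i_full, i_na = i_na))
```

Without missing values the position equals n, matching sort(x, partial = n)[n]; with missing values the same probability is applied to the shorter non-missing set, so the position shrinks accordingly.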
When using grouped computations with n > 1, n is transformed to a probability p = (n-1)/(NROW(x)/ng-1) (where ng contains the number of unique groups in g).
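A short base-R sketch of the grouped conversion, under the assumption that the resulting probability is then applied within each group in the same way as in the ungrouped case. The variable names are illustrative, and the group sizes are chosen so that the arithmetic is exact.

```r
# Base-R sketch: grouped conversion of an integer n to a probability
g  <- rep(c("a", "b"), times = c(3, 7))        # unequal groups, NROW = 10
n  <- 3
ng <- length(unique(g))                        # number of groups
p  <- (n - 1) / (length(g) / ng - 1)           # (3 - 1) / (5 - 1) = 0.5

sizes <- tabulate(factor(g))                   # group sizes: 3 and 7
pos   <- as.integer(p * (sizes - 1)) + 1L      # within-group positions implied by p

print(p)
print(pos)
```

Because the probability is based on the average group size, a fixed integer n maps to different within-group positions when groups are unbalanced (here the 2nd element of the small group and the 4th of the large one).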
If weights are used and ties = "q4"-"q9", weighted continuous quantile estimation is done as described in fquantile.
For ties %in% c("mean", "min", "max"), a target partial sum of weights p*sum(w) is calculated, and the weighted n'th element is the element k such that all elements smaller than k have a sum of weights <= p*sum(w), and all elements larger than k have a sum of weights <= (1 - p)*sum(w). If the partial sum of weights (p*sum(w)) is reached exactly for some element k, then (summing from the lower end) both k and k+1 would qualify as the weighted n'th element. If the weight of element k+1 is zero, then k, k+1 and k+2 would qualify, and so on. If n > 1, k is chosen (consistent with the unweighted behavior).
If 0 < n < 1, the ties option regulates how to resolve such conflicts, yielding lower (ties = "min": k), upper (ties = "max": k+2) or average weighted (ties = "mean": mean(k, k+1, k+2)) n'th elements.
Thus, in the presence of zero weights, the weighted median (default ties = "mean") can be an arithmetic average of >2 qualifying elements.
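The qualifying rule described above can be checked with a minimal base-R sketch. It applies the stated definition directly (sum of weights below and above each element) and is not the package's algorithm; weighted_candidates is a hypothetical helper, and the sketch assumes distinct values in x.

```r
# Base-R sketch of the weighted n'th element definition with a zero weight
weighted_candidates <- function(x, w, p) {     # hypothetical helper
  o  <- order(x)
  xs <- x[o]
  ws <- w[o]
  total <- sum(ws)
  below <- cumsum(ws) - ws                     # weight of all strictly smaller elements
  above <- total - cumsum(ws)                  # weight of all strictly larger elements
  xs[below <= p * total & above <= (1 - p) * total]
}

x <- c(10, 20, 30, 40, 50)
w <- c(1, 1, 0, 1, 1)                          # the middle element has zero weight
cand <- weighted_candidates(x, w, 0.5)         # target p * sum(w) = 2 is hit exactly at 20

print(cand)                                    # qualifying elements k, k+1, k+2
print(c(min = min(cand), max = max(cand), mean = mean(cand)))
```

Here three elements qualify, so the lower, upper and averaged results differ, and the averaged value is a mean of more than two elements.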
For data frames, column-attributes and overall attributes are preserved if g is used or drop = FALSE.
Returns
The (w weighted) n'th element/quantile of x, grouped by g, or (if TRA is used) x transformed by its (grouped, weighted) n'th element/quantile.
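The two shapes of return value can be sketched in base R for the median case: one value per group, or a transformed vector of the same length as x. This uses tapply() and ave() as stand-ins and assumes complete data; it only mirrors what a grouped median and a TRA = "-" transformation are described as producing.

```r
# Base-R sketch: aggregated result versus transformed result
x <- c(4, 8, 6, 1, 9, 5)
g <- c("a", "a", "a", "b", "b", "b")

med      <- tapply(x, g, median)               # one median per group, named by group
centered <- x - ave(x, g, FUN = median)        # x minus its group median, same length as x

print(med)
print(centered)
```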
See Also
fquantile, fmean, fmode, Fast Statistical Functions, Collapse Overview
Examples
## default vector method
mpg <- mtcars$mpg
fnth(mpg)                                 # Simple nth element: Median (same as fmedian(mpg))
fnth(mpg, 5)                              # 5th smallest element
sort(mpg, partial = 5)[5]                 # Same using base R, fnth is 2x faster.
fnth(mpg, 0.75)                           # Third quartile
fnth(mpg, 0.75, w = mtcars$hp)            # Weighted third quartile: Weighted by hp
fnth(mpg, 0.75, TRA = "-")                # Simple transformation: Subtract third quartile
fnth(mpg, 0.75, mtcars$cyl)               # Grouped third quartile
fnth(mpg, 0.75, mtcars[c(2,8:9)])         # More groups..
g <- GRP(mtcars, ~ cyl + vs + am)         # Precomputing groups gives more speed !
fnth(mpg, 0.75, g)
fnth(mpg, 0.75, g, mtcars$hp)             # Grouped weighted third quartile
fnth(mpg, 0.75, g, TRA = "-")             # Groupwise subtract third quartile
fnth(mpg, 0.75, g, mtcars$hp, "-")        # Groupwise subtract weighted third quartile

## data.frame method
fnth(mtcars, 0.75)
head(fnth(mtcars, 0.75, TRA = "-"))
fnth(mtcars, 0.75, g)
fnth(fgroup_by(mtcars, cyl, vs, am), 0.75)   # Another way of doing it..
fnth(mtcars, 0.75, g, use.g.names = FALSE)   # No row-names generated

## matrix method
m <- qM(mtcars)
fnth(m, 0.75)
head(fnth(m, 0.75, TRA = "-"))
fnth(m, 0.75, g)                          # etc..

## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75)
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75, hp)          # Weighted
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75, TRA = "/")   # Divide by third quartile
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg, hp) |>     # Faster selecting
  fnth(0.75, hp, "/")   # Divide mpg by its third weighted group-quartile, using hp as weights

# Efficient grouped estimation of multiple quantiles
mtcars |> fgroup_by(cyl,vs,am) |>
  fmutate(o = radixorder(GRPid(), mpg)) |>
  fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o),
             mpg_median = fmedian(mpg, o = o),
             mpg_Q3 = fnth(mpg, 0.75, o = o))

## fmedian()
fmedian(mpg)                              # Simple median value
fmedian(mpg, w = mtcars$hp)               # Weighted median: Weighted by hp
fmedian(mpg, TRA = "-")                   # Simple transformation: Subtract median value
fmedian(mpg, mtcars$cyl)                  # Grouped median value
fmedian(mpg, mtcars[c(2,8:9)])            # More groups..
fmedian(mpg, g)
fmedian(mpg, g, mtcars$hp)                # Grouped weighted median
fmedian(mpg, g, TRA = "-")                # Groupwise subtract median value
fmedian(mpg, g, mtcars$hp, "-")           # Groupwise subtract weighted median value

## data.frame method
fmedian(mtcars)
head(fmedian(mtcars, TRA = "-"))
fmedian(mtcars, g)
fmedian(fgroup_by(mtcars, cyl, vs, am))   # Another way of doing it..
fmedian(mtcars, g, use.g.names = FALSE)   # No row-names generated

## matrix method
fmedian(m)
head(fmedian(m, TRA = "-"))
fmedian(m, g)                             # etc..

## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fmedian()
mtcars |> fgroup_by(cyl,vs,am) |> fmedian(hp)             # Weighted
mtcars |> fgroup_by(cyl,vs,am) |> fmedian(TRA = "-")      # De-median
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg, hp) |>     # Faster selecting
  fmedian(hp, "-")      # Weighted de-median mpg, using hp as weights