Small Functions to Make R Programming More Efficient
A small set of functions to address some common inefficiencies in R, such as the creation of logical vectors to compare quantities, unnecessary copies of objects in elementary mathematical or subsetting operations, obtaining information about objects (esp. data frames), or dealing with missing values.
Usage
anyv(x, value)                  # Faster than any(x == value). See also kit::panyv()
allv(x, value)                  # Faster than all(x == value). See also kit::pallv()
allNA(x)                        # Faster than all(is.na(x)). See also kit::pallNA()
whichv(x, value,                # Faster than which(x == value)
       invert = FALSE)          # or which(x != value). See also Note (3)
whichNA(x, invert = FALSE)      # Faster than which((!)is.na(x))
x %==% value                    # Infix for whichv(x, value, FALSE), use e.g. in fsubset()
x %!=% value                    # Infix for whichv(x, value, TRUE). See also Note (3)
alloc(value, n,                 # Fast rep_len(value, n) or replicate(n, value).
      simplify = TRUE)          # simplify only works if length(value) == 1. See Details.
copyv(X, v, R, ..., invert      # Fast replace(X, v, R), replace(X, X (!/=)= v, R) or
      = FALSE, vind1 = FALSE,   # replace(X, (!)v, R[(!)v]). See Details and Note (4).
      xlist = FALSE)            # For multi-replacement see also kit::vswitch()
setv(X, v, R, ..., invert       # Same for X[v] <- R, X[X (!/=)= v] <- R or
     = FALSE, vind1 = FALSE,    # X[(!)v] <- R[(!)v]. Modifies X by reference, fastest.
     xlist = FALSE)             # X/R/V can also be lists/DFs. See Details and Examples.
setop(X, op, V, ...,            # Faster than X <- X + V (or -, *, /), modifies by reference
      rowwise = FALSE)          # optionally can also add V to rows of a matrix or list
X %+=% V                        # Infix for setop(X, "+", V). See also Note (2)
X %-=% V                        # Infix for setop(X, "-", V). See also Note (2)
X %*=% V                        # Infix for setop(X, "*", V). See also Note (2)
X %/=% V                        # Infix for setop(X, "/", V). See also Note (2)
na_rm(x)                        # Fast: if(anyNA(x)) x[!is.na(x)] else x, last
na_locf(x, set = FALSE)         # obs. carried forward and first obs. carried back.
na_focb(x, set = FALSE)         # (by reference). These also support lists (NULL/empty)
na_omit(X, cols = NULL,         # Faster na.omit for matrices and data frames,
        na.attr = FALSE,        # can use selected columns to check, attach indices,
        prop = 0, ...)          # and remove cases with a proportion of values missing
na_insert(X, prop = 0.1,        # Insert missing values at random
          value = NA)
missing_cases(X, cols = NULL,   # The opposite of complete.cases(), faster for DF's.
  prop = 0, count = FALSE)      # See also kit::panyNA(), kit::pallNA(), kit::pcountNA()
vlengths(X, use.names = TRUE)   # Faster lengths() and nchar() (in C, no method dispatch)
vtypes(X, use.names = TRUE)     # Get data storage types (faster vapply(X, typeof, ...))
vgcd(x)                         # Greatest common divisor of positive integers or doubles
fnlevels(x)                     # Faster version of nlevels(x) (for factors)
fnrow(X)                        # Faster nrow for data frames (not faster for matrices)
fncol(X)                        # Faster ncol for data frames (not faster for matrices)
fdim(X)                         # Faster dim for data frames (not faster for matrices)
seq_row(X)                      # Fast integer sequences along rows of X
seq_col(X)                      # Fast integer sequences along columns of X
vec(X)                          # Vectorization (stacking) of matrix or data frame/list
cinv(x)                         # Choleski (fast) inverse of symmetric PD matrix, e.g. X'X
Arguments
X, V, R: a vector, matrix or data frame.
x, v: an (atomic) vector or matrix (na_rm also supports lists).
value: a single value of any (atomic) vector type. For whichv it can also be a length(x) vector.
invert: logical. TRUE considers elements x != value.
set: logical. TRUE transforms x by reference.
simplify: logical. If value is a length-1 atomic vector, alloc() with simplify = TRUE returns a length-n atomic vector. If simplify = FALSE, the result is always a list.
vind1: logical. If length(v) == 1L, setting vind1 = TRUE will interpret v as an index, rather than a value to search and replace.
xlist: logical. If X is a list, the default is to treat it like a data frame and replace rows. Setting xlist = TRUE will treat X and its replacement R like 1-dimensional list vectors.
op: an integer or character string indicating the operation to perform.
  Int.   String   Description
  1      "+"      add V
  2      "-"      subtract V
  3      "*"      multiply by V
  4      "/"      divide by V
rowwise: logical. TRUE performs the operation between V and each row of X.
cols: select columns to check for missing values using column names, indices, a logical vector or a function (e.g. is.numeric). The default is to check all columns, which could be inefficient.
n: integer. The length of the vector to allocate with value.
na.attr: logical. TRUE adds an attribute containing the removed cases. For compatibility reasons this is exactly the same format as na.omit i.e. the attribute is called "na.action" and of class "omit".
prop: double. For na_insert: the proportion of observations to be randomly replaced with NA. For missing_cases and na_omit: the proportion of values missing for the case to be considered missing (within cols if specified). For matrices this is implemented in R as rowSums(is.na(X)) >= max(as.integer(prop * ncol(X)), 1L). The C code for data frames works equivalently, and skips list- and raw-columns (ncol(X) is adjusted downwards).
count: logical. TRUE returns the row-wise missing value count (within cols). This ignores prop.
use.names: logical. Preserve names if X is a list.
...: for na_omit: further arguments passed to [ for vectors and matrices. With indexed data it is also possible to specify the drop.index.levels argument, see indexing. For copyv, setv and setop, the argument is unused, and serves as a placeholder for possible future arguments.
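The matrix rule quoted under prop can be tried out directly in base R. The following is a minimal sketch, not the package's code: missing_cases_ref is a hypothetical helper name introduced here for illustration, and it covers only the matrix case described above (no cols selection, no data frame handling).

```r
# Base R reference for the matrix rule quoted under `prop`.
# `missing_cases_ref` is a hypothetical name, not a function of the package.
missing_cases_ref <- function(X, prop = 0) {
  # A row counts as missing once its NA count reaches the threshold,
  # which is never lower than 1.
  rowSums(is.na(X)) >= max(as.integer(prop * ncol(X)), 1L)
}

X <- matrix(c(1, NA, NA, NA,   # row 1: three missing values
              2,  3, NA, NA,   # row 2: two missing values
              4,  5,  6,  7),  # row 3: complete
            nrow = 3, byrow = TRUE)

res_default <- missing_cases_ref(X)               # threshold max(0, 1) = 1
res_075     <- missing_cases_ref(X, prop = 0.75)  # threshold as.integer(3) = 3
res_all     <- missing_cases_ref(X, prop = 1)     # threshold 4: fully missing rows only
```

With the default prop = 0 the threshold is one missing value, so the result is the opposite of complete.cases(X). Raising prop only tightens the condition.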
Details
alloc is a fusion of rep_len and replicate that is faster in both cases. If value is a length one atomic vector (logical, integer, double, string, complex or raw) and simplify = TRUE, the functionality is as rep_len(value, n) i.e. the output is a length n atomic vector with the same attributes as value (apart from "names", "dim" and "dimnames"). For all other cases the functionality is as replicate(n, value, simplify = FALSE) i.e. the output is a length-n list of the objects. For efficiency reasons the object is not copied i.e. only the pointer to the object is replicated.
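The two cases described above correspond to two base R calls. The sketch below uses only base R, so it shows the shape of the results that alloc is documented to reproduce, not alloc itself or its speed. The variable names are illustrative.

```r
n <- 4L

# Case 1: a length-one atomic value with simplify = TRUE.
# Documented counterpart: alloc(1.5, n)
atomic_result <- rep_len(1.5, n)

# Case 2: any other value (here a length-3 vector).
# Documented counterpart: alloc(vec_value, n)
vec_value   <- c(1, 2, 3)
list_result <- replicate(n, vec_value, simplify = FALSE)

# Setting simplify = FALSE is documented to always give a list,
# even for a length-one value. Base R counterpart:
scalar_list <- replicate(n, 1.5, simplify = FALSE)
```

Note that replicate() evaluates its expression n times, whereas the text above states that alloc only replicates the pointer to the object.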
copyv and setv are designed to optimize operations that require replacing data in objects in the broadest sense. The only difference between them is that copyv first deep-copies X before doing replacements whereas setv modifies X in place and returns the result invisibly. There are 3 ways these functions can be used:
1. To replace a single value, setv(X, v, R) is an efficient alternative to X[X == v] <- R, and copyv(X, v, R) is more efficient than replace(X, X == v, R). This can be inverted using setv(X, v, R, invert = TRUE), equivalent to X[X != v] <- R.
2. For standard replacement with integer or logical indices, setv(X, v, R) is more efficient than X[v] <- R, and, if v is logical, setv(X, v, R, invert = TRUE) is an efficient alternative to X[!v] <- R. To distinguish this from use case (1) when length(v) == 1, set vind1 = TRUE to ensure that v is always interpreted as an index.
3. To copy values from an object of equal size, setv(X, v, R) is faster than X[v] <- R[v], and setv(X, v, R, invert = TRUE) is faster than X[!v] <- R[!v].
Both X and R can be atomic or data frames / lists. If X is a list, the default behavior is to interpret it like a data frame, and apply setv/copyv to each element/column of X. If R is also a list, this is done using mapply. Thus setv/copyv can also be used to replace elements or rows in data frames, or copy rows from equally sized frames. Note that for replacing subsets in data frames set from data.table provides a more convenient interface (and there is also copy if you just want to deep-copy an object without any modifications to it).
If X should not be interpreted like a data frame, setting xlist = TRUE will interpret it like a 1D list-vector analogous to atomic vectors, except that use case (1) is not permitted i.e. no value comparisons on list elements.
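The three use cases for atomic vectors can be pinned down with the base R expressions that the text names as their slower equivalents. This is a sketch in base R only. The setv/copyv calls appear in comments as the documented counterparts and are not executed here, and the vectors are made-up illustration data.

```r
X <- c(5L, 2L, 5L, 7L, 2L)
R <- c(10L, 20L, 30L, 40L, 50L)

# Use case 1: replace a value. Counterparts: setv(X, 5L, 0L) / copyv(X, 5L, 0L)
X1 <- X
X1[X1 == 5L] <- 0L
X1_copy <- replace(X, X == 5L, 0L)   # copy-returning form, as with copyv

# Use case 2: replace at indices. Counterpart: setv(X, 2:3, 0L)
X2 <- X
X2[2:3] <- 0L

# Use case 3: copy from an equally sized object. Counterpart: setv(X, v, R)
v  <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
X3 <- X
X3[v] <- R[v]

# The ambiguity that vind1 resolves: a length-one v can be a value or an index.
X_value <- X; X_value[X_value == 2L] <- 0L   # v = 2 read as a value (the default)
X_index <- X; X_index[2L] <- 0L              # v = 2 read as an index (vind1 = TRUE)
```

The last two lines give different results on the same input, which is why a length-one v needs vind1 = TRUE when it is meant as a position.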
Note
1. None of these functions (apart from alloc) currently support complex vectors.
2. setop and the operators %+=%, %-=%, %*=% and %/=% also work with integer data, but do not perform any integer-related checks. R's integers are bounded between ±2,147,483,647, and NA_integer_ is stored as the value -2,147,483,648. Thus computations resulting in values exceeding ±2,147,483,647 will result in integer overflows, and NA_integer_ should not occur on either side of a setop call. These are programmers' functions, meant to provide the most efficient math possible to responsible users.
3. It is possible to compare factors by the levels (e.g. iris$Species %==% "setosa") or using integers (iris$Species %==% 1L). The latter is slightly more efficient. Nothing special is implemented for other objects apart from basic types, e.g. for dates (which are stored as doubles) you need to generate a date object i.e. wlddev$date %==% as.Date("2019-01-01"). Using wlddev$date %==% "2019-01-01" will give integer(0).
4. setv/copyv only allow positive integer indices to be passed to v, and, for efficiency reasons, they only check the first and the last index. Thus if there are indices in the middle that fall outside of the data range, R will terminate.
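The storage facts behind the notes on integer data and on factors and dates can be inspected in base R. The sketch below does not call setop or %==%. It only shows what base R itself does, namely checked integer arithmetic and class-aware comparison, which the notes say the functions on this page skip for speed.

```r
# Integers: base R checks for overflow and returns NA with a warning.
int_max  <- .Machine$integer.max                 # 2147483647
overflow <- suppressWarnings(int_max + 1L)       # NA_integer_ in base R

# Factors: stored as integer codes with a levels attribute, so comparing
# by level and comparing by code select the same elements.
species   <- iris$Species
by_level  <- which(species == "setosa")
by_code   <- which(as.integer(species) == 1L)

# Dates: stored as doubles (days since 1970-01-01) with a class attribute.
d          <- as.Date("2019-01-01")
d_storage  <- typeof(d)                          # "double"
d_days     <- as.numeric(d)                      # 17897
# Base R's == converts the string to a Date first; the note states that
# %==% does no such conversion, hence the need for as.Date() there.
base_equal <- d == "2019-01-01"
```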
See Also
Data Transformations, Small (Helper) Functions, Collapse Overview
Examples
oldopts <- options(max.print = 70)

## Which value
whichNA(wlddev$PCGDP)                 # Same as which(is.na(wlddev$PCGDP))
whichNA(wlddev$PCGDP, invert = TRUE)  # Same as which(!is.na(wlddev$PCGDP))
whichv(wlddev$country, "Chad")        # Same as which(wlddev$country == "Chad")
wlddev$country %==% "Chad"            # Same thing
whichv(wlddev$country, "Chad", TRUE)  # Same as which(wlddev$country != "Chad")
wlddev$country %!=% "Chad"            # Same thing
lvec <- wlddev$country == "Chad"      # If we already have a logical vector...
whichv(lvec, FALSE)                   # is faster than which(!lvec)
rm(lvec)

# Using the %==% operator can yield tangible performance gains
fsubset(wlddev, iso3c %==% "DEU")     # 3x faster than:
fsubset(wlddev, iso3c == "DEU")

# With multiple categories we can use %iin%
fsubset(wlddev, iso3c %iin% c("DEU", "ITA", "FRA"))

## Math by reference: permissible types of operations
x <- alloc(1.0, 1e5)                       # Vector
x %+=% 1
x %+=% 1:1e5
xm <- matrix(alloc(1.0, 1e5), ncol = 100)  # Matrix
xm %+=% 1
xm %+=% 1:1e3
setop(xm, "+", 1:100, rowwise = TRUE)
xm %+=% xm
xm %+=% 1:1e5
xd <- qDF(replicate(100, alloc(1.0, 1e3), simplify = FALSE))  # Data Frame
xd %+=% 1
xd %+=% 1:1e3
setop(xd, "+", 1:100, rowwise = TRUE)
xd %+=% xd
rm(x, xm, xd)

## setv() and copyv()
x <- rnorm(100)
y <- sample.int(10, 100, replace = TRUE)
setv(y, 5, 0)             # Faster than y[y == 5] <- 0
setv(y, 4, x)             # Faster than y[y == 4] <- x[y == 4]
setv(y, 20:30, y[40:50])  # Faster than y[20:30] <- y[40:50]
setv(y, 20:30, x)         # Faster than y[20:30] <- x[20:30]
rm(x, y)

# Working with data frames, here returning copies of the frame
copyv(mtcars, 20:30, ss(mtcars, 10:20))
copyv(mtcars, 20:30, fscale(mtcars))
ftransform(mtcars, new = copyv(cyl, 4, vs))
# Column-wise:
copyv(mtcars, 2:3, fscale(mtcars), xlist = TRUE)
copyv(mtcars, 2:3, mtcars[4:5], xlist = TRUE)

## Missing values
mtc_na <- na_insert(mtcars, 0.15)               # Set 15% of values missing at random
fnobs(mtc_na)                                   # See observation count
missing_cases(mtc_na)                           # Fast equivalent to !complete.cases(mtc_na)
missing_cases(mtc_na, cols = 3:4)               # Missing cases on certain columns?
missing_cases(mtc_na, count = TRUE)             # Missing case count
missing_cases(mtc_na, prop = 0.8)               # Cases with 80% or more missing
missing_cases(mtc_na, cols = 3:4, prop = 1)     # Cases missing columns 3 and 4
missing_cases(mtc_na, cols = 3:4, count = TRUE) # Missing case count on columns 3 and 4

na_omit(mtc_na)                       # 12x faster than na.omit(mtc_na)
na_omit(mtc_na, prop = 0.8)           # Only remove cases missing 80% or more
na_omit(mtc_na, na.attr = TRUE)       # Adds attribute with removed cases, like na.omit
na_omit(mtc_na, cols = .c(vs, am))    # Removes only cases missing vs or am
na_omit(qM(mtc_na))                   # Also works for matrices
na_omit(mtc_na$vs, na.attr = TRUE)    # Also works with vectors
na_rm(mtc_na$vs)                      # For vectors na_rm is faster ...
rm(mtc_na)

## Efficient vectorization
head(vec(EuStockMarkets))  # Atomic objects: no copy at all
head(vec(mtcars))          # Lists: directly in C

options(oldopts)