Fast Between (Averaging) and (Quasi-)Within (Centering) Transformations
fbetween and fwithin are S3 generics to efficiently obtain between-transformed (averaged) or (quasi-)within-transformed (demeaned) data. These operations can be performed groupwise and/or weighted. B and W are wrappers around fbetween and fwithin representing the 'between-operator' and the 'within-operator'.
(B / W provide more flexibility than fbetween / fwithin when applied to data frames (i.e. column subsetting, formula input, auto-renaming and id-variable-preservation capabilities...), but are otherwise identical.)
Usage
fbetween(x, ...)
fwithin(x, ...)
B(x, ...)
W(x, ...)

## Default S3 method:
fbetween(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## Default S3 method:
fwithin(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## Default S3 method:
B(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## Default S3 method:
W(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)

## S3 method for class 'matrix'
fbetween(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'matrix'
fwithin(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'matrix'
B(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE,
  stub = .op[["stub"]], ...)
## S3 method for class 'matrix'
W(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1,
  stub = .op[["stub"]], ...)

## S3 method for class 'data.frame'
fbetween(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'data.frame'
fwithin(x, g = NULL, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'data.frame'
B(x, by = NULL, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
  fill = FALSE, stub = .op[["stub"]], keep.by = TRUE, keep.w = TRUE, ...)
## S3 method for class 'data.frame'
W(x, by = NULL, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
  mean = 0, theta = 1, stub = .op[["stub"]], keep.by = TRUE, keep.w = TRUE, ...)

# Methods for indexed data / compatibility with plm:

## S3 method for class 'pseries'
fbetween(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'pseries'
fwithin(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'pseries'
B(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'pseries'
W(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)

## S3 method for class 'pdata.frame'
fbetween(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, ...)
## S3 method for class 'pdata.frame'
fwithin(x, effect = 1L, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, ...)
## S3 method for class 'pdata.frame'
B(x, effect = 1L, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
  fill = FALSE, stub = .op[["stub"]], keep.ids = TRUE, keep.w = TRUE, ...)
## S3 method for class 'pdata.frame'
W(x, effect = 1L, w = NULL, cols = is.numeric, na.rm = .op[["na.rm"]],
  mean = 0, theta = 1, stub = .op[["stub"]], keep.ids = TRUE, keep.w = TRUE, ...)

# Methods for grouped data frame / compatibility with dplyr:

## S3 method for class 'grouped_df'
fbetween(x, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE,
         keep.group_vars = TRUE, keep.w = TRUE, ...)
## S3 method for class 'grouped_df'
fwithin(x, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1,
        keep.group_vars = TRUE, keep.w = TRUE, ...)
## S3 method for class 'grouped_df'
B(x, w = NULL, na.rm = .op[["na.rm"]], fill = FALSE, stub = .op[["stub"]],
  keep.group_vars = TRUE, keep.w = TRUE, ...)
## S3 method for class 'grouped_df'
W(x, w = NULL, na.rm = .op[["na.rm"]], mean = 0, theta = 1, stub = .op[["stub"]],
  keep.group_vars = TRUE, keep.w = TRUE, ...)
Arguments
x: a numeric vector, matrix, data frame, 'indexed_series' ('pseries'), 'indexed_frame' ('pdata.frame') or grouped data frame ('grouped_df').
g: a factor, GRP object, or atomic vector / list of vectors (internally grouped with group) used to group x.
by: B and W data.frame method: Same as g, but also allows one- or two-sided formulas i.e. ~ group1 or var1 + var2 ~ group1 + group2. See Examples.
w: a numeric vector of (non-negative) weights. B/W data frame and pdata.frame methods also allow a one-sided formula i.e. ~ weightcol. The grouped_df (dplyr) method supports lazy-evaluation. See Examples.
cols: B/W (p)data.frame methods: Select columns to transform using a function, column names, indices or a logical vector. Default: All numeric columns. Note: cols is ignored if a two-sided formula is passed to by.
na.rm: logical. Skip missing values in x and w when computing averages. If na.rm = FALSE and a NA or NaN is encountered, the average for that group will be NA, and all data points belonging to that group in the output vector will also be NA.
effect: plm methods: Select which panel identifier should be used as grouping variable. 1L takes the first variable in the index, 2L the second etc. Index variables can also be called by name using a character string. If more than one variable is supplied, the corresponding index-factors are interacted.
stub: character. A prefix/stub to add to the names of all transformed columns. TRUE (default) uses "W."/"B.", FALSE will not rename columns.
fill: option to fbetween/B: Logical. TRUE will overwrite missing values in x with the respective average. By default missing values in x are preserved.
mean: option to fwithin/W: The mean to center on, default is 0, but a different mean can be supplied and will be added to the data after the centering is performed. A special option when performing grouped centering is mean = "overall.mean". In that case the overall mean of the data will be added after subtracting out group means.
theta: option to fwithin/W: Double. An optional scalar parameter for quasi-demeaning, i.e. computing x - theta * xi. instead of x - xi. (where xi. denotes the group mean). This is useful for variance components ('random-effects') estimators. See Details.
keep.by, keep.ids, keep.group_vars: B and W data.frame, pdata.frame and grouped_df methods: Logical. Retain grouping / panel-identifier columns in the output. For data frames this only works if grouping variables were passed in a formula.
keep.w: B and W data.frame, pdata.frame and grouped_df methods: Logical. Retain column containing the weights in the output. Only works if w is passed as formula / lazy-expression.
...: arguments to be passed to or from other methods.
Details
Without groups, fbetween/B replaces all data points in x with their mean or weighted mean (if w is supplied). Similarly fwithin/W subtracts the (weighted) mean from all data points i.e. centers the data on the mean.
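The ungrouped case can be sketched in base R as follows. This only mirrors the arithmetic described above and does not call collapse; the data and all variable names are illustrative assumptions.

```r
x <- c(2, 4, 6, 8)
w <- c(1, 1, 1, 5)                 # illustrative non-negative weights

# Unweighted: every element is replaced by, or centered on, mean(x)
x_between <- rep(mean(x), length(x))
x_within  <- x - mean(x)

# Weighted: the weighted mean takes the place of the simple mean
wmean       <- sum(w * x) / sum(w)
x_between_w <- rep(wmean, length(x))
x_within_w  <- x - wmean

sum(x_within)          # centered data sum to zero
sum(w * x_within_w)    # weighted-centered data have a zero weighted sum
```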
With groups supplied to g, the replacement / centering performed by fbetween/B | fwithin/W becomes groupwise. In terms of panel data notation: If x is a vector in such a panel dataset, xit denotes a single data-point belonging to group i in time-period t (t need not be a time-period). Then xi. denotes x, averaged over t. fbetween/B now returns xi. and fwithin/W returns x - xi.. Thus for any data x and any grouping vector g: B(x,g) + W(x,g) = xi. + x - xi. = x. In terms of variance, fbetween/B only retains the variance between group averages, while fwithin/W, by subtracting out group means, only retains the variance within those groups.
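A small base-R sketch of the groupwise identity and the variance statement above, using ave() as a stand-in for the grouped mean. It illustrates the arithmetic only and is not how collapse computes it; the data and names are made up for the example.

```r
x <- c(1, 3, 10, 14, 18)
g <- c("a", "a", "b", "b", "b")

x_between <- ave(x, g)        # group means repeated per observation (FUN defaults to mean)
x_within  <- x - x_between    # deviations from the group means

all.equal(x_between + x_within, x)   # B + W reproduces x
tapply(x_within, g, mean)            # zero mean within every group

# Sum-of-squares decomposition: total = between + within
ss_total   <- sum((x - mean(x))^2)
ss_between <- sum((x_between - mean(x))^2)
ss_within  <- sum(x_within^2)
all.equal(ss_total, ss_between + ss_within)
```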
The data replacement performed by fbetween/B can keep (default) or overwrite missing values (option fill = TRUE) in x. fwithin/W can center data simply (default), or add back a mean after centering (option mean = value), or add the overall mean in groupwise computations (option mean = "overall.mean"). Let x.. denote the overall mean of x, then fwithin/W with mean = "overall.mean" returns x - xi. + x.. instead of x - xi.. This is useful to get rid of group-differences but preserve the overall level of the data. In regression analysis, centering with mean = "overall.mean" will only change the constant term. See Examples.
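The two options can be mimicked in base R as below. This is a sketch of the formulas under the assumption that means are computed over non-missing values; it does not call collapse, and the variable names are illustrative.

```r
x <- c(1, NA, 10, 14, 18)
g <- c("a", "a", "b", "b", "b")

# Group means over the non-missing values
grp_mean <- ave(x, g, FUN = function(v) mean(v, na.rm = TRUE))

# Analogue of fill = FALSE: missing values in x stay missing
between_keep <- ifelse(is.na(x), NA, grp_mean)
# Analogue of fill = TRUE: missing values are overwritten by the group mean
between_fill <- grp_mean

# Analogue of mean = "overall.mean": x - xi. + x..
overall  <- mean(x, na.rm = TRUE)
centered <- x - grp_mean + overall

tapply(centered, g, mean, na.rm = TRUE)   # every group now sits at the overall mean
mean(centered, na.rm = TRUE)              # the overall level is unchanged
```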
If theta != 1, fwithin/W performs quasi-demeaning x - theta * xi.. If mean = "overall.mean", x - theta * xi. + theta * x.. is returned, so that the mean of the partially demeaned data is still equal to the overall data mean x... A numeric value passed to mean will simply be added back to the quasi-demeaned data i.e. x - theta * xi. + mean.
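The three quasi-demeaning variants can be written out in base R as a rough illustration. The value of theta and the data are arbitrary assumptions, and collapse itself is not called.

```r
x <- c(1, 3, 10, 14, 18)
g <- c("a", "a", "b", "b", "b")
theta <- 0.6                  # illustrative value, not an estimate

grp_mean <- ave(x, g)         # xi.
overall  <- mean(x)           # x..

quasi         <- x - theta * grp_mean                     # mean = 0
quasi_overall <- x - theta * grp_mean + theta * overall   # mean = "overall.mean"
quasi_shift   <- x - theta * grp_mean + 5                 # mean = 5

all.equal(mean(quasi_overall), mean(x))   # overall level is preserved

# Limiting cases: theta = 1 is full demeaning, theta = 0 leaves x unchanged
full_demean <- x - 1 * grp_mean
no_demean   <- x - 0 * grp_mean
```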
Now consider a linear panel model $y_{it} = \beta_0 + \beta_1 X_{it} + u_{it}$ with $u_{it} = \alpha_i + \epsilon_{it}$. If $\alpha_i \neq \alpha = \text{const.}$ (there exists individual heterogeneity), then pooled OLS is at least inefficient and inference on $\beta_1$ is invalid. If $E[\alpha_i \mid X_{it}] = 0$ (mean independence of individual heterogeneity $\alpha_i$), the variance components or 'random-effects' estimator provides an asymptotically efficient FGLS solution by estimating a transformed model $y_{it} - \theta y_{i.} = \beta_0 + \beta_1 (X_{it} - \theta X_{i.}) + (u_{it} - \theta u_{i.})$, where $\theta = 1 - \frac{\sigma_\alpha}{\sqrt{\sigma_\alpha^2 + T \sigma_\epsilon^2}}$. An estimate of $\theta$ can be obtained from an estimate $\hat{u}_{it}$ of the errors (the residuals from the pooled model). If $E[\alpha_i \mid X_{it}] \neq 0$, pooled OLS is biased and inconsistent, and taking $\theta = 1$ gives an unbiased and consistent fixed-effects estimator of $\beta_1$. See Examples.
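As a short worked step (a sketch using the notation of the model above), the two limiting values of $\theta$ recover the familiar estimators. Averaging the model over $t$ within group $i$ gives $y_{i.} = \beta_0 + \beta_1 X_{i.} + \alpha_i + \epsilon_{i.}$, so subtracting this average in full removes both the constant and $\alpha_i$:

```latex
\begin{aligned}
\theta = 0: &\quad y_{it} = \beta_0 + \beta_1 X_{it} + u_{it}
  && \text{(pooled OLS on the original data)} \\
\theta = 1: &\quad y_{it} - y_{i.} = \beta_1 (X_{it} - X_{i.}) + (\epsilon_{it} - \epsilon_{i.})
  && \text{(fixed effects, } \alpha_i \text{ eliminated)}
\end{aligned}
```

Values of $\theta$ strictly between 0 and 1 remove only part of the group means, which is the quasi-demeaning used by the random-effects estimator.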
Returns
fbetween/B returns x with every element replaced by its (groupwise) mean (xi.). Missing values are preserved if fill = FALSE (the default). fwithin/W returns x with the (groupwise) mean subtracted from every element (x - theta * xi. + mean or, if mean = "overall.mean", x - theta * xi. + theta * x..). See Details.
References
Mundlak, Yair. 1978. On the Pooling of Time Series and Cross Section Data. Econometrica 46 (1): 69-85.
See Also
fhdbetween/HDB and fhdwithin/HDW, fscale/STD, TRA, Data Transformations, Collapse Overview
Examples
## Simple centering and averaging
head(fbetween(mtcars))
head(B(mtcars))
head(fwithin(mtcars))
head(W(mtcars))
all.equal(fbetween(mtcars) + fwithin(mtcars), mtcars)

## Groupwise centering and averaging
head(fbetween(mtcars, mtcars$cyl))
head(fwithin(mtcars, mtcars$cyl))
all.equal(fbetween(mtcars, mtcars$cyl) + fwithin(mtcars, mtcars$cyl), mtcars)

head(W(wlddev, ~ iso3c, cols = 9:13))    # Center the 5 series in this dataset by country
head(cbind(get_vars(wlddev, "iso3c"),    # Same thing done manually using fwithin..
           add_stub(fwithin(get_vars(wlddev, 9:13), wlddev$iso3c), "W.")))

## Using B() and W() for fixed-effects regressions:

# Several ways of running the same regression with cyl-fixed effects
lm(W(mpg, cyl) ~ W(carb, cyl), data = mtcars)                  # Centering each individually
lm(mpg ~ carb, data = W(mtcars, ~ cyl, stub = FALSE))          # Centering the entire data
lm(mpg ~ carb, data = W(mtcars, ~ cyl, stub = FALSE,           # Here only the intercept changes
                        mean = "overall.mean"))
lm(mpg ~ carb + B(carb, cyl), data = mtcars)                   # Procedure suggested by
# ..Mundlak (1978) - partialling out group averages amounts to the same as demeaning the data
plm::plm(mpg ~ carb, mtcars, index = "cyl", model = "within")  # "Proof"..

# This takes the interaction of cyl, vs and am as fixed effects
lm(W(mpg) ~ W(carb), data = iby(mtcars, id = finteraction(cyl, vs, am)))
lm(mpg ~ carb, data = W(mtcars, ~ cyl + vs + am, stub = FALSE))
lm(mpg ~ carb + B(carb, list(cyl, vs, am)), data = mtcars)

# Now with cyl fixed effects weighted by hp:
lm(W(mpg, cyl, hp) ~ W(carb, cyl, hp), data = mtcars)
lm(mpg ~ carb, data = W(mtcars, ~ cyl, ~ hp, stub = FALSE))
lm(mpg ~ carb + B(carb, cyl, hp), data = mtcars)       # WRONG ! Gives a different coefficient!!

## Manual variance components (random-effects) estimation
res <- HDW(mtcars, mpg ~ carb)[[1]]                    # Get residuals from pooled OLS
sig2_u <- fvar(res)
sig2_e <- fvar(fwithin(res, mtcars$cyl))
T <- length(res) / fndistinct(mtcars$cyl)
sig2_alpha <- sig2_u - sig2_e
theta <- 1 - sqrt(sig2_alpha) / sqrt(sig2_alpha + T * sig2_e)
lm(mpg ~ carb, data = W(mtcars, ~ cyl, theta = theta, mean = "overall.mean", stub = FALSE))

# A slightly different method to obtain theta...
plm::plm(mpg ~ carb, mtcars, index = "cyl", model = "random")