Given a variance estimation function (specific to a survey), define_variance_wrapper defines a variance estimation wrapper that is easier to use (e.g. automatic domain estimation, linearization).
variance_function: An R function. It is the methodological workhorse of the variance estimation: from a set of arguments including the variables of interest (see below), it should return a vector of estimated variances. See Details.
reference_id: A vector containing the ids of all the responding units of the survey. It can also be an unevaluated expression (enclosed in quote()) to be evaluated within the execution environment of the wrapper. It is compared with the identifying variable of the survey file (see default_id below) to check whether some observations are missing in the survey file. The matrix of variables of interest passed on to variance_function has reference_id as rownames and is ordered according to its values.
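The check and the ordering described for reference_id can be pictured with a minimal base R sketch. The ids, the survey data frame and the choice to fill missing units with 0 are hypothetical and made only for illustration; this is not the package's internal code.

```r
# Hypothetical ids: the reference (all responding units) and a survey file
reference_id <- c("a", "b", "c", "d")
survey <- data.frame(
  id_unit = c("d", "b", "a"),  # unit "c" is absent from the file
  y = c(4, 2, 1),
  stringsAsFactors = FALSE
)

# Units of the reference that are missing from the survey file
missing_id <- setdiff(reference_id, survey$id_unit)

# Matrix of variables of interest: one row per reference unit, with
# reference_id as rownames and in the same order. Missing units are set
# to 0 here (an assumption made for this sketch).
y <- matrix(
  0, nrow = length(reference_id), ncol = 1,
  dimnames = list(reference_id, "y")
)
y[survey$id_unit, "y"] <- survey$y
```

Here missing_id flags unit "c", and the rows of y follow reference_id regardless of the row order of the survey file.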
reference_weight: A vector containing the reference weight of the survey. It can also be an unevaluated expression (enclosed in quote()) to be evaluated within the execution environment of the wrapper.
default_id: A character vector of length 1, the name of the default identifying variable in the survey file. It can also be an unevaluated expression (enclosed in quote()) to be evaluated within the survey file.
technical_data: A named list of technical data needed to perform the variance estimation (e.g. sampling strata, first- or second-order probabilities of inclusion, estimated response probabilities, calibration variables). Its names should match the names of the corresponding arguments in variance_function.
technical_param: A named list of technical parameters used to control some aspect of the variance estimation process (e.g. alternative methodology). Its names should match the names of the corresponding arguments in variance_function.
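Why the names of technical_data and technical_param have to match the arguments of variance_function can be seen in a small base R sketch. The function toy_variance and its arguments strata and scale are hypothetical, and the use of do.call() is an assumption about how such lists could be passed on, not a description of the package internals.

```r
# Hypothetical variance function: `y` is the data argument, `strata` a
# technical data argument and `scale` a technical parameter
toy_variance <- function(y, strata, scale = 1) {
  # For each column: sum of within-stratum variances, multiplied by `scale`
  apply(y, 2, function(col) scale * sum(tapply(col, strata, var)))
}

technical_data <- list(strata = c("s1", "s1", "s2", "s2"))
technical_param <- list(scale = 2)

y <- matrix(c(1, 3, 2, 6), ncol = 1, dimnames = list(NULL, "y"))

# Because the list names match the argument names, both lists can be
# spliced directly into the call
result <- do.call(toy_variance, c(list(y = y), technical_data, technical_param))
```

A list element whose name does not correspond to an argument of the function would make the call fail with an "unused argument" error.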
objects_to_include: (Advanced use) A character vector indicating the names of additional R objects to include within the variance wrapper.
Returns
An R function that makes the estimation of variance based on the provided variance function easier. Its parameters are:
data: one or more calls to a statistic wrapper (e.g. total(), mean(), ratio()). See examples and standard statistic wrappers.
where: a logical vector indicating a domain on which the variance estimation is to be performed
by: a qualitative variable whose levels are used to define domains on which the variance estimation is performed
alpha: a numeric vector of length 1 indicating the threshold for confidence interval derivation (0.05 by default)
display: a logical vector of length 1 indicating whether the result of the estimation should be displayed or not
id: a character vector of length 1 containing the name of the identifying variable in the survey file. Its default value depends on the value of default_id in define_variance_wrapper
envir: an environment containing a binding to data
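As an illustration of the role of alpha, the following base R sketch derives a two-sided confidence interval from a point estimate and its estimated variance. The numeric values are hypothetical, and the normal approximation is an assumption of this sketch rather than a statement about how the wrapper computes its intervals.

```r
# Hypothetical point estimate and estimated variance
estimate <- 100
variance <- 25
alpha <- 0.05

# Standard error and two-sided confidence interval at level 1 - alpha,
# under a normal approximation
std <- sqrt(variance)
z <- qnorm(1 - alpha / 2)
ci <- c(lower = estimate - z * std, upper = estimate + z * std)
```

A smaller alpha gives a wider interval for the same estimated variance.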
Details
Defining variance estimation wrappers is the key feature of the gustave package. It is the workhorse of the ready-to-use qvar function and should be used directly to handle more complex cases (e.g. surveys with several stages or balanced sampling).
Analytical variance estimation is often difficult for non-specialists to carry out, owing to the complexity of the underlying sampling and estimation methodology. This complexity yields complex variance estimation functions, which are most often used only by the sampling expert who actually wrote them. A variance estimation wrapper is an intermediate function that is "wrapped around" the (complex) variance estimation function in order to provide the non-specialist with user-friendly features (see examples):
calculation of complex statistics (see standard statistic wrappers)
domain estimation
handy evaluation and factor discretization
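Domain estimation, one of the features listed above, is commonly handled by setting the variable of interest to zero outside the domain before applying the variance function. The base R sketch below shows this standard device on hypothetical data; it is not taken from the package source.

```r
# Hypothetical variable of interest and logical domain indicator
y <- c(10, 20, 30, 40)
where <- c(TRUE, FALSE, TRUE, FALSE)

# Restriction to a domain: units outside the domain contribute 0
y_domain <- y * where

# With a qualitative variable, one column per level, each column being
# the variable of interest restricted to the corresponding domain
by <- factor(c("a", "b", "a", "b"))
y_by <- sapply(levels(by), function(l) y * (by == l))
```

Each column of y_by can then be passed to the variance function like any other variable of interest.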
define_variance_wrapper allows the sampling expert to define a variance estimation wrapper around a given variance estimation function and set its default parameters. The produced variance estimation wrapper is standalone in the sense that it contains all technical data necessary to carry out the estimation (see technical_data).
The arguments of the variance_function fall into three types:
the data argument (mandatory, only one allowed): the numerical matrix of variables of interest to apply the variance estimation formula on
technical data arguments (optional, one or more allowed): technical and methodological information used by the variance estimation function (e.g. sampling strata, first- or second-order probabilities of inclusion, estimated response probabilities, calibration variables)
technical parameters (optional, one or more allowed): non-data arguments to be used to control some aspect of the variance estimation (e.g. alternative methodology)
technical_data and technical_param are used to determine which arguments of variance_function relate to technical information; the only remaining argument is considered as the data argument.
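The identification of the data argument by elimination can be sketched in base R as follows. The function toy_variance and the list contents are hypothetical, and the use of formals() and setdiff() is an assumption about one way to implement the rule, not the package's actual code.

```r
# Hypothetical variance function and technical lists
toy_variance <- function(y, strata, scale = 1) NULL
technical_data <- list(strata = c("s1", "s2"))
technical_param <- list(scale = 2)

# Arguments declared by the variance function
arg_names <- names(formals(toy_variance))

# Arguments accounted for by technical information
technical_names <- c(names(technical_data), names(technical_param))

# Whatever remains is taken as the data argument; exactly one is expected
data_arg <- setdiff(arg_names, technical_names)
stopifnot(length(data_arg) == 1)
```

In this sketch data_arg is "y"; the final check would fail if zero or several arguments were left over.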
Examples
### Example from the Labour force survey (LFS)

# The (simulated) Labour force survey (LFS) has the following characteristics:
# - first sampling stage: balanced sampling of 4 areas (each corresponding to
#   about 120 dwellings) on first-order probability of inclusion (proportional to
#   the number of dwellings in the area) and total annual income in the area.
# - second sampling stage: in each sampled area, simple random sampling of 20
#   dwellings
# - neither non-response nor calibration

# As this is a multi-stage sampling design with balanced sampling at the first
# stage, the qvar function does not apply. A variance wrapper can nonetheless
# be defined using the core define_variance_wrapper function.

# Step 1 : Definition of the variance function and the corresponding technical data

# In this context, the variance estimation function specific to the LFS
# survey can be defined as follows:

var_lfs <- function(y, ind, dwel, area){

  variance <- list()

  # Variance associated with the sampling of the dwellings
  y <- sum_by(y, ind$id_dwel)
  variance[["dwel"]] <- var_srs(
    y = y, pik = dwel$pik_dwel, strata = dwel$id_area,
    w = (1 / dwel$pik_area^2 - dwel$q_area)
  )

  # Variance associated with the sampling of the areas
  y <- sum_by(y = y, by = dwel$id_area, w = 1 / dwel$pik_dwel)
  variance[["area"]] <- varDT(y = y, precalc = area)

  Reduce(`+`, variance)

}

# where y is the matrix of variables of interest and ind, dwel and area the technical data:

technical_data_lfs <- list()

# Technical data at the area level
# The varDT function allows for the pre-calculation of
# most of the methodological quantities needed.
technical_data_lfs$area <- varDT(
  y = NULL,
  pik = lfs_samp_area$pik_area,
  x = as.matrix(lfs_samp_area[c("pik_area", "income")]),
  id = lfs_samp_area$id_area
)

# Technical data at the dwelling level
# In order to implement Rao (1975) formula for two-stage samples,
# we associate each dwelling with the diagonal term corresponding
# to its area in the first-stage variance estimator:
lfs_samp_dwel$q_area <- with(technical_data_lfs$area, setNames(diago, id))[lfs_samp_dwel$id_area]
technical_data_lfs$dwel <- lfs_samp_dwel[c("id_dwel", "pik_dwel", "id_area", "pik_area", "q_area")]

# Technical data at the individual level
technical_data_lfs$ind <- lfs_samp_ind[c("id_ind", "id_dwel", "sampling_weight")]

# Test of the variance function var_lfs
y <- matrix(as.numeric(lfs_samp_ind$unemp), ncol = 1, dimnames = list(lfs_samp_ind$id_ind))
with(technical_data_lfs, var_lfs(y = y, ind = ind, dwel = dwel, area = area))

# Step 2 : Definition of the variance wrapper

# Call of define_variance_wrapper
precision_lfs <- define_variance_wrapper(
  variance_function = var_lfs,
  technical_data = technical_data_lfs,
  reference_id = technical_data_lfs$ind$id_ind,
  reference_weight = technical_data_lfs$ind$sampling_weight,
  default_id = "id_ind"
)

# Test
precision_lfs(lfs_samp_ind, unemp)

# The variance wrapper precision_lfs has the same features
# as variance wrappers produced by the qvar function (see
# qvar examples for more details).
References
Rao, J.N.K. (1975), "Unbiased variance estimation for multistage designs", Sankhya, C n°37