Synthetic Data Generation for the Basic Unit-Level SAE Model
This function generates synthetic data (possibly contaminated by outliers) for the basic unit-level SAE model.
makedata(seed = 1024, intercept = 1, beta = 1, n = 4, g = 20, areaID = NULL, ve = 1, ve.contam = 41, ve.epsilon = 0, vu = 1, vu.contam = 41, vu.epsilon = 0)
Arguments
seed: [integer] seed value used in set.seed (default seed = 1024).
intercept: [numeric] or [NULL] value of the intercept of the fixed-effects model or NULL for a model without intercept (default: intercept = 1).
beta: [numeric vector] value of the fixed-effect coefficients (without intercept; default: beta = 1). For each given coefficient, a vector of realizations is drawn from the standard normal distribution.
n: [integer] number of units per area in balanced-data setups (default: n = 4).
g: [integer] number of areas (default: g = 20).
areaID: [integer vector] or [NULL]. To generate synthetic unbalanced data, call makedata with a vector whose elements are (integer-valued) area identifiers, one per unit. The number of areas is set equal to the number of unique IDs.
ve: [numeric] nonnegative value of the model/residual variance.
ve.contam: [numeric] nonnegative value of the model variance of the outlier part in a mixture distribution (Tukey-Huber-type contamination model), e = (1 − h) ∗ N(0, ve) + h ∗ N(0, ve.contam), where h is given by ve.epsilon.
ve.epsilon: [numeric] value in [0,1] that defines the relative number of outliers (i.e., epsilon or h in the contamination mixture distribution). Typically, it takes values between 0 and 0.5 (but it is not restricted to this interval).
vu: [numeric] value of the (area-level) random-effect variance.
vu.contam: [numeric] nonnegative value of the (area-level) random-effect variance of the outlier part in the contamination mixture distribution.
vu.epsilon: [numeric] value in [0,1] that defines the relative number of outliers in the contamination mixture distribution of the (area-level) random effects.
Details
Let y[i] denote an area-specific n[i]-vector of the response variable for the areas i=1,...,g. Define an (n[i]×p)-matrix X[i] of realizations from the standard normal distribution, N(0,1), and let β denote a p-vector of regression coefficients. The y[i] are then drawn from the law y[i] ~ N(X[i]β, v[e]I[i] + v[u]J[i]), where v[e] and v[u] are the variances of the model error and the random effect, respectively, and I[i] and J[i] denote the (n[i]×n[i]) identity matrix and matrix of ones, respectively.
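The sampling law can be sketched in base R by adding an area-level random effect u[i] and a unit-level error e[ij] to the linear predictor, which yields the covariance v[e]I[i] + v[u]J[i] within each area. This is a minimal sketch under the stated model and is not taken from the implementation of makedata; the helper name simulate_unit_level and its internal variable names are hypothetical.

```r
# Minimal sketch (hypothetical helper, not the package's implementation):
# draw balanced data y[ij] = intercept + x[ij] * beta + u[i] + e[ij]
simulate_unit_level <- function(g = 20, n = 4, intercept = 1, beta = 1,
                                ve = 1, vu = 1, seed = 1024) {
  set.seed(seed)
  area <- rep(seq_len(g), each = n)           # area identifier of each unit
  p <- length(beta)
  X <- matrix(rnorm(g * n * p), ncol = p)     # covariates drawn from N(0, 1)
  u <- rnorm(g, mean = 0, sd = sqrt(vu))      # one random effect per area
  e <- rnorm(g * n, mean = 0, sd = sqrt(ve))  # one model error per unit
  y <- intercept + as.vector(X %*% beta) + u[area] + e
  data.frame(y = y, X, area = area)
}

dat <- simulate_unit_level()
```

Units in the same area share the same u[i], which is what induces the v[u]J[i] term; the errors e[ij] contribute the diagonal v[e]I[i] term.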
In addition, we allow the distributions of the model/residual error and the area-level random effect to be contaminated (cf. Stahel and Welsh, 1997). Notably, the laws of e[ij] and u[i] are replaced by the Tukey-Huber contamination mixtures
e[ij] ~ (1 − ϵ[ve]) N(0, v[e]) + ϵ[ve] N(0, v[e,ϵ]),
u[i] ~ (1 − ϵ[vu]) N(0, v[u]) + ϵ[vu] N(0, v[u,ϵ]),
where ϵ[ve] and ϵ[vu] regulate the degree of contamination; v[e,ϵ] and v[u,ϵ] define the variance of the contamination part of the mixture distribution.
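A draw from a Tukey-Huber-type mixture of the form (1 − h) ∗ N(0, v) + h ∗ N(0, v.contam) can be sketched in base R as follows. The helper rcontam is hypothetical, and flagging each observation as an outlier independently with probability h is an assumption; makedata may select the contaminated observations differently.

```r
# Minimal sketch (hypothetical helper): draw m values from the mixture
# (1 - eps) * N(0, v) + eps * N(0, v_contam)
rcontam <- function(m, v = 1, v_contam = 41, eps = 0) {
  # assumption: each draw is an outlier independently with probability eps
  is_outlier <- rbinom(m, size = 1, prob = eps) == 1
  sd_each <- ifelse(is_outlier, sqrt(v_contam), sqrt(v))
  rnorm(m, mean = 0, sd = sd_each)
}

set.seed(1024)
e_clean <- rcontam(10000, v = 1, v_contam = 41, eps = 0)
e_dirty <- rcontam(10000, v = 1, v_contam = 41, eps = 0.1)
```

The variance of the mixture is (1 − h) ∗ v + h ∗ v.contam, so with v = 1, v.contam = 41 and h = 0.1 the theoretical variance is 0.9 + 4.1 = 5, compared with 1 in the uncontaminated case.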
Four different contamination setups are possible:
no contamination (i.e., ve.epsilon = vu.epsilon = 0),
contaminated model error (i.e., ve.epsilon != 0 and vu.epsilon = 0),
contaminated random effect (i.e., ve.epsilon = 0 and vu.epsilon != 0),
both are contaminated (i.e., ve.epsilon != 0 and vu.epsilon != 0).
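The four setups correspond to the following argument combinations. This is a sketch: the contamination fraction 0.1 is an arbitrary illustrative value, the list names are hypothetical labels, and makedata is only called if it is available in the current session (that is, if the package providing it has been attached).

```r
# Argument combinations for the four contamination setups;
# 0.1 is an arbitrary illustrative contamination fraction
setups <- list(
  none          = list(ve.epsilon = 0,   vu.epsilon = 0),
  model_error   = list(ve.epsilon = 0.1, vu.epsilon = 0),
  random_effect = list(ve.epsilon = 0,   vu.epsilon = 0.1),
  both          = list(ve.epsilon = 0.1, vu.epsilon = 0.1)
)

# Generate one synthetic data set per setup, provided makedata() is available
if (exists("makedata", mode = "function")) {
  models <- lapply(setups, function(args) do.call(makedata, args))
}
```

All other arguments keep their defaults, so the four data sets share the same seed and differ only in which components are contaminated.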
Returns
An instance of the class saemodel.
References
Schoch, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. Austrian Journal of Statistics 41, 243--265. https://doi.org/10.17713/ajs.v41i4.1548
Stahel, W. A. and A. Welsh (1997). Approaches to robust estimation in the simplest variance components model. Journal of Statistical Planning and Inference 57, 295--319. https://doi.org/10.1016/S0378-3758(96)00050-X
See Also
saemodel(), fitsaemodel()
Examples
# generate a model with synthetic data
model <- makedata()
model
# summary of the model
summary(model)