makedata function

Synthetic Data Generation for the Basic Unit-Level SAE Model

This function generates synthetic data (possibly contaminated by outliers) for the basic unit-level SAE model.

makedata(seed = 1024, intercept = 1, beta = 1, n = 4, g = 20, areaID = NULL, ve = 1, ve.contam = 41, ve.epsilon = 0, vu = 1, vu.contam = 41, vu.epsilon = 0)

Arguments

  • seed: [integer] seed value used in set.seed (default seed = 1024).
  • intercept: [numeric] or [NULL] value of the intercept of the fixed-effects model or NULL for a model without intercept (default: intercept = 1).
  • beta: [numeric vector] value of the fixed-effect coefficients (without intercept; default: beta = 1). For each given coefficient, a vector of realizations is drawn from the standard normal distribution.
  • n: [integer] number of units per area in balanced-data setups (default: n = 4).
  • g: [integer] number of areas (default: g = 20).
  • areaID: [integer vector] or [NULL]. To generate synthetic unbalanced data, call makedata with a vector whose elements are (integer-valued) area identifiers. The number of areas is set equal to the number of unique IDs.
  • ve: [numeric] nonnegative value of the model/residual variance.
  • ve.contam: [numeric] nonnegative value of the model variance of the outlier part in a mixture distribution (Tukey-Huber-type contamination model), e = (1-h)*N(0, ve) + h*N(0, ve.contam).
  • ve.epsilon: [numeric] value in [0,1] that defines the relative number of outliers (i.e., epsilon or h in the contamination mixture distribution). Typically, it takes values between 0 and 0.5 (but it is not restricted to this interval).
  • vu: [numeric] value of the (area-level) random-effect variance.
  • vu.contam: [numeric] nonnegative value of the (area-level) random-effect variance of the outlier part in the contamination mixture distribution.
  • vu.epsilon: [numeric] value in [0,1] that defines the relative number of outliers in the contamination mixture distribution of the (area-level) random effects.
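
For unbalanced data, the vector of area identifiers can be built with rep(). The following is a minimal sketch: the area sizes in n_i are made up for illustration, and it assumes that areaID holds one ID per unit. The call to makedata() is guarded so that the block also runs where the rsae package is not installed.

```r
# Hypothetical area sizes (illustration only): five areas of unequal size
n_i <- c(3, 5, 2, 6, 4)

# One integer ID per unit (assumption); the number of areas equals the
# number of unique IDs
areaID <- rep(seq_along(n_i), times = n_i)
table(areaID)

# Pass the IDs to makedata() (only if the rsae package is available)
if (requireNamespace("rsae", quietly = TRUE)) {
    model <- rsae::makedata(areaID = areaID)
    print(model)
}
```

When areaID is supplied, the number of areas is derived from the IDs, so g does not need to be set.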

Details

Let $y_i$ denote an area-specific $n_i$-vector of the response variable for the areas $i = 1, \ldots, g$. Define an $(n_i \times p)$-matrix $X_i$ of realizations from the standard normal distribution, $N(0, 1)$, and let $\beta$ denote a $p$-vector of regression coefficients. Now, the $y_i$ are drawn using the law $y_i \sim N(X_i \beta, \; v_e I_i + v_u J_i)$, with $v_e$ and $v_u$ the variances of the model error and the random effect, respectively, and $I_i$ and $J_i$ denoting the identity matrix and the matrix of ones, respectively.
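
The law above can be illustrated by simulating the random effects and the errors separately and adding them up. This is a sketch in base R, not the package's internal code, and it will not reproduce the random numbers of makedata(). It assumes a single covariate (p = 1) plus an intercept, with values that mirror the function's defaults.

```r
set.seed(1024)

g <- 20         # number of areas
n <- 4          # units per area (balanced setup)
intercept <- 1
beta <- 1       # single slope coefficient (p = 1)
ve <- 1         # model-error variance v_e
vu <- 1         # random-effect variance v_u

area <- rep(seq_len(g), each = n)            # area ID of each unit
x <- rnorm(g * n)                            # covariate drawn from N(0, 1)
u <- rnorm(g, mean = 0, sd = sqrt(vu))       # one random effect per area
e <- rnorm(g * n, mean = 0, sd = sqrt(ve))   # unit-level model errors

# y_ij = intercept + x_ij * beta + u_i + e_ij, which implies
# Cov(y_i) = v_e * I_i + v_u * J_i within each area
y <- intercept + beta * x + u[area] + e

sim <- data.frame(y = y, x = x, area = area)
head(sim)
```

Because all units of an area share the same u_i, any two of them have covariance v_u, which is where the matrix of ones J_i comes from.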

In addition, we allow the distributions of the model/residual error and the area-level random effect to be contaminated (cf. Stahel and Welsh, 1997). Notably, the laws of $e_{ij}$ and $u_i$ are replaced by the Tukey-Huber contamination mixture:

  • $e_{ij} \sim (1 - \epsilon_{ve}) N(0, v_e) + \epsilon_{ve} N(0, v_{e,\epsilon})$
  • $u_i \sim (1 - \epsilon_{vu}) N(0, v_u) + \epsilon_{vu} N(0, v_{u,\epsilon})$

where $\epsilon_{ve}$ and $\epsilon_{vu}$ regulate the degree of contamination; $v_{e,\epsilon}$ and $v_{u,\epsilon}$ define the variance of the contamination part of the mixture distribution.
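
One way to draw from such a contamination mixture is to first decide, for each observation, whether it belongs to the outlier part and then to draw from the corresponding normal distribution. The helper rcontam() below is hypothetical (it is not part of rsae) and only sketches the idea; the package may generate the outliers differently.

```r
# Draw n values from (1 - eps) * N(0, v) + eps * N(0, v_contam).
# rcontam() is a hypothetical helper, not part of rsae.
rcontam <- function(n, v, v_contam, eps) {
    stopifnot(eps >= 0, eps <= 1, v >= 0, v_contam >= 0)
    # TRUE marks a draw from the contamination (outlier) part
    is_outlier <- rbinom(n, size = 1, prob = eps) == 1
    sdev <- ifelse(is_outlier, sqrt(v_contam), sqrt(v))
    structure(rnorm(n, mean = 0, sd = sdev), outlier = is_outlier)
}

set.seed(1024)
e <- rcontam(80, v = 1, v_contam = 41, eps = 0.1)
mean(attr(e, "outlier"))   # realized share of contaminated draws
```

With eps = 0 the mixture reduces to the uncontaminated normal distribution, which corresponds to the default ve.epsilon = 0 and vu.epsilon = 0.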

Four different contamination setups are possible:

  • no contamination (i.e., ve.epsilon = vu.epsilon = 0),
  • contaminated model error (i.e., ve.epsilon != 0 and vu.epsilon = 0),
  • contaminated random effect (i.e., ve.epsilon = 0 and vu.epsilon != 0),
  • both are contaminated (i.e., ve.epsilon != 0 and vu.epsilon != 0).
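
The four setups translate into argument values as sketched below. The contamination share of 0.1 is an arbitrary choice for illustration; the calls to makedata() are guarded so that the block also runs where the rsae package is not installed.

```r
# The four setups expressed as argument values for makedata();
# 0.1 is an arbitrary illustration of a nonzero contamination share
setups <- data.frame(
    setup      = c("none", "model error", "random effect", "both"),
    ve.epsilon = c(0, 0.1, 0, 0.1),
    vu.epsilon = c(0, 0, 0.1, 0.1)
)
setups

# Generate one data set per setup (only if the rsae package is available)
if (requireNamespace("rsae", quietly = TRUE)) {
    models <- lapply(seq_len(nrow(setups)), function(k) {
        rsae::makedata(ve.epsilon = setups$ve.epsilon[k],
                       vu.epsilon = setups$vu.epsilon[k])
    })
    names(models) <- setups$setup
}
```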

Returns

An instance of the class saemodel.

References

Schoch, T. (2012). Robust Unit-Level Small Area Estimation: A Fast Algorithm for Large Datasets. Austrian Journal of Statistics 41, 243-265. https://doi.org/10.17713/ajs.v41i4.1548

Stahel, W. A. and A. Welsh (1997). Approaches to robust estimation in the simplest variance components model. Journal of Statistical Planning and Inference 57, 295-319. https://doi.org/10.1016/S0378-3758(96)00050-X

See Also

saemodel(), fitsaemodel()

Examples

# generate a model with synthetic data
model <- makedata()
model

# summary of the model
summary(model)

  • Maintainer: Tobias Schoch
  • License: GPL-3
  • Last published: 2024-02-06