grouped_resample function

Function for performing simple or Dirichlet resampling

The function can be used for standard bootstrapping or for subsampling [1]. Samples can be drawn with or without replacement, by group, and with or without Dirichlet weights [2]. This gives researchers several options for correcting sample biases, estimating empirical confidence intervals, and subsampling large data sets.
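To make the idea of grouped, optionally Dirichlet-weighted sampling concrete, here is a minimal base-R sketch. It is not the package's implementation: the helper name sketch_grouped_sample, its arguments, and the toy data are all hypothetical, and drawing the weights as normalised Exp(1) variates (which gives a flat Dirichlet) is an assumption about what "Dirichlet weights" means here.

```r
# Hypothetical helper, not part of the package: draws one grouped sample.
sketch_grouped_sample <- function(data, group_col, sizes, replace = FALSE,
                                  dirichlet = FALSE) {
  # sizes: named vector; names are group values, entries are rows to draw
  pieces <- lapply(names(sizes), function(g) {
    rows <- which(as.character(data[[group_col]]) == g)
    prob <- NULL
    if (dirichlet) {
      # Normalised Exp(1) draws follow a Dirichlet(1, ..., 1) distribution
      w <- rexp(length(rows))
      prob <- w / sum(w)
    }
    # Sample positions within 'rows' so single-row groups behave correctly
    picked <- rows[sample.int(length(rows), size = sizes[[g]],
                              replace = replace, prob = prob)]
    data[picked, , drop = FALSE]
  })
  do.call(rbind, pieces)
}

set.seed(1)
# Toy data; the column names are illustrative only
toy <- data.frame(id = 1:20, x = rnorm(20),
                  group = rep(c("a", "b"), each = 10))
out <- sketch_grouped_sample(toy, "group", c(a = 3, b = 5), dirichlet = TRUE)
table(out$group)
```

With replace = FALSE, each group's requested size must not exceed the number of rows available in that group, otherwise sample.int() stops with an error.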

grouped_resample(in_data = NULL, grp_vector = NULL, grp_matrix = NULL, replace = FALSE, option = "Simple", number_samples = 1, nworkers = NULL, rseed = NULL)

Arguments

  • in_data: The initial data frame that must be re-sampled. It must contain:

    1. an ID variable
    2. the variables of interest
    3. a grouping variable
  • grp_vector: The name of the grouping variable in the data frame, given as a character string, for example "group"

  • grp_matrix: A matrix that contains

    1. the variable 'Group_ID', listing all available values of the grouping variable
    2. the variable 'Resample_Size', giving the size of the sample to be drawn for each grouping value
  • replace: A logical input: TRUE to sample with replacement, FALSE to sample without replacement

  • option: A character input with the following possible values

    1. "Simple", if we want to perform a simple re-sampling
    2. "Dirichlet", if we want to perform a Dirichlet weighted re-sampling
  • number_samples: The number of samples to be created. If it is greater than one, then parallel processing is used.

  • nworkers: The number of logical processors that will be used for parallel computing (usually twice the number of available physical cores)

  • rseed: The random seed that will be used for sampling. Useful for reproducible results
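As an illustration of the grp_matrix argument, the sketch below builds a table of resample sizes proportional to each group's share of the data. The toy data frame, its column names and target_n are assumptions made for the example; only the column names 'Group_ID' and 'Resample_Size' come from the argument description above.

```r
# Toy data; 'id', 'value' and 'group' are illustrative column names
toy <- data.frame(id = 1:200,
                  value = seq_len(200) / 10,
                  group = rep(1:4, times = c(80, 60, 40, 20)))

# Approximate total size of the resample (hypothetical choice)
target_n <- 20

# Share of each group in the data, then proportional resample sizes
shares <- table(toy$group) / nrow(toy)
grp_matrix <- data.frame(Group_ID = as.integer(names(shares)),
                         Resample_Size = as.integer(round(target_n * shares)))

# Without replacement, no group may be asked for more rows than it has
stopifnot(all(grp_matrix$Resample_Size <= as.integer(table(toy$group))))
grp_matrix
```

Because each size is rounded separately, the sizes need not sum exactly to target_n in general, so it is worth checking sum(grp_matrix$Resample_Size) before resampling.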

Returns

It returns a list of number_samples data frames with exactly the same variables as the input data frame, except that the group variable now contains only the given values from the input data frame.
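Since the result is a plain list of data frames, ordinary list tools apply to it. The sketch below uses a hand-made stand-in list rather than real output of grouped_resample, so the object and column names are hypothetical; it shows one way to check the sample sizes and to stack all samples into a single data frame with a sample index.

```r
set.seed(42)
# Stand-in for the returned list: three data frames with identical columns
samples <- lapply(1:3, function(i) {
  data.frame(id = sample(100, 5), group = 1L)
})

# Number of rows in each sample
sizes <- sapply(samples, nrow)

# Stack the samples, adding a column that records which sample a row came from
stacked <- do.call(rbind, Map(function(d, i) cbind(sample_id = i, d),
                              samples, seq_along(samples)))
head(stacked)
```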

References

[1] D. N. Politis, J. P. Romano, M. Wolf, Subsampling (Springer-Verlag, New York, 1999).

[2] Baath R (2018). bayesboot: An Implementation of Rubin's (1981) Bayesian Bootstrap. R package version 0.2.2, URL https://CRAN.R-project.org/package=bayesboot

Author(s)

David Midgley

See Also

dirichlet_sample

Examples

## Load absolute temperature data set:
data("AbsoluteTemperature")
df <- AbsoluteTemperature
## Find portions for climate zones
pcs <- table(df$z)/dim(df)[1]
## Choose the approximate size of the new sample and compute resample sizes
N <- round(sqrt(nrow(AbsoluteTemperature)))
resamplesizes=as.integer(round(N*pcs))
sum(resamplesizes)
## Create the grouping matrix
groupmat <- data.frame("Group_ID"=1:4,"Resample_Size"=resamplesizes)
groupmat
## Simple resampling:
resample_simple <- grouped_resample(in_data = df, grp_vector = "z", grp_matrix = groupmat,
                                    replace = FALSE, option = "Simple",
                                    number_samples = 1, nworkers = NULL,
                                    rseed = 20191220)
cat(dim(resample_simple[[1]]),"\n")
## Dirichlet resampling:
resample_dirichlet <- grouped_resample(in_data = df, grp_vector = "z", grp_matrix = groupmat,
                                       replace = FALSE, option = "Dirichlet",
                                       number_samples = 1, nworkers = NULL,
                                       rseed = 20191220)
cat(dim(resample_dirichlet[[1]]),"\n")
##
# ## Work in parallel and create many samples
# ## Choose a random seed
# nseed <- 20191119
# ## Simple
# reslist1 <- grouped_resample(in_data = df, grp_vector = "z", grp_matrix = groupmat,
#                              replace = FALSE, option = "Simple",
#                              number_samples = 10, nworkers = NULL,
#                              rseed = nseed)
# sapply(reslist1, dim)
# ## Dirichlet
# reslist2 <- grouped_resample(in_data = df, grp_vector = "z", grp_matrix = groupmat,
#                              replace = FALSE, option = "Dirichlet",
#                              number_samples = 10, nworkers = NULL,
#                              rseed = nseed)
# sapply(reslist2, dim)
# ## Check for same rows between 1st sample of 'Simple' and 1st sample of 'Dirichlet' ...
# mapply(function(x,y){sum(rownames(x)%in%rownames(y))},reslist1,reslist2)
#

  • Maintainer: Demetris Christopoulos
  • License: GPL (>= 2)
  • Last published: 2024-05-23