blockData function

Blocks two data sets on one or more variables prior to conducting a merge.

blockData(dfA, dfB, varnames, window.block, window.size, kmeans.block, nclusters, iter.max, n.cores)

Arguments

  • dfA: Dataset A - to be matched to Dataset B
  • dfB: Dataset B - to be matched to Dataset A
  • varnames: A vector of variable names to use for blocking. Must be present in both dfA and dfB
  • window.block: A vector of variable names to be blocked using window blocking. Must be present in varnames.
  • window.size: The size of the window for window blocking. Default is 1 (observations +/- 1 on the specified variable will be blocked together).
  • kmeans.block: A vector of variable names to be blocked using k-means blocking. Must be present in varnames.
  • nclusters: Number of clusters to create with k-means. Default value is the number of clusters where the average cluster size is 100,000 observations.
  • iter.max: Maximum number of iterations for the k-means algorithm to run. Default is 5000.
  • n.cores: Number of cores to parallelize over. Default is NULL.
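
One plausible reading of the window.size rule, sketched in base R: a pair of observations falls in the same window when their values on the blocking variable differ by at most window.size. The vectors a_years and b_years are hypothetical, and this illustrates the idea only; it is not the package's internal implementation.

```r
# Hypothetical values of a numeric blocking variable in each data set.
a_years <- c(1980, 1985, 1990)
b_years <- c(1979, 1981, 1986, 2000)

# Window size as described in the argument list (the default is 1).
window.size <- 1

# For each observation in A, find the rows of B whose value lies
# within +/- window.size of it.
in_window <- lapply(a_years, function(y) {
  which(abs(b_years - y) <= window.size)
})
names(in_window) <- paste0("A_row_", seq_along(a_years))

in_window
```

A larger window.size widens each comparison set, which trades more candidate pairs for a lower risk of separating true matches whose values differ slightly.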

Returns

A list with an entry for each block. Each list entry contains two vectors --- one with the indices indicating the block members in dataset A, and another containing the indices indicating the block members in dataset B.
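
The sketch below shows how a return value of this shape can be used to pull out the rows belonging to one block. To stay runnable without the package, it builds a stand-in list by exact blocking on a single variable in base R. The toy data and the element names block.1, dfA.inds, and dfB.inds are assumptions made for illustration; inspect the real output with str() to confirm the names before relying on them.

```r
# Toy data sets; the column names and values are hypothetical.
dfA <- data.frame(id = 1:4,
                  city = c("Boston", "Austin", "Boston", "Denver"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(id = 101:103,
                  city = c("Austin", "Boston", "Austin"),
                  stringsAsFactors = FALSE)

# Stand-in for the blocking output: one entry per value of `city`
# that occurs in both data sets, each holding two index vectors.
shared <- sort(intersect(dfA$city, dfB$city))
block_out <- lapply(shared, function(v) {
  list(dfA.inds = which(dfA$city == v),   # assumed element name
       dfB.inds = which(dfB$city == v))   # assumed element name
})
names(block_out) <- paste0("block.", seq_along(block_out))

# Number of observations from each data set in every block.
block_sizes <- t(sapply(block_out, function(b) {
  c(nA = length(b$dfA.inds), nB = length(b$dfB.inds))
}))

# Subset both data sets to the members of the first block.
dfA_block1 <- dfA[block_out$block.1$dfA.inds, ]
dfB_block1 <- dfB[block_out$block.1$dfB.inds, ]
```

Each pair of subsets can then be merged on its own, so comparisons are confined to records that share a block.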

Examples

## Not run:
block_out <- blockData(dfA, dfB, varnames = c("city", "birthyear"))
## End(Not run)

  • Maintainer: Ted Enamorado
  • License: GPL (>= 3)
  • Last published: 2023-11-17
