Calculate information loss after targeted record swapping
Calculate information loss after targeted record swapping using both the original and the swapped micro data. Information loss is calculated on the table counts defined by parameter table_vars, using either the implemented information loss measures (absolute deviation, relative absolute deviation and absolute deviation of square roots) or a custom metric; see Details below.
data: original micro data set, must be either a data.table or data.frame.
data_swapped: micro data set after targeted record swapping was applied. Must be either a data.table or data.frame.
table_vars: column names in both data and data_swapped. Defines the variables over which a (multidimensional) frequency table is constructed. Information loss is then calculated by applying the metrics in metric and custom_metric to the cell counts and margin counts of the tables from data and data_swapped.
metric: character vector containing one or more of the already implemented metrics: "absD", "relabsD" and/or "abssqrtD".
custom_metric: function or (named) list of functions. Functions defined here must be of the form fun(x,y,...) where x and y expect numeric values of the same length. The output of these functions must be a numeric vector of the same length as x and y.
hid: NULL or character containing the household id column in data and data_swapped. If not NULL, frequencies reflect the number of households; otherwise they reflect the number of persons.
probs: numeric vector containing values in the interval [0,1]. Defines the quantiles reported in measures, see also return values.
quantvals: optional numeric vector which defines the groups used for the cumulative outputs. It is applied to the results m from each information loss metric as cut(m,breaks=quantvals,include.lowest=TRUE), see also return values.
apply_quantvals: character vector naming the metrics to whose output quantvals should be applied.
exclude_zeros: TRUE or FALSE. If TRUE, cells that are 0 in the frequency table built from data_swapped are ignored.
only_inner_cells: TRUE or FALSE. If TRUE, only the inner cells of the frequency table defined by table_vars are compared. Otherwise all table margins are calculated as well.
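The contract for custom_metric and the grouping done by quantvals can be illustrated with base R alone. The following is a minimal sketch, not the package's internal code: the cell counts are made up, and the way the cumulative counts and percentages are assembled is an assumption based on the description of the return values.

```r
# A custom metric takes x and y (plus ...) and returns a numeric vector
# of the same length; squared difference is used here as an example.
squareD <- function(x, y, ...) (x - y)^2

# Made-up cell counts from an original and a swapped frequency table.
count_o <- c(10, 4, 0, 7, 25)
count_s <- c(12, 4, 1, 2, 25)

m <- squareD(count_o, count_s)

# Group the metric values as stated in the description of quantvals.
quantvals <- c(0, 1, 5, 30)
groups <- cut(m, breaks = quantvals, include.lowest = TRUE)

# Assumed layout of a cumulative output: cumulative number of cells (cnt)
# and cumulative percentage (pct) per group (cat).
cnt <- cumsum(table(groups))
pct <- 100 * cnt / length(m)
cumdistr <- data.frame(cat = names(cnt),
                       cnt = as.vector(cnt),
                       pct = as.vector(pct))
print(cumdistr)
```

Because include.lowest = TRUE, a metric value equal to the smallest break (here 0) falls into the first group instead of becoming NA.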
Returns
Returns a list containing:
* cellvalues: data.table showing, in long format, the frequency counts of each table cell for data (count_o) and data_swapped (count_s).
* overview: data.table containing the distribution of the noise in number of cells and percentage. The noise is calculated as the difference between the cell values of the frequency tables generated from the original and the swapped data.
* measures: data.table containing the quantiles and mean (column what) of the distribution of the information loss metrics applied to each table cell. The quantiles are defined by parameter probs.
* cumdistr*: data.table containing the cumulative distribution of the information loss metrics. The distribution is shown in number of cells (cnt) and percentage (pct). Column cat shows all unique values of the information loss metric or the grouping defined by quantvals.
* false_zero: number of table cells which are non-zero when using data and zero when using data_swapped.
* false_nonzero: number of table cells which are zero when using data and non-zero when using data_swapped.
* exclude_zeros: value passed to exclude_zeros when calling the function.
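To make the returned summaries more concrete, the sketch below derives comparable quantities by hand from two made-up count vectors using base R. The column names and the exact layout are illustrative assumptions and may differ from what the function returns.

```r
# Made-up cell counts for the same cells of the original and swapped tables.
count_o <- c(5, 0, 3, 2, 0, 8)
count_s <- c(5, 1, 0, 2, 0, 9)

# Noise: difference between original and swapped cell counts,
# tabulated as number of cells and percentage.
noise <- count_o - count_s
overview <- as.data.frame(table(noise), responseName = "cells")
overview$pct <- 100 * overview$cells / length(noise)

# Cells that became zero, and cells that became non-zero, after swapping.
false_zero    <- sum(count_o != 0 & count_s == 0)
false_nonzero <- sum(count_o == 0 & count_s != 0)

# Quantiles (defined by probs) and mean of one metric over all cells.
probs <- c(0.1, 0.5, 0.9)
absD <- abs(count_o - count_s)
q <- quantile(absD, probs)
measures <- data.frame(what = c(names(q), "Mean"),
                       absD = unname(c(q, mean(absD))))

print(overview)
print(measures)
```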
Details
First, frequency tables are built from both data and data_swapped using the variables defined in table_vars. By default all table margins are calculated as well, see parameter only_inner_cells = FALSE. After that, the information loss metrics defined in either metric or custom_metric are applied to each of the table cells from both frequency tables. This is done in the sense of metric(x,y), where metric is the information loss, x a cell from the table created from data and y the same cell from the table created from data_swapped. One or more custom metrics can be applied using the parameter custom_metric, see also examples.
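The procedure described above can be sketched in base R on a toy data set. This is an illustration under stated assumptions, not the function's implementation: the data and column names are invented, and the formulas for absD, relabsD and abssqrtD are assumed from their names (absolute deviation, relative absolute deviation, absolute deviation of square roots).

```r
# Toy person-level data; the column names are made up for this sketch.
orig <- data.frame(
  region = c("A", "A", "A", "B", "B", "C"),
  sex    = c("m", "f", "f", "m", "f", "m"),
  stringsAsFactors = FALSE
)
# Pretend that swapping moved the first record from region A to region C.
swapped <- orig
swapped$region[1] <- "C"

# Use shared factor levels so both tables contain exactly the same cells.
for (v in names(orig)) {
  lv <- sort(unique(c(as.character(orig[[v]]), as.character(swapped[[v]]))))
  orig[[v]]    <- factor(orig[[v]], levels = lv)
  swapped[[v]] <- factor(swapped[[v]], levels = lv)
}

# Frequency tables including all margins (the only_inner_cells = FALSE case).
tab_o <- addmargins(table(orig))
tab_s <- addmargins(table(swapped))

x <- as.vector(tab_o)  # cell and margin counts from the original data
y <- as.vector(tab_s)  # the same cells from the swapped data

# Assumed definitions of the three implemented metrics.
absD     <- function(x, y) abs(x - y)
relabsD  <- function(x, y) abs(x - y) / x   # NaN or Inf where x is 0
abssqrtD <- function(x, y) abs(sqrt(x) - sqrt(y))

# metric(x, y) applied cell by cell.
loss <- data.frame(count_o = x, count_s = y,
                   absD = absD(x, y),
                   relabsD = relabsD(x, y),
                   abssqrtD = abssqrtD(x, y))
print(loss)
```

Moving one record changes two inner cells and two row margins, while the column margins and the grand total stay the same, so margins can show information loss that is not visible in the totals alone.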
Examples
# generate dummy data
seed <- 2021
set.seed(seed)
nhid <- 10000
dat <- createDat(nhid)

# define parameters for swapping
k_anonymity <- 1
swaprate <- .05
similar <- list(c("hsize"))
hier <- c("nuts1", "nuts2")
carry_along <- c("nuts3", "lau2")
risk_variables <- c("ageGroup", "national")
hid <- "hid"

# # apply record swapping
# dat_s <- recordSwap(data = dat, hid = hid, hierarchy = hier,
#                     similar = similar, swaprate = swaprate,
#                     k_anonymity = k_anonymity,
#                     risk_variables = risk_variables,
#                     carry_along = carry_along,
#                     return_swapped_id = TRUE,
#                     seed = seed)
#
# # calculate information loss
# # for the table nuts2 x national
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#                   table_vars = c("nuts2", "national"))
# iloss$measures       # distribution of information loss measures
# iloss$false_zero     # no false zeros
# iloss$false_nonzero  # no false non-zeros
#
# # frequency tables of households across
# # nuts2 x hincome
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#                   table_vars = c("nuts2", "hincome"),
#                   hid = "hid")
# iloss$measures
#
# # define custom metric
# squareD <- function(x, y) {
#   (x - y)^2
# }
#
# iloss <- infoLoss(data = dat, data_swapped = dat_s,
#                   table_vars = c("nuts2", "national"),
#                   custom_metric = list(squareD = squareD))
# iloss$measures  # includes custom loss as well