y: second data.frame. Ignored when deduplication = TRUE.
on: the variables on which the pairs of records from x
and y are compared.
minsim: minimum similarity score a pair must have to be kept.
on_blocking: variables for which the pairs have to match.
comparators: named list of functions with which the variables are compared. Each function should accept two vectors and return either a vector or a data.table with multiple columns.
default_comparator: variables for which no comparison function is defined in comparators are compared with the function default_comparator.
keep_simsum: add a variable minsim to the result with the similarity score of the pair.
deduplication: generate pairs from x only and ignore y. This is useful for deduplication of x.
add_xy: add x and y as attributes to the returned pairs. This makes calling subsequent operations that need x and y (such as compare_pairs) easier.
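As an illustration of the kind of function the comparators argument describes (two vectors in, one vector out), here is a minimal sketch in base R. The name cmp_same_initial is hypothetical and not part of the package; the commented usage line at the end is an assumption based on the "named list of functions" description above.

```r
# Hypothetical comparator: TRUE when both values are present and start with
# the same letter (ignoring case); missing values never count as a match.
cmp_same_initial <- function(x, y) {
  ix <- toupper(substr(x, 1, 1))
  iy <- toupper(substr(y, 1, 1))
  !is.na(x) & !is.na(y) & ix == iy
}

# Call it directly on two vectors of equal length
res <- cmp_same_initial(c("Smith", "jones", NA), c("Smit", "Brown", "Lee"))
print(res)

# Assumed usage, following the argument description:
# comparators = list(lastname = cmp_same_initial)
```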
Returns
A data.table with two columns, .x and .y, is returned. Columns .x and .y are row numbers from data.frames x and y respectively.
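Because .x and .y are row numbers, they can be used to look up the original records. The sketch below uses plain data.frames and a hand-made pairs table; the toy data and the names x_id and y_id are made up for illustration.

```r
# Toy data sets (hypothetical)
x <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)
y <- data.frame(id = c("p", "q"), stringsAsFactors = FALSE)

# A hand-made pairs table with the same .x/.y layout as described above
pairs <- data.frame(.x = c(1L, 3L), .y = c(2L, 2L))

# Use the row numbers to fetch the matching records from x and y
linked <- data.frame(
  x_id = x$id[pairs$.x],
  y_id = y$id[pairs$.y],
  stringsAsFactors = FALSE
)
print(linked)
```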
Details
Generating (all) pairs of the records of two data sets is usually the first step when linking the two data sets. However, this often results in too many pairs. pair_minsim keeps only pairs with a similarity score equal to or larger than minsim. The similarity score is calculated by summing the results of the comparators for all variables of on.
Missing values in the variables on which the pairs are compared count as a similarity of 0.
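The scoring rule can be sketched in base R as follows. This is an illustration of the idea only, not the package's implementation: it assumes an exact-equality comparison for every variable, uses made-up toy data, and stores the score in a column named simsum purely for this example.

```r
# Toy data sets (hypothetical)
x <- data.frame(
  postcode = c("6789 XY", "1234 AB", NA),
  address  = c("1 Main St", "2 High St", "3 Low St"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  postcode = c("6789 XY", "1234 AB"),
  address  = c("1 Main St", "9 Other St"),
  stringsAsFactors = FALSE
)

on <- c("postcode", "address")
minsim <- 1

# All combinations of row numbers from x and y
pairs <- expand.grid(.x = seq_len(nrow(x)), .y = seq_len(nrow(y)))

# Sum the per-variable comparison results; a missing value counts as 0
simsum <- numeric(nrow(pairs))
for (v in on) {
  same <- x[[v]][pairs$.x] == y[[v]][pairs$.y]
  same[is.na(same)] <- FALSE
  simsum <- simsum + same
}
pairs$simsum <- simsum

# Keep only pairs whose score reaches the threshold
kept <- pairs[pairs$simsum >= minsim, ]
print(kept)
```

With these toy inputs, the record in x with a missing postcode can only score on address, so it is dropped unless its address matches.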
Examples
data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2,
  on = c("postcode", "address"), minsim = 1)
# Either address or postcode has to match to keep a pair

data("linkexample1", "linkexample2")
pairs <- pair_minsim(linkexample1, linkexample2, on_blocking = "postcode",
  on = c("lastname", "firstname", "address"), minsim = 2)
# Postcode has to match; of lastname, firstname and address, at least
# two have to match (i.e. one mismatch is allowed).
See Also
pair and pair_blocking are other methods to generate pairs.