Record Linkage Toolkit
Add a variable from one of the data sets to pairs
Call a function on each of the worker nodes and pass it the pairs
Collect pairs from cluster nodes
Call a function on each of the worker nodes to modify the pairs on the...
Generate pairs with simple blocking using multiple processes
Generate pairs with a minimal similarity using multiple processes
Generate all possible pairs using multiple processes
Comparison functions
Compare pairs on a set of variables common in both data sets
Compare pairs on given variables
Deduplication using equivalence groups
Get a subset of pairs to inspect
Greedy one-to-one matching of pairs
Use the selected pairs to generate a linked data set
Tiny example dataset for probabilistic linkage
Force n to m matching on a set of pairs
Merge two sets of pairs into one
Generate pairs using simple blocking
Generate pairs with a minimal similarity
Generate all possible pairs
Calculate weights and probabilities for pairs
Calculate EM-estimates of m- and u-probabilities
Score pairs based on a number of comparison vectors
Select matching pairs enforcing one-to-one linkage
Select matching pairs with a score above or equal to a threshold
Deselect pairs that are linked to multiple records
Summarise the results from problink_em
Create a table of comparison patterns
Spelling variations of a set of town names
Return default value if value is missing or NULL
Functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, estimating m- and u-probabilities with the EM algorithm (I. Fellegi & A. Sunter (1969) <doi:10.1080/01621459.1969.10501049>; T.N. Herzog, F.J. Scheuren, & W.E. Winkler (2007), "Data Quality and Record Linkage Techniques", ISBN:978-0-387-69502-0), and forcing one-to-one matching. Can also be used for pre- and post-processing for machine learning methods for record linkage. The focus is on memory efficiency, CPU performance, and flexibility.
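The steps named in the description (blocking, comparison, EM estimation of m- and u-probabilities, and scoring) can be sketched in a few lines. The following is a minimal, language-agnostic illustration written in Python. It is not the package's implementation or API: every function name here (block_pairs, compare, em_mu, pattern_weight) is hypothetical, comparisons are assumed to be simple exact agreement (0 or 1), the variables are assumed conditionally independent as in the basic Fellegi-Sunter model, and the pattern counts are made-up toy data.

```python
from math import log

EPS = 1e-6  # keeps probabilities away from 0 and 1 so ratios and logs stay finite


def clamp(value):
    return min(max(value, EPS), 1.0 - EPS)


def block_pairs(x, y, key):
    """Simple blocking: keep only pairs (i, j) whose records agree on `key`."""
    index = {}
    for j, rec in enumerate(y):
        index.setdefault(rec[key], []).append(j)
    return [(i, j) for i, rec in enumerate(x) for j in index.get(rec[key], [])]


def compare(x, y, pairs, variables):
    """Build one 0/1 comparison vector per pair (1 = exact agreement)."""
    return [tuple(int(x[i][v] == y[j][v]) for v in variables) for i, j in pairs]


def likelihood(gamma, probs):
    """Probability of a comparison vector given per-variable agreement probabilities."""
    result = 1.0
    for agree, prob in zip(gamma, probs):
        result *= prob if agree else 1.0 - prob
    return result


def em_mu(patterns, n_iter=200, m0=0.9, u0=0.1, p0=0.1):
    """Estimate m, u and the match proportion p with a basic EM loop."""
    k = len(patterns[0])
    m, u, p = [m0] * k, [u0] * k, p0
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for gamma in patterns:
            pm = p * likelihood(gamma, m)
            pu = (1.0 - p) * likelihood(gamma, u)
            g.append(pm / (pm + pu))
        # M-step: re-estimate the parameters from the weighted agreement counts.
        total_m = sum(g)
        total_u = len(g) - total_m
        m = [clamp(sum(gi * gam[v] for gi, gam in zip(g, patterns)) / total_m)
             for v in range(k)]
        u = [clamp(sum((1.0 - gi) * gam[v] for gi, gam in zip(g, patterns)) / total_u)
             for v in range(k)]
        p = clamp(total_m / len(g))
    return m, u, p


def pattern_weight(gamma, m, u):
    """Log-likelihood-ratio weight of a comparison vector."""
    return sum(log(mk / uk) if agree else log((1.0 - mk) / (1.0 - uk))
               for agree, mk, uk in zip(gamma, m, u))


# Tiny made-up records to show blocking and comparison.
x = [{"last": "smith", "first": "anna", "town": "delft"},
     {"last": "jones", "first": "bob", "town": "leiden"}]
y = [{"last": "smith", "first": "anna", "town": "delft"},
     {"last": "smith", "first": "ann", "town": "delft"},
     {"last": "jones", "first": "rob", "town": "leiden"}]
pairs = block_pairs(x, y, "town")
vectors = compare(x, y, pairs, ["last", "first"])

# Made-up comparison patterns on three variables, used only to exercise the EM loop.
patterns = ([(1, 1, 1)] * 15 + [(1, 1, 0)] * 3 + [(1, 0, 1)] * 2 +
            [(0, 0, 0)] * 60 + [(1, 0, 0)] * 10 + [(0, 1, 0)] * 6 + [(0, 0, 1)] * 4)
m, u, p = em_mu(patterns)
```

Pairs whose weight exceeds a chosen threshold would then be selected as links, optionally followed by a one-to-one matching step. In this sketch the starting values place m above u, which keeps the "match" class from swapping labels with the "non-match" class during estimation.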