Record Linkage Toolkit
Add a variable from one of the data sets to pairs
Call a function on each of the worker nodes and pass it the pairs
Collect pairs from cluster nodes
Call a function on each of the worker nodes to modify the pairs on the...
Generate all possible pairs using multiple processes
Generate pairs using simple blocking using multiple processes
Generate pairs with a minimal similarity using multiple processes
Comparison functions
Compare pairs on a set of variables common to both data sets
Compare pairs on given variables
Deduplication using equivalence groups
Get a subset of pairs to inspect
Greedy one-to-one matching of pairs
Use the selected pairs to generate a linked data set
Tiny example dataset for probabilistic linkage
Force n to m matching on a set of pairs
Merge two sets of pairs into one
Generate all possible pairs
Generate pairs using simple blocking
Generate pairs with a minimal similarity
Calculate weights and probabilities for pairs
Calculate EM-estimates of m- and u-probabilities
Score pairs based on a number of comparison vectors
Select matching pairs enforcing one-to-one linkage
Select matching pairs with a score above or equal to a threshold
Deselect pairs that are linked to multiple records
Summarise the results from problink_em
Create a table of comparison patterns
Spelling variations of a set of town names
Functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, estimating m- and u-probabilities with the EM algorithm (I. Fellegi & A. Sunter (1969) <doi:10.1080/01621459.1969.10501049>; T.N. Herzog, F.J. Scheuren, & W.E. Winkler (2007), "Data Quality and Record Linkage Techniques", ISBN:978-0-387-69502-0), and forcing one-to-one matching. Can also be used for pre- and post-processing for machine learning methods for record linkage. The focus is on memory use, CPU performance and flexibility.
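The steps named in the description (generating pairs, comparing records, scoring, and forcing one-to-one matching) can be illustrated with a small language-independent sketch. The Python below is not this package's API: the function names (`block_pairs`, `compare`, `weight`, `greedy_one_to_one`), the example records, and the m- and u-probabilities are all hypothetical, chosen only to show the idea. In practice the m- and u-probabilities would be estimated, for example with the EM algorithm, rather than fixed by hand.

```python
from itertools import product
from math import log2

# Two tiny, made-up data sets (hypothetical records, for illustration only).
x = [
    {"id": 1, "lastname": "smith", "postcode": "1234"},
    {"id": 2, "lastname": "jones", "postcode": "1234"},
    {"id": 3, "lastname": "brown", "postcode": "5678"},
]
y = [
    {"id": 10, "lastname": "smith", "postcode": "1234"},
    {"id": 11, "lastname": "jonse", "postcode": "1234"},
    {"id": 12, "lastname": "brown", "postcode": "5678"},
]


def block_pairs(x, y, key):
    """Simple blocking: keep only pairs that agree exactly on `key`."""
    return [(a, b) for a, b in product(x, y) if a[key] == b[key]]


def compare(a, b, variables):
    """Comparison vector: True where the two records agree on a variable."""
    return {v: a[v] == b[v] for v in variables}


def weight(pattern, m, u):
    """Fellegi-Sunter log2 weight for one comparison vector.

    m[v] is the assumed probability of agreement on v among true matches,
    u[v] the assumed probability of agreement among non-matches.
    """
    total = 0.0
    for v, agrees in pattern.items():
        if agrees:
            total += log2(m[v] / u[v])
        else:
            total += log2((1 - m[v]) / (1 - u[v]))
    return total


def greedy_one_to_one(scored, threshold):
    """Greedy selection: highest score first, each record used at most once."""
    used_x, used_y, selected = set(), set(), []
    for score, a, b in sorted(scored, key=lambda t: -t[0]):
        if score < threshold:
            break  # remaining pairs all score below the threshold
        if a["id"] in used_x or b["id"] in used_y:
            continue  # one of the two records is already linked
        used_x.add(a["id"])
        used_y.add(b["id"])
        selected.append((a["id"], b["id"]))
    return selected


# Assumed (not estimated) probabilities for the single comparison variable.
m = {"lastname": 0.9}
u = {"lastname": 0.1}

pairs = block_pairs(x, y, "postcode")
scored = [(weight(compare(a, b, ["lastname"]), m, u), a, b) for a, b in pairs]
links = greedy_one_to_one(scored, threshold=0.0)
print(len(pairs), links)
```

Blocking on `postcode` reduces the nine possible pairs to five. Because the sketch compares names by exact equality, the pair "jones"/"jonse" receives a negative weight and is not selected; a similarity-based comparison would be needed to catch such spelling variations.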