Various Blocking Methods for Entity Resolution
Block records based on character vectors
Controls for the Annoy algorithm
Controls for the HNSW algorithm
Controls for the k-d tree algorithm
Controls for the LSH algorithm
Controls for the NND algorithm
Controls for approximate nearest neighbours algorithms
Controls for processing character data
Estimate errors due to blocking in record linkage
An internal function to use the Annoy algorithm via the RcppAnnoy package.
An internal function to use the HNSW algorithm via the RcppHNSW package.
An internal function to use the LSH and KD-tree algorithm via the mlpa...
An internal function to use the NN descent algorithm via the rnndescen...
Integration with the reclin2 package
Sentence to vector
The 'blocking' package provides blocking methods for record linkage and deduplication based on approximate nearest neighbour (ANN) algorithms and graph techniques. It supports multiple ANN implementations through the 'rnndescent', 'RcppHNSW', 'RcppAnnoy', and 'mlpack' packages, and integrates with the 'reclin2' package. The package generates shingles from character strings, as well as similarity vectors for record comparison, and includes evaluation metrics for blocking performance, including estimates of the false positive rate (FPR) and false negative rate (FNR). For details see: Papadakis et al. (2020) <doi:10.1145/3377455>, Steorts et al. (2014) <doi:10.1007/978-3-319-11257-2_20>, Dasylva and Goussanou (2021) <https://www150.statcan.gc.ca/n1/en/catalogue/12-001-X202100200002>, Dasylva and Goussanou (2022) <doi:10.1007/s42081-022-00153-3>.
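The ideas in the description (character shingles, similarity-based candidate pairs, and FPR/FNR as blocking error rates) can be illustrated with a minimal sketch. This is not the package's API: it is written in Python rather than R, every function name (`shingles`, `jaccard`, `block_pairs`, `blocking_errors`) is hypothetical, Jaccard similarity is an assumed stand-in for whatever distance the package uses, and an exhaustive pairwise comparison replaces the ANN search that the package delegates to 'rnndescent', 'RcppHNSW', 'RcppAnnoy', or 'mlpack'.

```python
from itertools import combinations


def shingles(text, n=2):
    """Return the set of character n-grams of a lower-cased, space-free string."""
    s = text.lower().replace(" ", "")
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed here; not the package's metric)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def block_pairs(records, n=2, threshold=0.5):
    """Brute-force stand-in for ANN search: keep pairs whose shingle sets are similar."""
    sh = [shingles(r, n) for r in records]
    return {
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(sh[i], sh[j]) >= threshold
    }


def blocking_errors(candidate_pairs, true_pairs, n_records):
    """Pair-level FPR and FNR of a blocking result against known true matches."""
    total = n_records * (n_records - 1) // 2
    tp = len(candidate_pairs & true_pairs)
    fp = len(candidate_pairs - true_pairs)
    fn = len(true_pairs - candidate_pairs)
    tn = total - tp - fp - fn
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


if __name__ == "__main__":
    records = ["Kowalski", "Kowalsky", "Nowak"]
    pairs = block_pairs(records)
    print(pairs)
    print(blocking_errors(pairs, {(0, 1)}, len(records)))
```

Comparing every pair is quadratic in the number of records, which is the cost that ANN-based blocking is meant to avoid; the sketch only shows what is being approximated and how the error rates are defined at the pair level.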