Candidate pairs from LSH comparisons
Given a data frame of LSH buckets returned from lsh, this function returns the candidate pairs: documents that share at least one bucket and are therefore potential matches.
lsh_candidates(buckets)
buckets: A data frame returned from lsh.
A data frame of candidate pairs.
dir <- system.file("extdata/legal", package = "textreuse")
minhash <- minhash_generator(200, seed = 234)
corpus <- TextReuseCorpus(dir = dir,
                          tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash)
buckets <- lsh(corpus, bands = 50)
lsh_candidates(buckets)
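To illustrate what "candidate pairs" means, the following base R sketch derives pairs from a toy bucket table by joining the table to itself on the bucket value. It is only an illustration of the idea, not the package's implementation: the toy data, the column names `doc` and `buckets`, and the output column names `a` and `b` are assumptions made for this example.

```r
# Toy bucket table: one row per (document, bucket) combination.
# The column names `doc` and `buckets` are assumptions for this sketch.
buckets <- data.frame(
  doc     = c("a", "b", "c", "a", "c"),
  buckets = c("h1", "h1", "h2", "h3", "h3"),
  stringsAsFactors = FALSE
)

# Self-join on the bucket value so that every pair of documents
# falling into the same bucket ends up on one row.
joined <- merge(buckets, buckets, by = "buckets", suffixes = c("_a", "_b"))

# Keep each unordered pair once (doc_a < doc_b also drops self-matches),
# then remove pairs that were found in more than one bucket.
pairs <- joined[joined$doc_a < joined$doc_b, c("doc_a", "doc_b")]
candidates <- unique(pairs)
rownames(candidates) <- NULL
names(candidates) <- c("a", "b")

candidates
```

In this toy table, documents a and b share bucket h1 and documents a and c share bucket h3, so those two pairs are the candidates; b and c never share a bucket. Candidates are only potential matches, so they would normally be checked afterwards with an actual similarity measure.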