Detect Text Reuse and Document Similarity
Local alignment of natural language texts
Convert candidates data frames to other formats
Filenames from paths
Hash a string to an integer
Locality sensitive hashing for minhash
Candidate pairs from LSH comparisons
Compare candidates identified by LSH
Probability that a candidate pair will be detected with LSH
Query an LSH cache for matches to a single document
List of all candidates in a corpus
Generate a minhash function
Candidate pairs from pairwise comparisons
Pairwise comparisons among documents in a corpus
Objects exported from other packages
Recompute the hashes for a document or corpus
Measure similarity/dissimilarity in documents
textreuse: Detect Text Reuse and Document Similarity
TextReuseCorpus
Accessors for TextReuse objects
TextReuseTextDocument
Recompute the tokens for a document or corpus
Split texts into tokens
Count words
Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
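The core ideas named in the description (shingled n-grams, a set-based similarity measure, and the banding estimate behind locality sensitive hashing) can be illustrated with a minimal, language-independent sketch. It is written in plain Python and does not use the package's own functions: the helper names `shingle_ngrams`, `jaccard`, and `lsh_detection_probability` are hypothetical, and the sample strings are made up for illustration. The probability formula is the standard textbook estimate for minhash banding, assumed here rather than taken from the package source.

```python
import re

def shingle_ngrams(text, n=3):
    """Return the set of overlapping word n-grams (shingles) in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(a | b)

def lsh_detection_probability(hashes, bands, similarity):
    """Textbook banding estimate: 1 - (1 - s^r)^b, where r = hashes / bands."""
    rows = hashes / bands
    return 1 - (1 - similarity ** rows) ** bands

# Hypothetical sample documents, tokenized into word trigrams.
a = shingle_ngrams("How does it feel to be on your own", n=3)
b = shingle_ngrams("How does it feel to be without a home", n=3)

# 4 shared trigrams out of 10 distinct trigrams gives 0.4.
print(jaccard(a, b))

# Chance that a pair with true similarity 0.5 lands in a shared bucket
# when 200 minhashes are split into 50 bands of 4 rows each.
print(lsh_detection_probability(hashes=200, bands=50, similarity=0.5))
```

Raising the number of bands for a fixed number of hashes lowers the similarity at which pairs start to be detected, at the cost of more candidate pairs to verify with a full comparison.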