Probabilistic Record Linkage Using Pretrained Text Embeddings
Test whether two strings match with an LLM prompt.
Compute the dot product between two vectors
Get pretrained text embeddings
Create matrix of embedding similarities
Create a training set
Hand Label A Dataset
Install a MISTRAL API KEY in Your .Renviron File for Repeated Use
Install an OPENAI API KEY in Your .Renviron File for Repeated Use
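The dot-product and similarity-matrix topics listed above rest on a simple idea: once each string is represented as a vector, the dot product of two unit-length vectors is their cosine similarity. The following is a minimal, language-agnostic sketch in Python, not the package's R implementation. The function names (`dot`, `normalize`, `similarity_matrix`) and the toy three-dimensional vectors are assumptions made up for illustration; real pretrained embeddings have hundreds or thousands of dimensions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def similarity_matrix(rows_a, rows_b):
    """Return a len(rows_a) x len(rows_b) matrix of cosine similarities."""
    unit_a = [normalize(a) for a in rows_a]
    unit_b = [normalize(b) for b in rows_b]
    return [[dot(a, b) for b in unit_b] for a in unit_a]


# Hypothetical toy "embeddings" (invented values, for illustration only).
embeddings_a = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
embeddings_b = [[3.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

for row in similarity_matrix(embeddings_a, embeddings_b):
    print(row)
```

Each row of the result corresponds to a record in the first dataset and each column to a record in the second, so the largest entry in a row points to the most similar candidate match.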
Links datasets through fuzzy string matching using pretrained text embeddings. Produces more accurate record linkage when lexical string distance metrics are a poor guide to match quality (e.g., "Patricia" is more lexically similar to "Patrick" than it is to "Trish"). Capable of performing multilingual record linkage. Methods are described in Ornstein (2025) <doi:10.1017/pan.2025.10016>.
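The "Patricia" example can be checked with a standard lexical metric. The sketch below computes Levenshtein (edit) distance in plain Python; it is an illustration of why lexical distance can mislead, not part of the package, and the helper name `levenshtein` is an assumption introduced here.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


name = "patricia"
for candidate in ("patrick", "trish"):
    print(candidate, levenshtein(name, candidate))
```

On lowercased input, "patrick" is 2 edits from "patricia" while "trish" is 5, so a purely lexical matcher would prefer the wrong name. This is the gap that embedding-based similarity is meant to close.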