clusterMatch function

Creates properly sized clusters for matching, using either alphabetical or word embedding clustering. With word embedding, the function first builds a word embedding matrix from the provided vectors and runs PCA on it. It then keeps the first k dimensions (where k is provided by the user) and runs k-means on the reduced matrix to obtain the clusters.
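The embedding, PCA, and k-means steps described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper bigram_counts() is hypothetical and stands in for the embedding step with simple character-bigram counts, and k is derived here from a min.var-style threshold on explained variance, which is only one reading of the description.

```r
# Hypothetical stand-in for the embedding step (an assumption, not fastLink code):
# one row per string, one column per character bigram, entries are counts.
bigram_counts <- function(x) {
  x <- tolower(x)
  grams <- lapply(x, function(s) {
    n <- nchar(s)
    if (n < 2) return(character(0))
    substring(s, 1:(n - 1), 2:n)
  })
  vocab <- sort(unique(unlist(grams)))
  m <- matrix(0, nrow = length(x), ncol = length(vocab),
              dimnames = list(NULL, vocab))
  for (i in seq_along(grams)) {
    if (length(grams[[i]]) > 0) {
      tab <- table(grams[[i]])
      m[i, names(tab)] <- as.numeric(tab)
    }
  }
  m
}

vecA <- c("john", "jon", "johnny", "mary", "maria", "marie")
vecB <- c("jonathan", "mari", "johan", "marya")

# 1. Embed both vectors together so they share one feature space.
emb <- bigram_counts(c(vecA, vecB))

# 2. Run PCA on the embedding matrix.
pca <- prcomp(emb)
var_share <- pca$sdev^2 / sum(pca$sdev^2)

# 3. Keep the leading dimensions that each explain at least 20% of the
#    variance (mirrors the min.var default), with at least one dimension kept.
k <- max(1, sum(var_share >= 0.20))
scores <- pca$x[, seq_len(k), drop = FALSE]

# 4. Run k-means on the reduced matrix and split the labels back out.
set.seed(1)
km <- kmeans(scores, centers = 2, iter.max = 100)
clusterA <- km$cluster[seq_along(vecA)]
clusterB <- km$cluster[length(vecA) + seq_along(vecB)]
```

Embedding both vectors jointly matters in this sketch: records from dataset A and dataset B can only be compared within a cluster if both sets of cluster labels come from the same k-means fit.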

clusterMatch(vecA, vecB, nclusters, max.n, word.embed, min.var, iter.max)

Arguments

  • vecA: The character vector from dataset A
  • vecB: The character vector from dataset B
  • nclusters: The number of clusters to create from the provided data. Either nclusters = NULL or max.n = NULL.
  • max.n: The maximum size of either dataset A or dataset B in the largest cluster. Either nclusters = NULL or max.n = NULL.
  • word.embed: Whether to use word embedding clustering. Default is FALSE.
  • min.var: The minimum amount of explained variance (maximum = 1) a PCA dimension can provide in order to be included in k-means clustering when using word embedding. Default is .20.
  • iter.max: Maximum number of iterations for the k-means algorithm.
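To illustrate how nclusters and max.n relate under alphabetical clustering, the following base R sketch may help. Both helpers, alpha_clusters() and pick_nclusters(), are hypothetical names introduced for illustration, and the search strategy (increase the cluster count until the largest cluster fits within max.n) is an assumption rather than the package's documented behavior.

```r
# Hypothetical helper: split the sorted unique values of both vectors into
# `nclusters` contiguous alphabetical blocks and label each record by block.
# Assumes neither vector contains NA values.
alpha_clusters <- function(vecA, vecB, nclusters) {
  vals <- sort(unique(c(vecA, vecB)))
  grp <- ceiling(seq_along(vals) * nclusters / length(vals))
  names(grp) <- vals
  list(clusterA = unname(grp[vecA]),
       clusterB = unname(grp[vecB]),
       n.clusters = nclusters)
}

# Hypothetical helper: find the smallest cluster count for which no cluster
# holds more than `max.n` records from either dataset.
pick_nclusters <- function(vecA, vecB, max.n) {
  n_unique <- length(unique(c(vecA, vecB)))
  for (k in seq_len(n_unique)) {
    cl <- alpha_clusters(vecA, vecB, k)
    if (max(table(cl$clusterA), table(cl$clusterB)) <= max.n) return(k)
  }
  n_unique
}

vecA <- c("adam", "beth", "carl", "dana", "adam", "eve")
vecB <- c("beth", "carl", "zoe", "adam")

# Fix the number of clusters directly ...
cl_fixed <- alpha_clusters(vecA, vecB, nclusters = 2)

# ... or derive it from a cap on the largest cluster.
k_capped <- pick_nclusters(vecA, vecB, max.n = 3)
cl_capped <- alpha_clusters(vecA, vecB, nclusters = k_capped)
```

The two arguments express the same choice from opposite directions, which is why only one of them is supplied at a time: nclusters fixes the partition size up front, while max.n bounds the work done inside each cluster.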

Returns

clusterMatch returns a list with the following elements:

  • clusterA: The cluster assignments for dataset A
  • clusterB: The cluster assignments for dataset B
  • n.clusters: The number of clusters created
  • kmeans: The k-means object output.
  • pca: The PCA object output.
  • dims.pca: The number of dimensions from PCA used for the k-means clustering.

Examples

data(samplematch)
cl <- clusterMatch(dfA$firstname, dfB$firstname, nclusters = 3)

Author(s)

Ben Fifield benfifield@gmail.com

  • Maintainer: Ted Enamorado
  • License: GPL (>= 3)
  • Last published: 2023-11-17
