h2o.word2vec function

Trains a word2vec model on a String column of an H2O data frame

h2o.word2vec(
  training_frame = NULL,
  model_id = NULL,
  min_word_freq = 5,
  word_model = c("SkipGram", "CBOW"),
  norm_model = c("HSM"),
  vec_size = 100,
  window_size = 5,
  sent_sample_rate = 0.001,
  init_learning_rate = 0.025,
  epochs = 5,
  pre_trained = NULL,
  max_runtime_secs = 0,
  export_checkpoints_dir = NULL
)

Arguments

  • training_frame: Id of the training data frame.
  • model_id: Destination id for this model; auto-generated if not specified.
  • min_word_freq: Minimum word frequency; words that appear fewer than this many times are discarded. Defaults to 5.
  • word_model: The word model to use. Must be one of: "SkipGram", "CBOW". Defaults to SkipGram.
  • norm_model: The normalization model; only Hierarchical Softmax is available. Must be one of: "HSM". Defaults to HSM.
  • vec_size: Size of the word vectors. Defaults to 100.
  • window_size: Maximum skip length between words. Defaults to 5.
  • sent_sample_rate: Threshold for the occurrence of words. Words that appear with higher frequency in the training data are randomly down-sampled; useful range is (0, 1e-5). Defaults to 0.001.
  • init_learning_rate: Starting learning rate. Defaults to 0.025.
  • epochs: Number of training iterations to run. Defaults to 5.
  • pre_trained: Id of a data frame that contains a pre-trained (external) word2vec model.
  • max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.
  • export_checkpoints_dir: Automatically export generated models to this directory.
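
The down-sampling controlled by sent_sample_rate can be illustrated with a little arithmetic. The sketch below uses the keep-probability rule from the original word2vec C tool; it is an assumption, not something this page states, that H2O applies the same rule. The helper name keep_probability is hypothetical and is not part of the h2o package.

```r
# Probability that one occurrence of a word is kept during training,
# following the rule in the original word2vec C tool (assumed, not
# confirmed, to match H2O's behaviour).
#   word_freq:        relative frequency of the word in the corpus (0..1)
#   sent_sample_rate: the down-sampling threshold
keep_probability <- function(word_freq, sent_sample_rate = 0.001) {
  p <- (sqrt(word_freq / sent_sample_rate) + 1) * sent_sample_rate / word_freq
  pmin(p, 1)  # cap at 1: words at or below the threshold are always kept
}

# A word making up 10% of all tokens is kept only about 11% of the time,
# while a word at the threshold frequency is always kept.
keep_probability(c(0.1, 0.01, 0.001))
```

Under this rule, lowering sent_sample_rate drops frequent words more aggressively, while rare words are left untouched.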

Examples

## Not run: 
library(h2o)
h2o.init()

# Import the CraigslistJobTitles dataset
job_titles <- h2o.importFile(
  "https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv",
  col.names = c("category", "jobtitle"),
  col.types = c("String", "String"),
  header = TRUE
)

# Build and train the Word2Vec model
words <- h2o.tokenize(job_titles, " ")
vec <- h2o.word2vec(training_frame = words)
h2o.findSynonyms(vec, "teacher", count = 20)

## End(Not run)
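
As a further sketch, the same dataset can be used to train with non-default arguments and then turn each job title into a single vector. The argument values below are illustrative assumptions rather than tuned recommendations, and the example needs a running H2O cluster. h2o.transform_word2vec comes from the same package but is not documented on this page.

```r
library(h2o)
h2o.init()

# Same public dataset as in the example above
job_titles <- h2o.importFile(
  "https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv",
  col.names = c("category", "jobtitle"),
  col.types = c("String", "String"),
  header = TRUE
)
words <- h2o.tokenize(job_titles, " ")

# Train a CBOW model with smaller vectors and a narrower window
# (illustrative values only)
vec_cbow <- h2o.word2vec(
  training_frame = words,
  word_model = "CBOW",
  vec_size = 50,
  window_size = 3,
  min_word_freq = 2,
  epochs = 10
)

# Average the word vectors of each tokenized title into one row per title
title_vecs <- h2o.transform_word2vec(vec_cbow, words, aggregate_method = "AVERAGE")
dim(title_vecs)
```

With aggregate_method = "AVERAGE" the result has vec_size columns, so it can be column-bound to other features for a downstream model.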
  • Maintainer: Tomas Fryda
  • License: Apache License (== 2.0)
  • Last published: 2024-01-11