sbo_predictions function

Stupid Back-off text predictions

Train a text predictor via Stupid Back-off

sbo_predictor(object, ...)

predictor(object, ...)

## S3 method for class 'character'
sbo_predictor(
  object,
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  ...
)

## S3 method for class 'sbo_kgram_freqs'
sbo_predictor(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

## S3 method for class 'sbo_predtable'
sbo_predictor(object, ...)

sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

## S3 method for class 'character'
sbo_predtable(
  object,
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  ...
)

## S3 method for class 'sbo_kgram_freqs'
sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

Arguments

  • object: either a character vector or an object inheriting from classes sbo_kgram_freqs or sbo_predtable. Defines the method to use for training.
  • ...: further arguments passed to or from other methods.
  • N: a length one integer. Order 'N' of the N-gram model.
  • dict: a sbo_dictionary, a character vector or a formula. For more details see kgram_freqs.
  • .preprocess: a function for corpus preprocessing. For more details see kgram_freqs.
  • EOS: a length one character vector. String listing End-Of-Sentence characters. For more details see kgram_freqs.
  • lambda: a length one numeric. Penalization in the Stupid Back-off algorithm.
  • L: a length one integer. Maximum number of next-word predictions for a given input (top scoring predictions are retained).
  • filtered: a character vector. Words to exclude from next-word predictions. The strings '<UNK>' and '<EOS>' are reserved keywords referring to the Unknown-Word and End-Of-Sentence tokens, respectively.
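To make the role of lambda concrete, here is a minimal base R sketch of Stupid Back-Off scoring as the algorithm is commonly described: a candidate word is scored by its relative k-gram frequency when the full k-gram was observed, and otherwise by lambda times its score under a context shortened by one word. The counts vector and the sbo_score() helper are hypothetical illustrations, not part of the sbo package, and the package's internal implementation may differ.

```r
# Toy k-gram counts (hypothetical data, not taken from the sbo package).
counts <- c(
  "i love you" = 2, "i love" = 4, "love you" = 3, "love it" = 1,
  "i" = 5, "love" = 6, "you" = 4, "it" = 2
)
total_unigrams <- 17  # 5 + 6 + 4 + 2

# Stupid Back-Off score of `word` given the character vector `context`.
# Hypothetical helper written for illustration only.
sbo_score <- function(word, context, lambda = 0.4) {
  if (length(context) == 0) {
    # Base case: unigram relative frequency (0 for unseen words).
    n <- counts[word]
    return(if (is.na(n)) 0 else unname(n) / total_unigrams)
  }
  full <- paste(c(context, word), collapse = " ")
  pref <- paste(context, collapse = " ")
  if (!is.na(counts[full])) {
    # The full k-gram was observed: use its relative frequency.
    return(unname(counts[full] / counts[pref]))
  }
  # Otherwise drop the first context word and penalize by lambda.
  lambda * sbo_score(word, context[-1], lambda)
}

seen   <- sbo_score("you", c("i", "love"))  # "i love you" was observed
backed <- sbo_score("it",  c("i", "love"))  # backs off to "love it"
```

In this sketch the L argument would correspond to keeping only the L highest-scoring words for each context, and filtered to dropping the listed words before ranking.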

Returns

A sbo_predictor object for sbo_predictor(), a sbo_predtable object for sbo_predtable().

Details

These functions are generics used to train a text predictor with Stupid Back-Off. The functions predictor() and predtable() are aliases for sbo_predictor() and sbo_predtable(), respectively.

The sbo_predictor data structure carries all information required for prediction in a compact and efficient (upon retrieval) way, by directly storing the top L next-word predictions for each k-gram prefix observed in the training corpus.
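The idea of storing the top L predictions per prefix can be sketched in base R as a plain lookup table. This is only an illustration under stated assumptions: pred_table, unigram_fallback and lookup() are hypothetical names, the tokenization is deliberately naive, and the real sbo_predictor object is a C++ structure rather than an R list.

```r
# Hypothetical prediction table for N = 3 and L = 3:
# each observed prefix maps to its top-L next words, best first.
pred_table <- list(
  "i love" = c("you", "it", "this"),
  "love"   = c("you", "it", "the")
)
# Predictions used when no stored prefix matches (hypothetical).
unigram_fallback <- c("the", "to", "i")

# Look up predictions for the last (N - 1) words of `input`,
# dropping the leading context word until a stored prefix is found.
lookup <- function(input, N = 3) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  context <- utils::tail(words[nzchar(words)], N - 1)
  while (length(context) > 0) {
    key <- paste(context, collapse = " ")
    if (key %in% names(pred_table)) return(pred_table[[key]])
    context <- context[-1]
  }
  unigram_fallback
}

full_match <- lookup("I love")       # matches the prefix "i love"
short_match <- lookup("we love")     # falls back to the prefix "love"
no_match <- lookup("hello world")    # no stored prefix matches
```

Because the ranking is done once at training time, prediction in this scheme reduces to a handful of table lookups, which is what makes retrieval cheap.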

The sbo_predictor objects are meant for interactive use. If training is computationally heavy, one can store a "raw" version of the text predictor in a sbo_predtable object, which can be safely saved to disk (e.g. with save()). The saved object can be restored in another R session, and the corresponding sbo_predictor object can then be rebuilt quickly by calling the generic constructor sbo_predictor() on it (see the Examples section).
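The save-and-restore workflow might look like the following sketch. It assumes the sbo package is installed and uses the twitter_predtable dataset that appears in the Examples section; it swaps in base R's saveRDS() and readRDS() in place of save() and load(), which is a choice made here for illustration rather than something the package prescribes. The block is skipped entirely when sbo is not available.

```r
# Sketch only: requires the 'sbo' package; does nothing if it is missing.
if (requireNamespace("sbo", quietly = TRUE)) {
  library(sbo)

  # Write the "raw" prediction table to a temporary file.
  path <- tempfile(fileext = ".rds")
  saveRDS(twitter_predtable, path)

  # Later, possibly in another R session: restore the table and
  # rebuild the predictor without retraining.
  restored <- readRDS(path)
  p <- sbo_predictor(restored)
  print(predict(p, "i love"))
}
```

Unlike load(), readRDS() returns the object instead of restoring it under its original name, so the restored table can be bound to any variable.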

The returned objects are of class sbo_predictor and sbo_predtable, respectively. The latter contains Stupid Back-Off prediction tables, storing next-word predictions for each k-gram prefix observed in the text, whereas the former is an external pointer to an equivalent (but processed) C++ structure.

Both objects have the following attributes:

  • N: The order of the underlying N-gram model, "N".
  • dict: The model dictionary.
  • lambda: The penalization used in the Stupid Back-Off algorithm.
  • L: The maximum number of next-word predictions for a given text input.
  • .preprocess: The function used for text preprocessing.
  • EOS: A length one character vector listing all (single character) end-of-sentence tokens.

Examples

# Train a text predictor directly from corpus
p <- sbo_predictor(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")

# Train a text predictor from previously computed 'kgram_freqs' object
p <- sbo_predictor(twitter_freqs)

# Load a text predictor from a Stupid Back-Off prediction table
p <- sbo_predictor(twitter_predtable)

# Predict from Stupid Back-Off text predictor
p <- sbo_predictor(twitter_predtable)
predict(p, "i love")

# Build Stupid Back-Off prediction tables directly from corpus
t <- sbo_predtable(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")

# Build Stupid Back-Off prediction tables from kgram_freqs object
t <- sbo_predtable(twitter_freqs)

## Not run:
# Save and reload a 'sbo_predtable' object with base::save()
save(t, file = "t.rda")
load("t.rda")
## End(Not run)

See Also

predict.sbo_predictor

Author(s)

Valerio Gherardi