sbo_predictor(object, ...)

predictor(object, ...)

## S3 method for class 'character'
sbo_predictor(
  object,
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  ...
)

## S3 method for class 'sbo_kgram_freqs'
sbo_predictor(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

## S3 method for class 'sbo_predtable'
sbo_predictor(object, ...)

sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)

## S3 method for class 'character'
sbo_predtable(
  object,
  lambda = 0.4,
  L = 3L,
  filtered = "<UNK>",
  N,
  dict,
  .preprocess = identity,
  EOS = "",
  ...
)

## S3 method for class 'sbo_kgram_freqs'
sbo_predtable(object, lambda = 0.4, L = 3L, filtered = "<UNK>", ...)
Arguments
object: either a character vector or an object inheriting from classes sbo_kgram_freqs or sbo_predtable. Defines the method to use for training.
...: further arguments passed to or from other methods.
N: a length one integer. Order 'N' of the N-gram model.
dict: a sbo_dictionary, a character vector or a formula. For more details see kgram_freqs.
.preprocess: a function for corpus preprocessing. For more details see kgram_freqs.
EOS: a length one character vector. String listing End-Of-Sentence characters. For more details see kgram_freqs.
lambda: a length one numeric. Penalization in the Stupid Back-off algorithm.
L: a length one integer. Maximum number of next-word predictions for a given input (top scoring predictions are retained).
filtered: a character vector. Words to exclude from next-word predictions. The strings '<UNK>' and '<EOS>' are reserved keywords referring to the Unknown-Word and End-Of-Sentence tokens, respectively.
Returns
A sbo_predictor object for sbo_predictor(), a sbo_predtable object for sbo_predtable().
Details
These functions are generics used to train a text predictor with Stupid Back-Off. The functions predictor() and predtable() are aliases for sbo_predictor() and sbo_predtable(), respectively.
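For orientation, the Stupid Back-Off score is usually stated as below. The notation is an assumption introduced here, not taken from this page: f(.) denotes a k-gram count in the training corpus, w_a^b the word sequence from position a to b, and \lambda the penalization set by the lambda argument. How the package applies the score internally is not documented in this section.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})}
    & \text{if } f(w_{i-k+1}^{i}) > 0, \\[1ex]
  \lambda \, S(w_i \mid w_{i-k+2}^{i-1})
    & \text{otherwise,}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{\sum_{w} f(w)}
```

Each back-off step to a shorter prefix multiplies the score by \lambda, so with the default lambda = 0.4 a candidate found only after two back-off steps is weighted by 0.16.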
The sbo_predictor data structure carries all information required for prediction in a compact and efficient (upon retrieval) way, by directly storing the top L next-word predictions for each k-gram prefix observed in the training corpus.
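To make the idea of "top L next-word predictions for a k-gram prefix" concrete, here is a minimal base-R sketch on a toy count table. It is not the package's implementation (which stores precomputed tables in a C++ structure): the data in counts and the helpers sbo_score() and top_predictions() are hypothetical names introduced only for illustration, and the filtered argument is not modelled.

```r
# Toy k-gram counts (hypothetical data, not produced by sbo).
# Names are space-separated k-grams, values are observed frequencies.
counts <- c(
  "i" = 4, "love" = 3, "you" = 3, "cats" = 2,
  "i love" = 3, "love you" = 2, "love cats" = 1,
  "i love you" = 2, "i love cats" = 1
)

# Stupid Back-Off score of `word` following the character vector `prefix`.
# Hypothetical helper: backs off to shorter prefixes, multiplying by
# `lambda` at each step, until the k-gram has been observed.
sbo_score <- function(word, prefix, counts, lambda = 0.4) {
  penalty <- 1
  repeat {
    if (length(prefix) == 0) {
      # Base case: unigram relative frequency.
      unigrams <- counts[!grepl(" ", names(counts), fixed = TRUE)]
      f <- if (word %in% names(unigrams)) unigrams[[word]] else 0
      return(penalty * f / sum(unigrams))
    }
    ctx <- paste(prefix, collapse = " ")
    kgram <- paste(ctx, word)
    if (kgram %in% names(counts)) {
      return(penalty * counts[[kgram]] / counts[[ctx]])
    }
    # Unseen k-gram: drop the oldest word and apply the penalty.
    prefix <- prefix[-1]
    penalty <- penalty * lambda
  }
}

# Score every dictionary word and keep the L best (hypothetical helper).
top_predictions <- function(prefix, counts, L = 3L, lambda = 0.4) {
  vocab <- names(counts)[!grepl(" ", names(counts), fixed = TRUE)]
  scores <- vapply(vocab, sbo_score, numeric(1),
                   prefix = prefix, counts = counts, lambda = lambda)
  head(sort(scores, decreasing = TRUE), L)
}

pred <- top_predictions(c("i", "love"), counts)
print(pred)
```

A prediction table in the spirit of the paragraph above would hold the result of top_predictions() for every observed prefix, so that prediction reduces to a lookup instead of rescoring the whole dictionary.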
sbo_predictor objects are intended for interactive use. If the training process is computationally heavy, one can store a "raw" version of the text predictor in an object of class sbo_predtable, which can be safely saved to disk (e.g. with save()). The saved object can be restored in another R session, and the corresponding sbo_predictor object can then be rebuilt quickly by calling the generic constructor sbo_predictor() on it (see example below).
The returned objects are of class sbo_predictor and sbo_predtable, respectively. The latter contains the Stupid Back-Off prediction tables, which store the next-word predictions for each k-gram prefix observed in the text, whereas the former is an external pointer to an equivalent (but processed) C++ structure.
Both objects have the following attributes:
N: The order of the underlying N-gram model, "N".
dict: The model dictionary.
lambda: The penalization used in the Stupid Back-Off algorithm.
L: The maximum number of next-word predictions for a given text input.
.preprocess: The function used for text preprocessing.
EOS: A length one character vector listing all (single character) end-of-sentence tokens.
Examples
# Load the package providing the functions and example data used below
library(sbo)

# Train a text predictor directly from corpus
p <- sbo_predictor(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")

# Train a text predictor from previously computed 'kgram_freqs' object
p <- sbo_predictor(twitter_freqs)

# Load a text predictor from a Stupid Back-Off prediction table
p <- sbo_predictor(twitter_predtable)

# Predict from Stupid Back-Off text predictor
p <- sbo_predictor(twitter_predtable)
predict(p, "i love")

# Build Stupid Back-Off prediction tables directly from corpus
t <- sbo_predtable(twitter_train, N = 3, dict = max_size ~ 1000,
                   .preprocess = preprocess, EOS = ".?!:;")

# Build Stupid Back-Off prediction tables from kgram_freqs object
t <- sbo_predtable(twitter_freqs)

## Not run:
# Save and reload a 'sbo_predtable' object with base::save()
# (save() requires an explicit 'file' argument)
save(t, file = "t.rda")
load("t.rda")
## End(Not run)