These methods implement word hyphenation, based on Liang's algorithm. For details, please refer to the documentation for the generic hyphen method in the sylly package.
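As a rough illustration of the idea behind Liang's algorithm, the following base R sketch scores each position between two letters with digit-annotated patterns and splits where the highest score is odd. It is not the code used by hyphen or sylly: the function names toy_hyphenate and parse_pattern are hypothetical, and the nine patterns are a tiny hand-picked set for a single word, whereas real pattern files hold thousands of entries.

```r
# Toy pattern set for the word "hyphenation" (an assumption for illustration;
# it is not taken from any pattern file shipped with a language package).
toy_patterns <- c("hy3ph", "he2n", "hena4", "hen5at", "1na",
                  "n2at", "1tio", "2io", "o2n")

# Split a pattern into its letters and one score per slot
# (in front of each letter, plus one slot after the last letter).
parse_pattern <- function(pattern) {
  chars <- strsplit(pattern, "", fixed = TRUE)[[1]]
  is_digit <- grepl("^[0-9]$", chars)
  letters_only <- chars[!is_digit]
  scores <- integer(length(letters_only) + 1L)
  slot <- 1L
  for (i in seq_along(chars)) {
    if (is_digit[i]) {
      scores[slot] <- as.integer(chars[i])
    } else {
      slot <- slot + 1L
    }
  }
  list(text = paste(letters_only, collapse = ""), scores = scores)
}

# Hypothetical helper: hyphenate a single word with the toy patterns.
toy_hyphenate <- function(word, patterns = toy_patterns, min.length = 4L) {
  m <- nchar(word)
  if (m < min.length || m < 3L) return(word)
  # dots mark the word boundaries, as in Liang's pattern notation
  dotted <- paste0(".", tolower(word), ".")
  n <- nchar(dotted)
  # gaps[k] holds the highest score seen in front of character k of `dotted`
  gaps <- integer(n + 1L)
  for (p in lapply(patterns, parse_pattern)) {
    len <- nchar(p$text)
    if (len > n) next
    for (start in seq_len(n - len + 1L)) {
      if (substr(dotted, start, start + len - 1L) == p$text) {
        idx <- start:(start + len)
        gaps[idx] <- pmax(gaps[idx], p$scores)
      }
    }
  }
  # score of the position after word letter j, for j = 1 .. m - 1
  between <- gaps[seq_len(m - 1L) + 2L]
  # never split after the first or before the last letter
  between[c(1L, m - 1L)] <- 0L
  split_after <- which(between %% 2L == 1L)
  chars <- strsplit(word, "", fixed = TRUE)[[1]]
  if (length(split_after) > 0L) {
    chars[split_after] <- paste0(chars[split_after], "-")
  }
  paste(chars, collapse = "")
}

toy_hyphenate("hyphenation")  # "hy-phen-ation"
```

Odd scores allow a split and even scores forbid one. Where several patterns cover the same position, the highest score wins, which is how a specific pattern can override a more general one.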
words: Either an object of class kRp.text, or a character vector with words to be hyphenated.
hyph.pattern: Either an object of class kRp.hyph.pat, or a valid character string naming the language of the patterns to be used. See details.
min.length: Integer, the minimum number of letters a word must have to be considered for hyphenation. hyphen will not split words after the first or before the last letter, so values smaller than 4 are not useful.
rm.hyph: Logical, whether hyphens already present in words should be removed before pattern matching.
corp.rm.class: A character vector with word classes which should be ignored. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, tags=c("punct","sentc"), list.classes=TRUE) to be used. Relevant only if words is a valid koRpus object.
corp.rm.tag: A character vector with POS tags which should be ignored. Relevant only if words is a valid koRpus object.
quiet: Logical. If FALSE, short status messages will be shown.
cache: Logical. hyphen() can cache results to speed up the process. If this option is set to TRUE, the current cache will be queried and new tokens will also be added to it. Caches are language-specific and reside in an environment, i.e., they are discarded at the end of a session. If you want to save them for later use, see the option hyph.cache.file in set.kRp.env.
as: A character string defining the class of the object to be returned. Defaults to "kRp.hyphen", but can also be set to "data.frame" or "numeric", returning only the central data.frame or the numeric vector of counted syllables, respectively. For the latter two options, you can alternatively use the shortcut methods hyphen_df or hyphen_c. Ignored if as.feature=TRUE.
as.feature: Logical, whether the output should be just the analysis results or the input object with the results added as a feature. Use corpusHyphen to get the results from such an aggregated object. If set to TRUE, as="kRp.hyphen" is set automatically, overriding any other setting of as with a warning.
Returns
An object of class kRp.text, kRp.hyphen, data.frame or a numeric vector, depending on the values of the as and as.feature arguments.
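The following sketch shows how the as argument and the shortcut method hyphen_c described above might be used to select the return type. It assumes the koRpus.lang.en package is installed and that as is accepted for character input; the variable names are illustrative only.

```r
# Runs only if the English language package can be loaded (assumption:
# loading koRpus.lang.en also attaches koRpus and sylly).
if (require("koRpus.lang.en", quietly = TRUE)) {
  # default: an object of class kRp.hyphen
  hyph_obj <- hyphen("interference", hyph.pattern = "en", quiet = TRUE)

  # only the central data.frame
  hyph_df <- hyphen(
    "interference", hyph.pattern = "en", quiet = TRUE, as = "data.frame"
  )

  # only the numeric vector of syllable counts, via the shortcut method
  hyph_num <- hyphen_c("interference", hyph.pattern = "en", quiet = TRUE)
}
```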
Examples
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )

  # call hyphen on a given english word
  # "quiet=TRUE" suppresses the progress bar
  hyphen(
    "interference",
    hyph.pattern="en",
    quiet=TRUE
  )

  # call hyphen() on a tokenized text
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # language definition is defined in the object
  # if you call hyphen() without arguments,
  # you will get its results directly
  hyphen(tokenized.obj)

  # alternatively, you can also store those results as a
  # feature in the object itself
  tokenized.obj <- hyphen(
    tokenized.obj,
    as.feature=TRUE
  )
  # results are now part of the object
  hasFeature(tokenized.obj)
  corpusHyphen(tokenized.obj)
} else {}
References
Liang, F.M. (1983). Word Hy-phen-a-tion by Com-put-er. Dissertation, Stanford University, Dept. of Computer Science.