orthoCoding function

Code a character string (written word form) as letter n-grams

orthoCoding codes a character string into unigrams, bigrams, ..., n-grams. By default it produces bigrams. If tokenization is not at the letter/character level, a token separator can be provided.

orthoCoding(strings=c("hel.lo","wor.ld"), grams = c(2), tokenized = F, sepToken = '.')

Arguments

  • strings: A character vector of strings (usually words) to be recoded as n-grams.
  • grams: A numeric vector giving the n-gram sizes to produce. For example, grams=c(1,3) creates unigram and trigram cues from the input.
  • tokenized: If tokenized is FALSE (the default), the input strings are split into letters/characters. If it is set to TRUE, the strings are split on the value of sepToken.
  • sepToken: A string that defines which character will be used to separate tokens when tokenized is TRUE. Defaults to the "." character.
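The difference between the two splitting modes described above can be illustrated with a minimal base R sketch. The helper split_tokens is hypothetical and is not part of the package; it only mirrors the documented meaning of tokenized and sepToken, on the assumption that the separator is treated as a literal string.

```r
# Hypothetical helper (not part of the package): split one string into
# tokens, following the documented meaning of `tokenized` and `sepToken`.
split_tokens <- function(s, tokenized = FALSE, sepToken = ".") {
  if (tokenized) {
    # fixed = TRUE treats sepToken literally, so "." is not a regex wildcard
    strsplit(s, sepToken, fixed = TRUE)[[1]]
  } else {
    # An empty split pattern breaks the string into single characters
    strsplit(s, "", fixed = TRUE)[[1]]
  }
}

split_tokens("hel.lo")                    # "h" "e" "l" "." "l" "o"
split_tokens("hel.lo", tokenized = TRUE)  # "hel" "lo"
```

With tokenized left at FALSE the separator is just another character, which is why it shows up as its own element in the first call.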

Returns

A character vector with one element for each string in the input vector strings. Each element contains that string's n-grams, joined by underscores.
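As a rough illustration of this return shape, the following base R sketch builds underscore-joined character n-grams, one result per input string. The function ngram_sketch is hypothetical and is not the package implementation; in particular, it adds no word-boundary markers, so the output of orthoCoding itself may differ.

```r
# Hypothetical sketch (not the package implementation): build
# underscore-joined character n-grams for each input string.
# No word-boundary markers are added here.
ngram_sketch <- function(strings, grams = 2) {
  one_string <- function(s) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]
    pieces <- unlist(lapply(grams, function(n) {
      # A string shorter than n has no n-grams of that size
      if (length(chars) < n) return(character(0))
      starts <- seq_len(length(chars) - n + 1)
      vapply(starts,
             function(i) paste(chars[i:(i + n - 1)], collapse = ""),
             character(1))
    }))
    paste(pieces, collapse = "_")
  }
  # One output element per input string
  vapply(strings, one_string, character(1), USE.NAMES = FALSE)
}

ngram_sketch(c("hello", "world"), grams = 2)  # "he_el_ll_lo" "wo_or_rl_ld"
ngram_sketch("cat", grams = c(1, 3))          # "c_a_t_cat"
```

Passing several sizes in grams concatenates the n-grams of each size in the order given, which matches the documented idea of combining, say, unigram and trigram cues in one string.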

References

Baayen, R. H., Milin, P., Filipovic Durdevic, D., Hendrix, P., and Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118, 438-482.

Author(s)

Cyrus Shaoul, Peter Hendrix and Harald Baayen

See Also

See also estimateWeights.

Examples

# Default: split the input into letters/characters
orthoCoding(tokenized=FALSE)

# Tokenize on a specific separator character
orthoCoding(tokenized=TRUE)

# Compare different n-gram sizes
data(serbian)
serbian$Cues=orthoCoding(serbian$WordForm, grams=2)
head(serbian$Cues)
serbian$Cues=orthoCoding(serbian$WordForm, grams=c(2,4))
head(serbian$Cues)

  • Maintainer: Tino Sering
  • License: GPL-3
  • Last published: 2018-09-10