freq.analysis-methods() R function from [koRpus]

Analyze word frequencies

The function freq.analysis analyzes texts with regard to the frequencies of tokens, word classes, etc.


Usage

freq.analysis(txt.file, ...)

## S4 method for signature 'kRp.text'
freq.analysis(
  txt.file,
  corp.freq = NULL,
  desc.stat = TRUE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c()
)

Arguments

txt.file: An object of class kRp.text.
...: Additional options for the generic.
corp.freq: An object of class kRp.corp.freq.
desc.stat: Logical, whether an updated descriptive statistical analysis should be conducted.
corp.rm.class: A character vector with word classes which should be ignored for frequency analysis. The default value "nonpunct" has special meaning and will cause the result of kRp.POS.tags(lang, tags=c("punct","sentc"), list.classes=TRUE) to be used.
corp.rm.tag: A character vector with POS tags which should be ignored for frequency analysis.

Returns

An updated object of class kRp.text with the added feature freq, which is a list with information on the word frequencies of the analyzed text. Use corpusFreq to retrieve that feature.

Details

The method adds new columns with frequency information to the tokens data frame of the input object, describing how often each token is used in the additionally provided corpus frequency object.
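The idea of attaching corpus frequencies to a tokens data frame can be illustrated with a minimal base R sketch. This is not the package's implementation: the data frames, the column names (token, word, per_million) and the case-insensitive matching are assumptions made purely for illustration.

```r
# hypothetical tokens data frame, standing in for the tokens slot
tokens_df <- data.frame(
  token = c("The", "cat", "sat", "."),
  stringsAsFactors = FALSE
)

# hypothetical corpus frequency table (occurrences per million words)
corp_df <- data.frame(
  word = c("the", "cat", "dog"),
  per_million = c(60000, 40, 55),
  stringsAsFactors = FALSE
)

# look up each token in the corpus table; lowercasing is an assumption here
idx <- match(tolower(tokens_df$token), corp_df$word)

# add the frequency as a new column; tokens not found in the corpus get NA
tokens_df$per_million <- corp_df$per_million[idx]
tokens_df
```

Tokens without an entry in the corpus table ("sat" and "." in this toy data) end up with NA in the new column.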

To get the results, you can use taggedText to get the tokens slot, describe to get the raw descriptive statistics (only updated if desc.stat=TRUE), and corpusFreq to get the data from the added freq feature.

If corp.freq provides appropriate idf values for the types in txt.file, the term frequency–inverse document frequency statistic (tf-idf) will also be computed. Missing idf values will result in NA.
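As a rough illustration of how missing idf values propagate, here is a minimal base R sketch of a generic tf-idf computation. It assumes the textbook definition (relative term frequency multiplied by idf), which may differ in detail from what freq.analysis computes internally; the token vector and idf values are invented toy data.

```r
# toy token vector standing in for the tokens of a text (hypothetical data)
tokens <- c("the", "cat", "sat", "on", "the", "mat")

# hypothetical idf values per type; "mat" is deliberately missing
idf <- c(the = 0.1, cat = 1.5, sat = 2.0, on = 0.3)

# relative term frequency of each type in the text
tf <- table(tokens) / length(tokens)

# look up idf by type name; types without an idf entry yield NA
tfidf <- as.numeric(tf) * idf[names(tf)]
names(tfidf) <- names(tf)
tfidf
```

Because "mat" has no idf entry, its tf-idf value is NA, while all other types receive a numeric score.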

Examples


# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # call freq.analysis() on a tokenized text
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # the token slot before frequency analysis
  head(taggedText(tokenized.obj))

  # instead of data from a larger corpus, we'll
  # use the token frequencies of the text itself
  tokenized.obj <- freq.analysis(
    tokenized.obj,
    corp.freq=read.corp.custom(tokenized.obj)
  )
  # compare the columns after the analysis
  head(taggedText(tokenized.obj))

  # the object now has further statistics in a
  # new feature slot called freq
  hasFeature(tokenized.obj)
  corpusFreq(tokenized.obj)
} else {}
