TextData() R function from [Xplortext]

Building textual and contextual tables (TextData)

Creates a textual and contextual working-base (TextData format) from a source-base (data frame format). UTF-8


TextData(base, var.text=NULL, var.agg=NULL, context.quali=NULL, context.quanti= NULL,
 selDoc="ALL", lower=TRUE, remov.number=TRUE,lminword=1, Fmin=Dmin,Dmin=1, Fmax=Inf,
 stop.word.tm=FALSE, idiom="en", stop.word.user=NULL, segment=FALSE,
 sep.weak="default",
 sep.strong="\u005B()\u00BF?./:\u00A1!=;{}\u005D\u2026", seg.nfreq=10, seg.nfreq2=10,
 seg.nfreq3=10, graph=FALSE)

Arguments

base: source data frame with at least one textual column
var.text: vector with index(es) or name(s) of the selected textual column(s) (by default NULL)
var.agg: index or name of the aggregation categorical variable (by default NULL)
context.quali: vector with index(es) or name(s) of the selected categorical variable(s) (by default NULL)
context.quanti: vector with index(es) or name(s) of the selected quantitative variable(s) (by default NULL)
selDoc: vector with index(es) or name(s) of the selected source-documents (rows of the source-base) (by default "ALL")
lower: if TRUE, the corpus is converted into lowercase (by default TRUE)
remov.number: if TRUE, numbers are removed (by default TRUE)
lminword: minimum length of a word to be selected (by default 1)
Fmin: minimum frequency of a word to be selected (by default Dmin)
Dmin: a word has to be used in at least Dmin source-documents to be selected (by default 1)
Fmax: maximum frequency of a word to be selected (by default Inf)
stop.word.tm: if TRUE, stoplist automatically provided in accordance with the idiom (by default FALSE)
idiom: declared idiom for the textual column(s) (by default English "en", see IETF language in package NLP)
stop.word.user: stoplist provided by the user
segment: if TRUE, the repeated segments are identified (by default FALSE)
sep.weak: string with the characters marking out the terms (by default punctuation characters, space and control). See details
sep.strong: string with the characters marking out the repeated segments (by default "[()¿?./:¡!=;{}]…", i.e. the value "\u005B()\u00BF?./:\u00A1!=;{}\u005D\u2026" shown in the usage)
seg.nfreq: minimum frequency of a more-than-three-words-long repeated segment (by default 10)
seg.nfreq2: minimum frequency of a two-words-long repeated segment (by default 10)
seg.nfreq3: minimum frequency of a three-words-long repeated segment (by default 10)
graph: if TRUE, documents, words and repeated segments barcharts are displayed; use plot.TextData to use more options (by default FALSE)
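
The word-selection thresholds (lminword, Fmin, Dmin, Fmax) can be illustrated on a toy corpus. This is only a base R sketch of the filtering idea, not the internal code of TextData: the corpus, the space-only tokenisation and the variable names other than the four thresholds are assumptions made for the example.

```r
# Toy corpus: each element is one source-document (hypothetical data).
docs <- c("the cat sat on the mat",
          "the dog sat",
          "a cat and a dog")

# Tokenise on spaces only; TextData itself relies on sep.weak for this step.
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# Total frequency of each word in the corpus (compared with Fmin and Fmax).
freq <- table(unlist(tokens))

# Number of documents in which each word occurs (compared with Dmin).
doc_freq <- table(unlist(lapply(tokens, unique)))

# Illustrative thresholds, named after the TextData arguments.
lminword <- 2; Fmin <- 2; Dmin <- 2; Fmax <- Inf

words <- names(freq)
keep <- nchar(words) >= lminword &
  as.vector(freq) >= Fmin &
  as.vector(doc_freq[words]) >= Dmin &
  as.vector(freq) <= Fmax
selected <- words[keep]
print(selected)
```

With these thresholds, "a" is dropped for being too short and "and", "mat" and "on" for being too rare, so only the words used at least twice and in at least two documents remain.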

Returns

A list including:

summGen: general summary
summDoc: document summary
indexW: index of words
DocTerm: working lexical table (non-aggregate or aggregate table depending on var.agg value); working-documents by words table in slam package compressed format
context: contextual variables if context.quali or context.quanti are non-NULL; the structure greatly differs in accordance with the nature of DocTerm table (non-aggregate/aggregate), see details
info: information about the selection of words
var.agg: a one-column data frame with the values of the aggregation variable; NULL if non-aggregate analysis
SourceTerm: in the case of DocTerm being an aggregate analysis, the source-documents by words table is kept in this data structure, in slam package compressed format
indexS: working-documents by repeated-segments table, in slam package compressed format
remov.docs: vector with the names of the removed empty source-documents
VCr: Cramer's V coefficient of document x term matrix
Inertia: total inertia of document x term matrix
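
The VCr and Inertia elements can be related to each other on a small table. The sketch below assumes the usual definitions (total inertia as the Pearson chi-square statistic divided by the grand total, and Cramér's V derived from it); whether TextData computes them in exactly this way is not stated here, and the matrix N is hypothetical.

```r
# Toy documents-by-words count table (hypothetical data).
N <- matrix(c(4, 1, 0,
              1, 3, 2,
              0, 2, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("w1", "w2", "w3")))

n <- sum(N)
expected <- outer(rowSums(N), colSums(N)) / n   # counts under independence
chi2 <- sum((N - expected)^2 / expected)        # Pearson chi-square statistic

inertia <- chi2 / n                             # total inertia of the table
cramer_v <- sqrt(inertia / min(nrow(N) - 1, ncol(N) - 1))

print(c(inertia = inertia, cramer_v = cramer_v))
```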

Details

Each row of the source-base is considered as a source-document. The TextData function builds the working-documents-by-words table that is submitted to the analysis.

sep.weak contains the string with the characters marking out the terms (by default punctuation characters, space and control). A backslash or double backslash is used to start an escape sequence defining a special character. Special characters must be separated by the symbol | (or) in sep.weak and sep.strong. The default is: sep.weak = ("[%`:*$&#/^|<=>;'+@.,~?(){}|[[:space:]]|\u2014|\u002D|\u00A1|\u0021|\u00BF|\u00AB|\u00BB|\u2026|\u0022|\u005D|\u0097")

Some special characters can be introduced as Unicode characters. Backslash (escape control) is not allowed.
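
The role of a separator string made of alternatives joined by | can be sketched with base R. The pattern below is a deliberately simplified stand-in for sep.weak (it is not the package default), the example sentence is hypothetical, and the use of strsplit is an assumption made for illustration rather than a description of how TextData tokenises.

```r
# Simplified separator pattern: alternatives joined by "|", with some
# characters written as Unicode escapes (\u00BF, \u00A1 and \u2026).
sep_demo <- "[.,;:!?()]|[[:space:]]|\u00BF|\u00A1|\u2026"

# Hypothetical text containing inverted punctuation and an ellipsis.
text <- "\u00BFHola? Yes, no\u2026 maybe!"

# Split on any separator, then drop the empty strings left between
# consecutive separators and convert to lowercase (as with lower = TRUE).
pieces <- strsplit(text, sep_demo)[[1]]
terms <- tolower(pieces[nzchar(pieces)])
print(terms)
```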

Information related to context.quanti and context.quali arguments:

A numeric contextual variable can be included in both vectors. In that case, TextData converts the numeric variable into a factor to include it in the context.quali vector. This is useful, for example, when treating open-ended questions: the correlation between the contextual variable "Age" and the axes can be computed and, at the same time, the trajectory of the different values of "Age" (year by year) can be drawn on the CA maps.
If one or several columns with textual data are not selected in the vector var.text and the argument context.quali is equal to "ALL", these columns are considered as categorical variables.
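
The double use of a numeric contextual variable can be sketched in base R. The data frame below is hypothetical and the object names are chosen for the example; the sketch only shows the two views of the same column, not the conversion code used by TextData.

```r
# Hypothetical source-base with one numeric contextual variable.
base <- data.frame(Age = c(25, 40, 25, 63),
                   text = c("good", "bad", "fine", "good"))

# Quantitative view: the numeric values, usable for correlations with axes.
age_quanti <- base$Age

# Categorical view: each distinct age becomes one category, so each value
# of Age can be positioned separately on a map.
age_quali <- factor(base$Age)

print(levels(age_quali))
print(table(age_quali))
```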

Non-aggregate table versus aggregate table.

If var.agg=NULL:

The working-documents are the non-empty source-documents.
DocTerm: non-aggregate lexical table with:

as many rows as non-empty source-documents
as many columns as words are selected.
context$quali: data frame crossing the non-empty source-documents (rows) and the categorical contextual-variables (columns).
context$quanti: data frame crossing the non-empty source-documents (rows) and the quantitative contextual-variables (columns). Both contextual tables can be juxtaposed row-wise to DocTerm table.
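
The non-aggregate structure can be pictured with a base R sketch. It assumes that a source-document counts as empty when it keeps none of the selected words; the tables and object names are hypothetical, and a dense matrix is used instead of the slam compressed format.

```r
# Toy documents-by-words counts after word selection (hypothetical data);
# doc3 contains none of the selected words.
counts <- matrix(c(2, 1,
                   0, 3,
                   0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"), c("w1", "w2")))
quanti <- data.frame(Age = c(31, 45, 58), row.names = rownames(counts))

empty <- rowSums(counts) == 0
removed_docs <- rownames(counts)[empty]           # plays the role of remov.docs
doc_term <- counts[!empty, , drop = FALSE]        # plays the role of DocTerm
context_quanti <- quanti[!empty, , drop = FALSE]  # same rows as doc_term

# Same rows in the same order, so the tables can be placed side by side.
juxtaposed <- cbind(as.data.frame(doc_term), context_quanti)
print(juxtaposed)
```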



If var.agg is NON-NULL:

The working-documents are aggregate-documents, obtained by aggregating the source-documents according to the categories of the aggregation variable; the aggregate-documents inherit the names of the corresponding categories.
DocTerm is an aggregate table with:

as many rows as categories the aggregation variable has
as many columns as words are selected.
context$quali$qualitable: juxtaposes as many supplementary aggregate tables as categorical contextual variables. Each table has:

as many rows as categories the contextual categorical variable has
as many columns as selected words, i.e. as many columns as DocTerm has.
context$quali$qualivar: names of categories of the supplementary categorical variables.
context$quanti: data frame crossing the working aggregate-documents (rows) and the quantitative contextual-variables (columns). The value for an active aggregate-document is the mean-value of the source-documents belonging to this aggregate-document.
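
The aggregate structure can be sketched in base R. The code below is only an illustration of the aggregation described above, with hypothetical data and object names; it uses dense matrices and does not reproduce the internal code of TextData.

```r
# Toy non-aggregate table: 4 source-documents by 3 words (hypothetical data).
source_term <- matrix(c(1, 0, 2,
                        0, 1, 1,
                        3, 0, 0,
                        0, 2, 1),
                      nrow = 4, byrow = TRUE,
                      dimnames = list(paste0("doc", 1:4),
                                      c("w1", "w2", "w3")))

group  <- c("A", "B", "A", "B")   # aggregation variable (role of var.agg)
gender <- c("F", "F", "M", "F")   # categorical contextual variable
age    <- c(20, 30, 40, 50)       # quantitative contextual variable

# Aggregate lexical table: word counts summed within each category of group.
doc_term <- rowsum(source_term, group)

# Supplementary aggregate table for one categorical contextual variable:
# one row per category, same columns as doc_term.
quali_table <- rowsum(source_term, gender)

# Quantitative context: mean over the source-documents of each
# aggregate-document.
quanti <- tapply(age, group, mean)

print(doc_term)
print(quali_table)
print(quanti)
```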



References

Lebart, L., Salem, A., & Berry, L. (1998). Exploring textual data. Dordrecht: Kluwer. doi:10.1007/978-94-017-1525-6

Author(s)

Ramón Alvarez-Esteban <ramon.alvarez@unileon.es>, Monica Bécue-Bertaut, Josep-Antón Sánchez-Espigares

Examples


# Non-aggregate analysis
library(Xplortext)
data(open.question)
res.TD<-TextData(open.question, var.text=c(9,10), remov.number=TRUE, Fmin=10, Dmin=10,  
 stop.word.tm=TRUE, context.quali=c("Gender","Age_Group","Education"), context.quanti=c("Age"))

# Aggregate analysis and repeated segments
data(open.question)
res.TD<-TextData(open.question, var.text=c(9,10), var.agg="Gen_Age", remov.number=TRUE, 
 Fmin=10, Dmin=10, stop.word.tm=TRUE, context.quali=c("Gender","Age_Group","Education"), 
 context.quanti=c("Age"), segment=TRUE)

Xplortext package

Maintainer: Ramón Alvarez-Esteban
License: GPL (>= 2.0)
Last published: 2024-11-13
https://xplortext.unileon.es
