corpustools 0.5.2 package

Managing, Querying and Analyzing Tokenized Text

add_multitoken_label

Choose and add multitoken strings based on multitoken categories

agg_label

Helper function for aggregate_rsyntax

agg_tcorpus

Aggregate the tokens data

aggregate_rsyntax

Aggregate rsyntax annotations

as.tcorpus.default

Force an object to be a tCorpus class

as.tcorpus

Force an object to be a tCorpus class

as.tcorpus.tCorpus

Force an object to be a tCorpus class

backbone_filter

Extract the backbone of a network.

browse_hits

View hits in a browser

browse_texts

Create and view a full text browser

calc_chi2

Vectorized computation of the chi^2 statistic for a 2x2 crosstab

compare_corpus

Compare tCorpus vocabulary to that of another (reference) tCorpus

compare_documents

Calculate the similarity of documents

compare_subset

Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus

count_tcorpus

Count results of search hits, or of a given feature in tokens

create_tcorpus

Create a tCorpus

docfreq_filter

Support function for subset method

dtm_compare

Compare two document term matrices

dtm_wordcloud

Plot a word cloud from a dtm

ego_semnet

Create an ego network

export_span_annotations

Export span annotations

feature_associations

Get common nearby features given a query or query hits

feature_stats

Feature statistics

fold_rsyntax

Fold rsyntax annotations

freq_filter

Support function for subset method

get_dtm

Create a document term matrix.

get_global_i

Compute global feature positions

get_kwic

Get keyword-in-context (KWIC) strings

get_stopwords

Get a character vector of stopwords

laplace

Laplace (i.e. add constant) smoothing

melt_quanteda_dict

Convert a quanteda dictionary to a long data.table format

merge_tcorpora

Merge tCorpus objects

plot_semnet

Visualize a semnet network

plot_words

Plot a wordcloud with words ordered and coloured according to a dimension

plot.contextHits

S3 plot for contextHits class

plot.featureAssociations

visualize feature associations

plot.featureHits

S3 plot for featureHits class

plot.vocabularyComparison

visualize vocabularyComparison

preprocess_tokens

Preprocess tokens in a character vector

print.contextHits

S3 print for contextHits class

print.featureHits

S3 print for featureHits class

print.tCorpus

S3 print for tCorpus class

refresh_tcorpus

Refresh a tCorpus object using the current version of corpustools

require_package

Check if package with given version exists

search_contexts

Search for documents or sentences using Boolean queries

search_dictionary

Dictionary lookup

search_features

Find tokens using a Lucene-like search query

semnet_window

Create a semantic network based on the co-occurrence of tokens in token windows

semnet

Create a semantic network based on the co-occurrence of tokens in documents

set_network_attributes

Set some default network attributes for pretty plotting

sgt

Simple Good Turing smoothing

show_udpipe_models

Show the names of udpipe models

subset_query

Subset tCorpus token data using a query

subset.tCorpus

S3 subset for tCorpus class

summary.contextHits

S3 summary for contextHits class

summary.featureHits

S3 summary for featureHits class

summary.tCorpus

Summary of a tCorpus object

tc_plot_tree

Visualize a dependency tree

tCorpus_compare

Corpus comparison

tCorpus_create

Creating a tCorpus

tCorpus_data

Methods and functions for viewing, modifying and subsetting tCorpus data

tCorpus_docsim

Document similarity

tCorpus_features

Preprocessing, subsetting and analyzing features

tCorpus_modify_by_reference

Modify tCorpus by reference

tCorpus_querying

Use Boolean queries to analyze the tCorpus

tCorpus_semnet

Feature co-occurrence based semantic network analysis

tCorpus_topmod

Topic modeling

tCorpus-cash-annotate_rsyntax

Annotate tokens based on rsyntax queries

tCorpus-cash-code_dictionary

Dictionary lookup

tCorpus-cash-code_features

Code features in a tCorpus based on a search string

tCorpus-cash-context

Get a context vector

tCorpus-cash-deduplicate

Deduplicate documents

tCorpus-cash-delete_columns

Delete column from the data and meta data

tCorpus-cash-feats_to_columns

Cast the "feats" column in UDpipe tokens to columns

tCorpus-cash-feature_subset

Filter features

tCorpus-cash-fold_rsyntax

Fold rsyntax annotations

tCorpus-cash-get

Access the data from a tCorpus

tCorpus-cash-lda_fit

Estimate an LDA topic model

tCorpus-cash-merge

Merge the token and meta data.tables of a tCorpus with another data.frame

tCorpus-cash-preprocess

Preprocess feature

tCorpus-cash-replace_dictionary

Replace tokens with dictionary match

tCorpus-cash-search_recode

Recode features in a tCorpus based on a search string

tCorpus-cash-set_levels

Change levels of factor columns

tCorpus-cash-set_name

Change column names of data and meta data

tCorpus-cash-set

Modify the token and meta data.tables of a tCorpus

tCorpus-cash-subset_query

Subset tCorpus token data using a query

tCorpus-cash-subset

Subset a tCorpus

tCorpus-cash-udpipe_clauses

Add columns indicating who did what

tCorpus-cash-udpipe_quotes

Add columns indicating who said what

tCorpus

tCorpus: a corpus class for tokenized texts

tokens_to_tcorpus

Create a tcorpus based on tokens (i.e. preprocessed texts)

tokenWindowOccurence

Gives the window in which a term occurred in a matrix.

top_features

Show top features

transform_rsyntax

Apply rsyntax transformations

udpipe_clause_tqueries

Get a list of tqueries for extracting who did what

udpipe_quote_tqueries

Get a list of tqueries for extracting quotes

udpipe_simplify

Simplify tokenIndex created with the udpipe parser

udpipe_spanquote_tqueries

Get a list of tqueries for finding candidates for span quotes.

udpipe_tcorpus

Create a tCorpus using udpipe

untokenize

Reconstruct original texts

Provides text analysis in R, focusing on the use of a tokenized text format. In this format, the positions of tokens are maintained, and each token can be annotated (e.g., part-of-speech tags, dependency relations). Prominent features include advanced Lucene-like querying for specific tokens or contexts (e.g., documents, sentences), similarity statistics for words and documents, exporting to DTM for compatibility with many text analysis packages, and the ability to reconstruct original texts from tokens to facilitate interpretation.
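A minimal sketch of this workflow, using the `sotu_texts` demo data shipped with corpustools (column names and the exact output shapes depend on your corpustools version):

```r
library(corpustools)

## Create a tCorpus from raw text; split_sentences adds sentence positions
tc <- create_tcorpus(sotu_texts, doc_column = 'id', split_sentences = TRUE)

## Lucene-like query: find tokens matching "terror*", labelled "terrorism"
hits <- search_features(tc, query = 'terror*', code = 'terrorism')
summary(hits)

## Inspect hits with keyword-in-context strings
head(get_kwic(tc, query = 'terror*'))

## Export a document-term matrix for use with other text analysis packages
dtm <- get_dtm(tc, feature = 'token')

## Reconstruct the original texts from the tokens
texts <- untokenize(tc)
```

This follows the pattern used throughout the reference above: `create_tcorpus` builds the corpus, `search_features`/`get_kwic` handle querying, and `get_dtm`/`untokenize` cover export and interpretation.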