Functions for Text Cleansing and Text Analysis
Remove words from a vector based on the number of characters in each w...
Create a vector of English words associated with particular parts of s...
Calls base::tolower(), which converts letters to lowercase. Only inclu...
Weighted count of the words in a vector that are found in another vect...
Convert a data.table column of character vectors into a column with on...
Flag rows in a text.table with specific words.
Parts of speech for English words from the Moby Project.
Add a column with the parts of speech for each word in a text.table.
Create n-grams.
Parts of speech for English words from the Moby Project.
Regular expression that might be used to split strings of text into co...
Regular expression that might be used to split strings of text into co...
Regular expression that might be used to split strings of text into co...
Delete rows in a text.table where the number of identical records with...
Delete rows in a text.table where the number of identical records with...
Delete rows in a text.table where the word has more than a maximum num...
Delete rows in a text.table where the records within a group are not a...
Delete rows in a text.table where the records within a group are also ...
Delete rows in a text.table where the word has a certain part of speec...
Delete rows in a text.table where the record has a certain pattern ind...
Delete rows in a text.table where the word has fewer than a minimum num...
Remove rows from a text.table with specific words.
Generate (pseudo)random strings of the specified character length.
Vector of lowercase English stop words.
Detect if there are any words in a vector also found in another vector...
Count the intersecting words in a vector that are found in another vec...
Calculate the size of the intersection divided by the size of the union of two vectors of words (the Jaccard similarity).
Count the words in a vector that are found in another vector.
Count the words in a vector that are not found in another vector.
Count words from a vector that are found in the same position in anoth...
Count words from a vector that are not found in the same position in a...
Count the words in a vector that don't intersect with another vector (...
Create a list of a vector of unique words found in x and a vector of t...
Combine columns of a data.table into a list in a new column, wraps lis...
Extract words from a vector that are found in another vector.
Extract words from a vector that are not found in another vector.
Extract words from a vector that are found in the same position in ano...
Extract words from a vector that are not found in the same position in...
Remove or replace excess whitespace in strings.
Remove words from a vector that have more than a maximum number of cha...
Remove or replace non-alphanumeric characters in strings.
Remove or replace non-printable characters in strings.
Remove or replace numbers in strings.
Remove or replace punctuation in strings.
Remove words from a vector that match a regular expression.
Remove words from a vector that don't have a minimum number of charact...
Remove words from a vector of words found in another vector of words.
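Several of the vector-based entries above (counting matches, extracting positional matches, intersect divided by union) can be illustrated with a minimal base R sketch. The helper names below are hypothetical and written only for illustration; they are not the package's exported functions, whose names and signatures are not shown in this index.

```r
# Hypothetical helpers in base R; NOT the package's own functions.
x <- c("the", "quick", "brown", "fox")
y <- c("the", "lazy", "brown", "dog")

# Count the words in x that are also found anywhere in y.
count_matches <- function(x, y) sum(x %in% y)

# Count the words in x that are not found in y.
count_nomatches <- function(x, y) sum(!(x %in% y))

# Extract the words in x that appear at the same position in y.
# Only positions present in both vectors are compared.
extract_positional_matches <- function(x, y) {
  n <- min(length(x), length(y))
  idx <- seq_len(n)
  x[idx][x[idx] == y[idx]]
}

# Size of the intersection divided by the size of the union
# (Jaccard similarity). Returns NA when both vectors are empty.
jaccard_similarity <- function(x, y) {
  u <- union(x, y)
  if (length(u) == 0) return(NA_real_)
  length(intersect(x, y)) / length(u)
}

n_found     <- count_matches(x, y)               # "the" and "brown"
n_not_found <- count_nomatches(x, y)             # "quick" and "fox"
same_pos    <- extract_positional_matches(x, y)  # positions 1 and 3
jaccard     <- jaccard_similarity(x, y)          # 2 shared / 6 distinct
```

Note that `%in%` counts every occurrence in `x`, while `intersect()` and `union()` work on unique words, so repeated words affect the counts but not the similarity.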
A framework for text cleansing and analysis that prepares and processes large amounts of text conveniently. It includes metrics for word counts and frequencies that scale efficiently. Large amounts of text can be analyzed quickly using a text.table: a data.table with one word, or other unit of text analysis, per row, similar to the tidytext format. Functions work both with text stored in vectors and with text formatted as a text.table.
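The one-word-per-row layout described above can be sketched in base R. This is only an illustration under stated assumptions: a plain data.frame stands in for the data.table that a text.table actually is, and the column names `doc` and `word`, the whitespace tokenizer, and the `min_chars` threshold are all hypothetical choices rather than the package's defaults.

```r
# Assumption: a base data.frame stands in for a data.table-backed text.table.
docs <- c(a = "The quick brown fox", b = "The lazy dog")
doc_ids <- names(docs)

# Lowercase the text, then split each document on runs of whitespace.
tokens <- strsplit(tolower(unname(docs)), "\\s+")

# One row per word, keeping track of the document each word came from.
text_table <- data.frame(
  doc  = rep(doc_ids, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

# Delete rows whose word has fewer than a minimum number of characters.
min_chars <- 4  # hypothetical threshold
long_words <- text_table[nchar(text_table$word) >= min_chars, ]

# Word counts per document after filtering.
words_per_doc <- table(long_words$doc)
```

Keeping one word per row means that filtering, flagging, and counting all reduce to ordinary row operations grouped by document, which is what lets the same steps scale with data.table.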