Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit
Load a UDPipe model
Read in a CONLL-U file as a data.frame
Train a UDPipe model
Create a unique identifier for each combination of fields in a data fr...
Create a data.frame from a list of tokens
Convert the result of udpipe_annotate to a tidy data frame
Convert the result of cooccurrence to a sparse matrix
Convert a data.frame to CONLL-U format
Convert a matrix to a co-occurrence data.frame
Combine labels and text as used in fasttext
Convert Parts of Speech tags to one-letter tags which can be used to i...
Convert a matrix of word vectors to word2vec format
Add the dependency parsing information to an annotated dataset
Add morphological features to an annotated dataset
Create a cooccurrence data.frame
Aggregate a data.frame to the document/term level by calculating how m...
Add Term Frequency, Inverse Document Frequency and Okapi BM25 statisti...
Create a document/term matrix
Reorder a Document-Term-Matrix alongside a vector or data.frame
Combine 2 document term matrices either by rows or by columns
Compare term usage across 2 document groups using the Chi-square Test ...
Column sums and Row sums for document term matrices
Make sure a document term matrix has exactly the specified rows and co...
Pearson Correlation for Sparse Matrices
Remove terms occurring with low frequency from a Document-Term-Matrix ...
Remove terms with high sparsity from a Document-Term-Matrix
Remove terms from a Document-Term-Matrix and keep only documents which...
Remove terms from a Document-Term-Matrix and documents with no terms b...
Inverse operation of the document_term_matrix function
Random samples and permutations from a Document-Term-Matrix
Semantic Similarity to a Singular Value Decomposition
Term Frequency - Inverse Document Frequency calculation
Extract collocations - a sequence of terms which follow each other
Extract phrases - a sequence of terms which follow each other based on...
Keyword identification using Rapid Automatic Keyword Extraction (RAKE)
Concatenate text of each group of data together
Predict method for an object of class LDA_VEM or class LDA_Gibbs
Obtain a tokenised data frame by splitting text alongside a regular ex...
Experimental and undocumented querying of syntax patterns
Experimental and undocumented querying of syntax relationships
Collapse a character vector while removing missing data.
Check if text contains a certain pattern
Based on a vector with a word sequence, get n-grams (looking forward +...
Count the number of times a pattern occurs in text
Frequency statistics of elements in a vector
Look up multiple patterns and indicate their presence in text
Highlight words in a character vector
Get the n-th next element of a vector
Based on a vector with a word sequence, get n-grams (looking forward)
Get the overlap between 2 vectors
Concatenate strings with options how to handle missing data
Get the n-th previous element of a vector
Based on a vector with a word sequence, get n-grams (looking backward)
Recode text to other categories
Recode words with compound multi-word expressions
Boilerplate function to sample one element from a vector.
Perform dictionary-based sentiment analysis on a tokenised data frame
Boilerplate function to cat only 1 element of a character vector.
Identify a contiguous sequence of tags as being 1 entity
Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in...
Evaluate the accuracy of your UDPipe model on holdout data
Tokenising, Lemmatising, Tagging and Dependency Parsing Annotation of ...
Download a UDPipe model provided by the UDPipe community for a specif...
This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. In addition to text parsing, the package allows you to train annotation models on 'treebank' data in 'CoNLL-U' format, as described at <https://universaldependencies.org/format.html>. The techniques are explained in detail in the paper 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>. The toolkit also contains functionality for commonly used data manipulations on texts enriched with the output of the parser: collocations, token co-occurrence, document term matrix handling, term frequency inverse document frequency calculations, information retrieval metrics (Okapi BM25), handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring and semantic similarity analysis.
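To make the term frequency inverse document frequency calculation mentioned above more concrete, here is a minimal, language-neutral sketch in Python. It is not the package's implementation: the function name `tf_idf`, the input layout (a dict of token lists) and the weighting variant (relative term frequency multiplied by the natural log of total documents over documents containing the term) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Hypothetical helper: docs maps a document id to a list of tokens.

    Returns a dict mapping each document id to {term: tf-idf score}.
    Assumed weighting: tf = term count / tokens in document,
    idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            term: (freq / total) * math.log(n_docs / df[term])
            for term, freq in counts.items()
        }
    return scores


# Toy corpus of already-tokenised documents (made-up data).
docs = {
    "d1": ["the", "parser", "tags", "tokens"],
    "d2": ["the", "parser", "builds", "trees"],
}
scores = tf_idf(docs)
```

Under this weighting, a term that occurs in every document (here "the" and "parser") scores zero, while a term unique to one document scores highest in that document. Other variants, such as smoothed idf or the Okapi BM25 saturation of term frequency, would change the numbers but not the overall idea.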