udpipe 0.8.11 package

Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

Load a UDPipe model

Read in a CoNLL-U file as a data.frame

Train a UDPipe model

Create a unique identifier for each combination of fields in a data fr...

Create a data.frame from a list of tokens

Convert the result of udpipe_annotate to a tidy data frame

Convert the result of cooccurrence to a sparse matrix

Convert a data.frame to CoNLL-U format

Convert a matrix to a co-occurrence data.frame

Combine labels and text as used in fasttext

Convert Parts of Speech tags to one-letter tags which can be used to i...

Convert a matrix of word vectors to word2vec format

Add the dependency parsing information to an annotated dataset

Add morphological features to an annotated dataset

Create a cooccurrence data.frame

Aggregate a data.frame to the document/term level by calculating how m...

Add Term Frequency, Inverse Document Frequency and Okapi BM25 statisti...

Create a document/term matrix

Reorder a Document-Term-Matrix alongside a vector or data.frame

Combine 2 document term matrices either by rows or by columns

Compare term usage across 2 document groups using the Chi-square Test ...

Column sums and Row sums for document term matrices

Make sure a document term matrix has exactly the specified rows and co...

Pearson Correlation for Sparse Matrices

Remove terms occurring with low frequency from a Document-Term-Matrix ...

Remove terms with high sparsity from a Document-Term-Matrix

Remove terms from a Document-Term-Matrix and keep only documents which...

Remove terms from a Document-Term-Matrix and documents with no terms b...

Inverse operation of the document_term_matrix function

Random samples and permutations from a Document-Term-Matrix

Semantic Similarity to a Singular Value Decomposition

Term Frequency - Inverse Document Frequency calculation

Extract collocations - a sequence of terms which follow each other

Extract phrases - a sequence of terms which follow each other based on...

Keyword identification using Rapid Automatic Keyword Extraction (RAKE)

Concatenate text of each group of data together

Predict method for an object of class LDA_VEM or class LDA_Gibbs

Obtain a tokenised data frame by splitting text alongside a regular ex...

Experimental and undocumented querying of syntax patterns

Experimental and undocumented querying of syntax relationships

Collapse a character vector while removing missing data.

Check if text contains a certain pattern

Based on a vector with a word sequence, get n-grams (looking forward +...

Count the number of times a pattern occurs in text

Frequency statistics of elements in a vector

Look up multiple patterns and indicate their presence in text

Highlight words in a character vector

Get the n-th next element of a vector

Based on a vector with a word sequence, get n-grams (looking forward)

Get the overlap between 2 vectors

Concatenate strings with options how to handle missing data

Get the n-th previous element of a vector

Based on a vector with a word sequence, get n-grams (looking backward)

Recode text to other categories

Recode words with compound multi-word expressions

Boilerplate function to sample one element from a vector.

Perform dictionary-based sentiment analysis on a tokenised data frame

Boilerplate function to cat only 1 element of a character vector.

Identify a contiguous sequence of tags as being 1 entity

Tokenising, Lemmatising, Tagging and Dependency Parsing of raw text in...

Evaluate the accuracy of your UDPipe model on holdout data

Tokenising, Lemmatising, Tagging and Dependency Parsing Annotation of ...

Download a UDPipe model provided by the UDPipe community for a specif...

This natural language processing toolkit provides language-agnostic 'tokenization', 'parts of speech tagging', 'lemmatization' and 'dependency parsing' of raw text. Besides text parsing, the package also allows you to train annotation models on 'treebanks' in 'CoNLL-U' format, as described at <https://universaldependencies.org/format.html>. The techniques are explained in detail in the paper 'Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe', available at <doi:10.18653/v1/K17-3009>. The toolkit also contains functionality for commonly used data manipulations on texts enriched with the output of the parser: collocations, token co-occurrence, document term matrix handling, term frequency inverse document frequency calculations, information retrieval metrics (Okapi BM25), handling of multi-word expressions, keyword detection (Rapid Automatic Keyword Extraction, noun phrase extraction, syntactical patterns), sentiment scoring and semantic similarity analysis.
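As a rough illustration of the data manipulation side of the package, the sketch below goes from a token table to document/term frequencies, the TF-IDF and BM25 statistics, and a sparse document/term matrix. The token table is built by hand and its contents are invented for illustration. In practice it would typically be the data frame obtained from the annotation step, with a doc_id and a lemma column.

```r
library(udpipe)

# Hand-made token table standing in for the annotation output.
# The doc_id values and lemmas are invented for this example.
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  lemma  = c("cat", "chase", "mouse", "cat", "sleep"),
  stringsAsFactors = FALSE
)

# Count how often each lemma occurs in each document.
dtf <- document_term_frequencies(tokens, document = "doc_id", term = "lemma")

# Add term frequency, inverse document frequency and Okapi BM25 statistics.
stats <- document_term_frequencies_statistics(dtf)

# Cast the counts into a sparse document/term matrix (documents in rows).
dtm <- document_term_matrix(dtf)
dim(dtm)
```

Working from the document/term frequency table rather than from raw text keeps the example independent of any particular language model. The same calls should apply unchanged to a table produced by annotating real text.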

Maintainer: Jan Wijffels

License: MPL-2.0

Last published: 2023-01-06