tokenizers R package [Documentation]

basic-tokenizers: Basic tokenizers
chunk_text: Chunk text into smaller segments
ngram-tokenizers: N-gram tokenizers
ptb-tokenizer: Penn Treebank Tokenizer
shingle-tokenizers: Character shingle tokenizers
stem-tokenizers: Word stem tokenizer
tokenizers: Tokenizers
word-counting: Count words, sentences, characters

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
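
As a minimal sketch of the consistent interface described above, the following R snippet calls a few of the package's tokenizers and one counting function on the same input. The sample sentence and the variable names are illustrative assumptions, not taken from the package documentation; the functions are called with their defaults apart from `n = 2`.

```r
library(tokenizers)

# Illustrative input (an assumption for this sketch, not from the package docs)
text <- "The quick brown fox jumps over the lazy dog. It barked."

# Each tokenizer takes a character vector and returns a list
# with one character vector of tokens per input document.
words     <- tokenize_words(text)         # lowercased words, punctuation stripped by default
sentences <- tokenize_sentences(text)     # one token per sentence
bigrams   <- tokenize_ngrams(text, n = 2) # shingled word n-grams of length 2

# The counting helpers return one count per input document.
n_words <- count_words(text)

print(words[[1]])
print(sentences[[1]])
print(n_words)
```

Because every tokenizer accepts the same kind of input and returns the same kind of list, swapping one tokenizer for another should generally only require changing the function name.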

Maintainer: Lincoln Mullen
License: MIT + file LICENSE
Last published: 2022-12-22

Useful links

https://github.com/ropensci/tokenizers/issues
https://docs.ropensci.org/tokenizers/
https://github.com/ropensci/tokenizers

tokenizers 0.3.0 package
