Text Tokenization using Byte Pair Encoding and Unigram Modelling
Tokenise and embed text with a Sentencepiece and Word2vec model
Build a BPEembed model containing a Sentencepiece and Word2vec model
Encode and decode text with a BPEembed model
Read a word2vec embedding file
Construct a Sentencepiece model
Decode encoded sequences back to text
Download a Sentencepiece model
Tokenise text with a Sentencepiece model
Load a Sentencepiece model
Remove the underscore prefix ('▁') from tokenised subwords
Wordpiece encoding
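For orientation, a minimal sketch of the train / encode / decode workflow using the functions listed above. This is an illustrative sketch, not the definitive API: "train.txt" is a placeholder path for your own corpus (one sentence per line), the default model file name is assumed to be "sentencepiece.model", and the wordpiece() vocabulary is a toy example.

library(sentencepiece)

## Train a byte pair encoding model on a corpus file
## ("train.txt" is a placeholder path - replace with your own data)
model <- sentencepiece("train.txt", type = "bpe", vocab_size = 5000,
                       model_dir = tempdir())

## Tokenise text with the Sentencepiece model, as subwords or as subword ids
txt <- "The economy is weak but the outlook is bright."
sentencepiece_encode(model, txt, type = "subwords")
ids <- sentencepiece_encode(model, txt, type = "ids")

## Decode the encoded sequences back to text
sentencepiece_decode(model, ids)

## Load a previously trained model from disk
## (assumes the default model file name "sentencepiece.model")
model <- sentencepiece_load_model(file.path(tempdir(), "sentencepiece.model"))

## Remove the '▁' prefix which Sentencepiece puts in front of word-initial subwords
subwords <- sentencepiece_encode(model, txt, type = "subwords")
txt_remove_(subwords)

## Wordpiece encoding against a given vocabulary (toy vocabulary for illustration)
wordpiece("unaffable", vocabulary = c("un", "##aff", "##able"))

Besides "bpe", the training type can also be set to "char", "unigram" or "word".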
Unsupervised text tokenizer for byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer to split text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.
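As a sketch of that pretrained route: download a BPEmb model and its matching word2vec embeddings trained on Wikipedia, or fall back on the Dutch example files shipped with the package, and combine them in a BPEembed model. Argument and file names below follow the package documentation; treat them as indicative for your installed version.

library(sentencepiece)

## Download a pretrained model from https://bpemb.h-its.org;
## passing dim is assumed to also fetch the matching word2vec embeddings
dl <- sentencepiece_download_model("English", vocab_size = 5000, dim = 25,
                                   model_dir = tempdir())

## Alternatively, use the Dutch example model + embeddings shipped with the package
folder    <- system.file(package = "sentencepiece", "models")
model     <- file.path(folder, "nl.wiki.bpe.vs1000.model")
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin")

## Read the word2vec embedding file as a matrix with one row per subword
wv <- read_word2vec(embedding, type = "bin")

## Build a BPEembed model combining the Sentencepiece tokeniser and the embeddings
encoder <- BPEembed(model, embedding)

## Encode: tokenise text and retrieve the embedding of each subword;
## each element of the result is a matrix whose rownames are the subwords
txt    <- "De eigendomsoverdracht aan de deelstaten is ingewikkeld."
values <- predict(encoder, txt, type = "encode")

## Decode: map the subwords (the rownames of each embedding matrix) back to text
predict(encoder, lapply(values, rownames), type = "decode")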