Fast Text Tokenization
Byte-level decoder
Encoding
BPE model
An implementation of the Unigram algorithm
An implementation of the WordPiece algorithm
NFC normalizer
NFKC normalizer
Byte-level pre-tokenizer
This pre-tokenizer simply splits using the following regex: `\w+|[^\w...
Generic class for tokenizers
Byte-level post-processor
Generic class for decoders
Generic class for tokenization models
Generic class for normalizers
Generic class for processors
Generic training class
Tokenizer
BPE trainer
Unigram tokenizer trainer
WordPiece tokenizer trainer
Interfaces with the 'Hugging Face' tokenizers library to provide implementations of today's most widely used tokenizers, such as the 'Byte-Pair Encoding' algorithm <https://huggingface.co/docs/tokenizers/index>. The underlying library is implemented in Rust, which makes it fast at both training new vocabularies and tokenizing texts.
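
As a rough illustration of the 'Byte-Pair Encoding' idea named above, here is a minimal pure-Python sketch of learning merge rules from a small word list and applying them to new words. It is not the 'Hugging Face' tokenizers implementation and does not use this package's API. The function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, and the sketch assumes character-level starting symbols, no pre-tokenization, and lexicographic tie-breaking between equally frequent pairs.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (illustrative only)."""
    # Each distinct word becomes a tuple of single-character symbols with a frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically largest pair
        # (an assumption made here only to keep the result deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into a single symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
tokens = encode("lowest", merges)
```

With this toy corpus, the two learned merges build the symbol `low`, so an unseen word such as `slow` would be split into `s` and `low`. Production tokenizers add pre-tokenization, byte-level fallbacks, and special tokens on top of this core loop.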