py311-tokenizers

0.22.2 textproc

Fast state-of-the-art tokenizers optimized for research and production

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
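To make the alignment-tracking and pre-processing features concrete, here is a toy sketch in pure Python. It does not use the tokenizers package or reflect its implementation; the function names `tokenize_with_offsets` and `prepare`, and the special-token strings `[CLS]`, `[SEP]`, and `[PAD]`, are hypothetical choices made for illustration only.

```python
import re
from typing import List, Tuple

# (normalized token, (start, end) character span in the ORIGINAL text)
Token = Tuple[str, Tuple[int, int]]


def tokenize_with_offsets(text: str) -> List[Token]:
    """Split on whitespace and lowercase each piece, keeping original spans.

    Matching runs against the unmodified text, so every span still points
    at the original characters even though the token itself is normalized.
    """
    return [(m.group().lower(), m.span()) for m in re.finditer(r"\S+", text)]


def prepare(tokens: List[str], max_len: int, pad: str = "[PAD]",
            bos: str = "[CLS]", eos: str = "[SEP]") -> List[str]:
    """Truncate, wrap in special tokens, then pad to exactly max_len items."""
    if max_len < 2:
        raise ValueError("max_len must leave room for the two special tokens")
    body = tokens[: max_len - 2]           # truncate
    out = [bos] + body + [eos]             # add special tokens
    return out + [pad] * (max_len - len(out))  # pad


if __name__ == "__main__":
    text = "Hello  FreeBSD World"
    pairs = tokenize_with_offsets(text)
    for tok, (start, end) in pairs:
        # text[start:end] recovers the original, un-normalized substring
        print(tok, (start, end), repr(text[start:end]))
    print(prepare([tok for tok, _ in pairs], max_len=6))
```

Each token carries a span into the original string, so `text[start:end]` returns the exact source substring (for example `"FreeBSD"` for the normalized token `freebsd`). The real library's normalizers can change string length, which is why it tracks alignments explicitly; this sketch sidesteps that by matching on the original text.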

$ pkg install py311-tokenizers

github.com/huggingface/tokenizers

Origin: textproc/py-tokenizers
Size: 5.21 MiB
License: APACHE20
Maintainer: tagattie@FreeBSD.org
Dependencies: 3 packages
Required by: 5 packages

Dependencies (3)

python311 py311-huggingface-hub oniguruma

Required By (5)

py311-aider_chat py311-anthropic py311-litellm py311-sentence-transformers py311-transformers