What's Changed
- Add max_length Option to CLI Convert Tool by @apaniukov in #309
- Update Regex For Clean Tokenization Spaces by @apaniukov in #314
- Suppress warnings from 3rd party headers by @ilya-lavrenov in #316
- Update Prepend Regex by @apaniukov in #317
- Add C++ example to README by @helena-intel in #320
- Turn on UTF8Validate.REPLACE by default by @pavel-esir in #322
- make skip_tokens an input for VocabDecode (parametrize detokenization/decoding) by @pavel-esir in #325
- [JS] Add sources for nodejs package of tokenizers by @vishniakov-nikolai in #312
- Port print debug errors only if ENV VAR is set to master by @pavel-esir in #348
- [bug] Fix set tensor name for attention_mask by @praasz in #352
- Support GLM Edge and ModernBERT by @apaniukov in #356
- Support BART-G2P Tokenizer by @apaniukov in #359
- Add Tests For WordLevel Tokenizer by @apaniukov in #360
- Add information about full Tokenizers version by @ilya-lavrenov in #365
- Wordpiece Detokenizer Support by @apaniukov in #369
- Write Detailed Version To XML by @apaniukov in #372
New Contributors
- @sfblackl-intel made their first contribution in #330
- @praasz made their first contribution in #352
- @jacekpawlak made their first contribution in #370
Full Changelog: 2024.6.0.0...2025.0.0.0