
Bump tokenizers from 0.19.1 to 0.20.0 in /pytorch in the pytorch group #387

Closed
wants to merge 1 commit

Conversation

dependabot[bot]
Contributor

@dependabot dependabot bot commented on behalf of github Sep 16, 2024

Bumps the pytorch group in /pytorch with 1 update: tokenizers.

Updates tokenizers from 0.19.1 to 0.20.0

Release notes

Sourced from tokenizers's releases.

Release v0.20.0: faster encode, better python support

Release v0.20.0

This release is focused on performance and user experience.

Performance:

First off, we did a bit of benchmarking and found some room for improvement. With a few minor changes (mostly #1587), here is what we get on Llama 3 running on a g6 instance on AWS, using the benchmark at https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py (benchmark chart omitted).
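
As a rough illustration of the kind of measurement behind that claim (a minimal sketch, not the linked benchmark itself; the model and corpus below are placeholders):

import time
from tokenizers import Tokenizer

# Placeholder model: any tokenizer.json-compatible model works here; Llama 3 itself is gated.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
docs = ["The quick brown fox jumps over the lazy dog."] * 10_000

start = time.perf_counter()
encodings = tokenizer.encode_batch(docs)  # batch encoding runs in parallel in the Rust core
elapsed = time.perf_counter() - start

total_tokens = sum(len(e.ids) for e in encodings)
print(f"{total_tokens / elapsed / 1e6:.2f} M tokens/s")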

Python API

We shipped better deserialization errors in general, and support for __str__ and __repr__ for all objects. This makes debugging a lot easier; see this:

>>> from tokenizers import Tokenizer;
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased");
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))

The pre_tokenizer.Sequence and normalizer.Sequence are also more accessible now:

from tokenizers import normalizers

norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]                    # index into the sequence: returns the Strip normalizer
norm[1].lowercase = False  # tweak the BertNormalizer's options through the index
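
For example (a small sketch that is not part of the release notes, and which assumes the indexed edit takes effect on the parent sequence as the snippet above implies), the effect can be checked with normalize_str:

from tokenizers import normalizers

norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
print(norm.normalize_str("  Héllo World  "))  # BertNormalizer lowercases (and strips accents) by default
norm[1].lowercase = False                     # disable lowercasing via the new index access
print(norm.normalize_str("  Héllo World  "))  # case should now be preserved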

What's Changed

... (truncated)

Commits
  • a5adaac version 0.20.0
  • a8def07 Merge branch 'fix_release' of github.com:huggingface/tokenizers into branch_v...
  • fe50673 Fix CI
  • b253835 push cargo
  • fc3bb76 update dependencies
  • bfd9cde Perf improvement 16% by removing offsets. (#1587)
  • bd27fa5 add deserialize for pre tokenizers (#1603)
  • 56c9c70 Tests + Deserialization improvement for normalizers. (#1604)
  • 49dafd7 Fix strip python type (#1602)
  • bded212 Support None to reset pre_tokenizers and normalizers, and index sequences (...
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore <dependency name> major version will close this group update PR and stop Dependabot creating any more for the specific dependency's major version (unless you unignore this specific dependency's major version or upgrade to it yourself)
  • @dependabot ignore <dependency name> minor version will close this group update PR and stop Dependabot creating any more for the specific dependency's minor version (unless you unignore this specific dependency's minor version or upgrade to it yourself)
  • @dependabot ignore <dependency name> will close this group update PR and stop Dependabot creating any more for the specific dependency (unless you unignore this specific dependency or upgrade to it yourself)
  • @dependabot unignore <dependency name> will remove all of the ignore conditions of the specified dependency
  • @dependabot unignore <dependency name> <ignore condition> will remove the ignore condition of the specified dependency and ignore conditions

Bumps the pytorch group in /pytorch with 1 update: [tokenizers](https://github.com/huggingface/tokenizers).


Updates `tokenizers` from 0.19.1 to 0.20.0
- [Release notes](https://github.com/huggingface/tokenizers/releases)
- [Changelog](https://github.com/huggingface/tokenizers/blob/main/RELEASE.md)
- [Commits](huggingface/tokenizers@v0.19.1...v0.20.0)

---
updated-dependencies:
- dependency-name: tokenizers
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: pytorch
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot added dependencies Pull requests that update a dependency file python Pull requests that update Python code labels Sep 16, 2024

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

OpenSSF Scorecard

| Package | Version | Score |
|---|---|---|
| pip/tokenizers | 0.20.0 | 🟢 5 |

Details

| Check | Score | Reason |
|---|---|---|
| Code-Review | 🟢 8 | Found 24/27 approved changesets -- score normalized to 8 |
| Maintained | 🟢 10 | 30 commit(s) and 19 issue activity found in the last 90 days -- score normalized to 10 |
| CII-Best-Practices | ⚠️ 0 | no effort to earn an OpenSSF best practices badge detected |
| License | 🟢 10 | license file detected |
| Signed-Releases | ⚠️ -1 | no releases found |
| Branch-Protection | ⚠️ -1 | internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration |
| Binary-Artifacts | 🟢 10 | no binaries found in the repo |
| Dangerous-Workflow | 🟢 10 | no dangerous workflow patterns detected |
| Token-Permissions | ⚠️ 0 | detected GitHub workflow tokens with excessive permissions |
| Security-Policy | ⚠️ 0 | security policy file not detected |
| Fuzzing | ⚠️ 0 | project is not fuzzed |
| Pinned-Dependencies | ⚠️ 0 | dependency not pinned by hash detected -- score normalized to 0 |
| Packaging | 🟢 10 | packaging workflow detected |
| SAST | ⚠️ 0 | SAST tool is not run on all commits -- score normalized to 0 |
| Vulnerabilities | ⚠️ 0 | 12 existing vulnerabilities detected |

Scanned Manifest Files

pytorch/hf-genai-requirements.txt
  • tokenizers@0.20.0
  • tokenizers@0.19.1
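
If the bump lands, a quick way to confirm which version actually ends up installed in the environment (a generic sanity check, not part of this PR):

import tokenizers

# Expected to print 0.20.0 once the updated pin in
# pytorch/hf-genai-requirements.txt has been installed.
print(tokenizers.__version__)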

Contributor Author

dependabot bot commented on behalf of github Sep 23, 2024

Looks like tokenizers is updatable in another way, so this is no longer needed.

@dependabot dependabot bot closed this Sep 23, 2024
@dependabot dependabot bot deleted the dependabot/pip/pytorch/pytorch-3b2d776fe8 branch September 23, 2024 13:36