Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert Tokenizers By Default #580
Convert Tokenizers By Default #580
Changes from 7 commits
6d72fe0
ffcae81
eadb210
8a22002
40f227d
d9e3a0f
52e24b5
fd1887f
92d42f4
feb12dd
86a1eca
9041fe5
277225c
3dcaf9e
241d265
4d3df41
efd6638
eb05594
4ed432d
3710884
a5ef7b1
cb2b26f
80d4c1d
85c925a
f4b3301
934ea22
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shoudl remove :
- @unittest.skipIf(not is_openvino_tokenizers_available(), reason="OpenVINO Tokenizers not available")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed and add logic checks for "Test openvino-nightly" stage, where the libraries are not compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also check whether no error was raised here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With
optimum-cli export openvino --model hf-internal-testing/tiny-random-t5 --task text2text-generation ov_model
I'm getting the following error :
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to have this fixed before making the tokenizer export default / merging this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not a bug, we don't support the Unigram model from
tokenizer.json
yet, only from sentencepiece model file. If you use a repository that has such a file, the tokenizer will be converted.I added one more check for the vocab file here: openvinotoolkit/openvino_tokenizers#116
This will transform
into:
The issue is that the original tokenizer has info about

.model
file in the tokenizer object, but it does not have it in practice:I would argue that this is an edge case of a test repo and most tokenizers with info about the
.model
file have it, so it should not block the merge.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's reasonable for not all cases to be supported, but for these cases we should disable the tokenizer export and not throw an error as it won't be expected by users and could be a bit confusing in my opinion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we either disable this warning or disable export for unsupported cases ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning is for any exception that appeared during conversion. After seeing this message, the user can create an issue for a particular model/tokenizer support.
I also prefer to tell the user that the tokenizer is not converted, rather than silently not including it without warning, so the lack of a (de)tokenizer model won't be a surprise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd agree if the export of the tokenizer was an explicit choice of the user (current
--convert-tokenizer
) but as this PR makes it default I think having such warning can be an issue as it makes it look like the export failedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the log level to debug, the message won't be visible by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @apaniukov