Convert Tokenizers By Default #580
@@ -66,7 +66,7 @@ class OVCLIExportTestCase(unittest.TestCase):
     )
     EXPECTED_NUMBER_OF_TOKENIZER_MODELS = {
         "gpt2": 2,
-        "t5": 0,  # failed internal sentencepiece check - no <s> token in the vocab
+        "t5": 0,  # no .model file in the repository
         "albert": 0,  # not supported yet
         "distilbert": 1,  # no detokenizer
         "roberta": 2,
@@ -125,26 +125,26 @@ def test_exporters_cli(self, task: str, model_type: str):
         for arch in SUPPORTED_ARCHITECTURES
         if not arch[0].endswith("-with-past") and not arch[1].endswith("-refiner")
     )
-    @unittest.skipIf(not is_openvino_tokenizers_available(), reason="OpenVINO Tokenizers not available")
     def test_exporters_cli_tokenizers(self, task: str, model_type: str):
[Review thread]
Reviewer: We should remove:

    @unittest.skipIf(not is_openvino_tokenizers_available(), reason="OpenVINO Tokenizers not available")

apaniukov: Removed, and added logic checks for the "Test openvino-nightly" stage, where the libraries are not compatible.
         with TemporaryDirectory() as tmpdir:
             output = subprocess.check_output(
[Review thread]
Reviewer: We should also check whether no error was raised here.
Reviewer: With `optimum-cli export openvino --model hf-internal-testing/tiny-random-t5 --task text2text-generation ov_model` I'm getting the following error: [error output omitted]

[Review thread]
Reviewer: I would like to have this fixed before making the tokenizer export default / merging this PR.
apaniukov: It is not a bug: we don't support the Unigram model when the original `.model` file is missing. I added one more check for the vocab file here: openvinotoolkit/openvino_tokenizers#116. It changes the error message from:

    OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: [Errno 2] No such file or directory: '/tmp/tmprj8zsg44/spiece.model'

into:

    OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: Cannot convert tokenizer of this type without `.model` file.

The issue is that the original tokenizer has info about a sentencepiece `.model` file, but the file itself is not in the repository. I would argue that this is an edge case of a test repo, and most tokenizers that reference a `.model` file also ship it.
Reviewer: I think it's reasonable for not all cases to be supported, but for these cases we should disable the tokenizer export and not throw an error, as it won't be expected by users and could be a bit confusing in my opinion.
Reviewer: Can we either disable this warning or disable export for unsupported cases?
apaniukov: This warning is for any exception that appeared during conversion. After seeing this message, the user can create an issue for a particular model/tokenizer support.
Reviewer: I'd agree if the export of the tokenizer were an explicit choice of the user (the current opt-in behaviour), but this PR makes it the default.
apaniukov: I changed the log level to debug, so the message won't be visible by default.
Reviewer: Thanks @apaniukov.
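The flow agreed on in this thread (convert by default, keep failures quiet) can be sketched roughly as below. This is an illustrative sketch only, not the optimum-intel implementation; the helper name `maybe_convert_tokenizer` and the output file names are assumptions.

```python
# Illustrative sketch (assumed helper, not the actual optimum-intel code): attempt
# tokenizer conversion by default and log failures at DEBUG level, so unsupported
# tokenizers do not surface a warning to every user.
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def maybe_convert_tokenizer(hf_tokenizer, output_dir: str) -> None:
    """Try to export an OpenVINO tokenizer/detokenizer pair; skip unsupported cases quietly."""
    try:
        import openvino as ov
        from openvino_tokenizers import convert_tokenizer

        # with_detokenizer=True makes convert_tokenizer return a (tokenizer, detokenizer) pair
        ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True)
        ov.save_model(ov_tokenizer, str(Path(output_dir) / "openvino_tokenizer.xml"))
        ov.save_model(ov_detokenizer, str(Path(output_dir) / "openvino_detokenizer.xml"))
    except Exception as exception:
        # e.g. a Unigram tokenizer whose .model file is missing from the repository
        logger.debug(f"OpenVINO Tokenizer export is not supported: {exception}")
```

The real export path also falls back to converting only the tokenizer when the detokenizer is unsupported, which is what the "Detokenizer is not supported, convert tokenizer only." message checked in the test below refers to.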
-                f"optimum-cli export openvino --model {MODEL_NAMES[model_type]} --convert-tokenizer --task {task} {tmpdir}",
+                f"optimum-cli export openvino --model {MODEL_NAMES[model_type]} --task {task} {tmpdir}",
                 shell=True,
                 stderr=subprocess.STDOUT,
             ).decode()
-            save_dir = Path(tmpdir)
-            number_of_tokenizers = sum("tokenizer" in file for file in map(str, save_dir.rglob("*.xml")))
-            self.assertEqual(
-                self.EXPECTED_NUMBER_OF_TOKENIZER_MODELS[model_type],
-                number_of_tokenizers,
-                f"OVT: {is_openvino_tokenizers_available() }",
-            )
+            if not is_openvino_tokenizers_available():
+                self.assertTrue(
+                    "OpenVINO Tokenizers is not available." in output
+                    or "OpenVINO and OpenVINO Tokenizers versions are not binary compatible." in output,
+                    msg=output,
+                )
+                return
[Review thread on lines +135 to +141]
Reviewer: I would prefer to have it removed to make sure this is always tested (we should always have a compatible version of OpenVINO Tokenizers installed).
apaniukov: This is added because of the "Test openvino-nightly" stage: it deletes openvino and installs openvino-nightly, which should be incompatible with the installed tokenizers version, replicating the incompatibility scenario.
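For context, the availability check that branch relies on can be thought of along the lines of the sketch below. This is an assumption about the mechanism, not the actual `is_openvino_tokenizers_available()` implementation in optimum-intel.

```python
# Rough sketch (assumed, not the actual optimum-intel helper): report OpenVINO
# Tokenizers as unusable when the package is missing or cannot be loaded against
# the installed OpenVINO runtime, as in the "Test openvino-nightly" CI stage.
import logging

logger = logging.getLogger(__name__)


def openvino_tokenizers_usable() -> bool:
    try:
        import openvino_tokenizers  # noqa: F401  # may fail if the binaries are incompatible
    except Exception as exception:
        logger.debug(f"OpenVINO Tokenizers cannot be used: {exception}")
        return False
    return True
```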
+
+            number_of_tokenizers = sum("tokenizer" in file for file in map(str, Path(tmpdir).rglob("*.xml")))
+            self.assertEqual(self.EXPECTED_NUMBER_OF_TOKENIZER_MODELS[model_type], number_of_tokenizers, output)
+
+            if number_of_tokenizers == 1:
+                self.assertTrue("Detokenizer is not supported, convert tokenizer only." in output, output)
+            elif number_of_tokenizers == 0 and task not in ("image-classification", "audio-classification"):
+                self.assertTrue(("OpenVINO Tokenizer export for" in output and "is not supported." in output), output)

     @parameterized.expand(SUPPORTED_ARCHITECTURES)
     def test_exporters_cli_fp16(self, task: str, model_type: str):
[Review thread]
Reviewer: The tokenizer export results in additional files being included, which could be confusing for users. Can this be moved to an openvino_tokenizers directory?
apaniukov: This is similar to the original tokenizer, so we replicate an existing pattern here.
Reviewer: I'm thinking about something similar to what's done for SD models. Is this something that sounds reasonable to you? I'm fine with adding it in a following PR if you prefer not including it in this one.
apaniukov: Yes, I can do that in a separate PR.
Reviewer: Thanks a lot @apaniukov!
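A possible shape for that follow-up (moving the converted tokenizer IRs into a dedicated subfolder, similar to the per-component folders used for Stable Diffusion exports) is sketched below. The folder name `openvino_tokenizer` and the helper are hypothetical, not part of this PR.

```python
# Hypothetical sketch of the discussed follow-up: save converted tokenizer models
# under a dedicated subfolder instead of next to the main model files.
from pathlib import Path

import openvino as ov


def save_tokenizer_models(ov_models: dict, output_dir: str, subfolder: str = "openvino_tokenizer") -> None:
    """Save converted tokenizer/detokenizer IRs under `<output_dir>/<subfolder>/`."""
    target_dir = Path(output_dir) / subfolder
    target_dir.mkdir(parents=True, exist_ok=True)
    for file_name, model in ov_models.items():  # e.g. {"openvino_tokenizer.xml": <ov.Model>, ...}
        ov.save_model(model, str(target_dir / file_name))
```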