-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add openvino tokenizers directory #707
Conversation
@@ -367,7 +368,7 @@ class StoreAttr(object): | |||
|
|||
if tokenizer is not None: | |||
try: | |||
export_tokenizer(tokenizer, output) | |||
export_tokenizer(tokenizer, output / OV_TOKENIZER_FOLDER) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, don't do it, it looks very weird <root>/openvino_tokenizers/openvino_tokenizers.xml
Why after exporting from native HF to OpenVINO models / files, the directory layout should be changed?
We want to keep directory layout as is, but just replace original model files with OpenVINO exported ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was already discussed when the openvino tokenizer export was made default and was part of the reason why this modifications was accepted.
Currently for each tokenizer the following files are being added :
openvino_detokenizer.bin
openvino_detokenizer.xml
openvino_tokenizer.bin
openvino_tokenizer.xml
and we think this is cleaner to have it in a directory, as also done for SD models (for example : https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/tokenizer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will merge it once the tests are passing as it's been a while this was supposed to be added, I'm OK to remove it in a following PR but we would need to also remove the tokenizer export being default as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was already discussed when the openvino tokenizer export was made default and was part of the reason why this modifications was accepted.
Currently for each tokenizer the following files are being added :
openvino_detokenizer.bin
openvino_detokenizer.xml
openvino_tokenizer.bin
openvino_tokenizer.xml
and we think this is cleaner to have it in a directory, as also done for SD models (for example : https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/tokenizer)
Yes, but the original tokenizer also has 3-5 files for fast and legacy versions in the root:
tokenizer.json
special_tokens_map.json
added_tokens.json
(not always)
tokenizer_config.json
tokenizer.model
(for sentencepiece-based tokenizers)
When there is tokenizer
folder in the original model repo, we already using it for openvino tokenizer as well:
sd/
├── feature_extractor
│ └── preprocessor_config.json
├── model_index.json
├── scheduler
│ └── scheduler_config.json
├── text_encoder
│ ├── config.json
│ ├── openvino_model.bin
│ └── openvino_model.xml
├── tokenizer
│ ├── merges.txt
│ ├── openvino_detokenizer.bin
│ ├── openvino_detokenizer.xml
│ ├── openvino_tokenizer.bin
│ ├── openvino_tokenizer.xml
│ ├── special_tokens_map.json
│ ├── tokenizer_config.json
│ └── vocab.json
├── unet
│ ├── config.json
│ ├── openvino_model.bin
│ └── openvino_model.xml
├── vae_decoder
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
As discussed in #580 (comment) when the openvino export of tokenizer was made default in the CLI, it was agreed that we should have them in a
openvino_tokenizer
directory