Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add openvino tokenizers directory #707

Closed
wants to merge 2 commits into from
Closed

Add openvino tokenizers directory #707

wants to merge 2 commits into from

Conversation

echarlaix
Copy link
Collaborator

@echarlaix echarlaix commented May 14, 2024

As discussed in #580 (comment) when the openvino export of tokenizer was made default in the CLI, it was agreed that we should have them in a openvino_tokenizer directory

@@ -367,7 +368,7 @@ class StoreAttr(object):

if tokenizer is not None:
try:
export_tokenizer(tokenizer, output)
export_tokenizer(tokenizer, output / OV_TOKENIZER_FOLDER)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, don't do it, it looks very weird <root>/openvino_tokenizers/openvino_tokenizers.xml

Why after exporting from native HF to OpenVINO models / files, the directory layout should be changed?
We want to keep directory layout as is, but just replace original model files with OpenVINO exported ones.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already discussed when the openvino tokenizer export was made default and was part of the reason why this modifications was accepted.

Currently for each tokenizer the following files are being added :
openvino_detokenizer.bin
openvino_detokenizer.xml
openvino_tokenizer.bin
openvino_tokenizer.xml

and we think this is cleaner to have it in a directory, as also done for SD models (for example : https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/tokenizer)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already discussed here and here

Copy link
Collaborator Author

@echarlaix echarlaix May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will merge it once the tests are passing as it's been a while this was supposed to be added, I'm OK to remove it in a following PR but we would need to also remove the tokenizer export being default as well

Copy link
Contributor

@apaniukov apaniukov May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already discussed when the openvino tokenizer export was made default and was part of the reason why this modifications was accepted.

Currently for each tokenizer the following files are being added : openvino_detokenizer.bin openvino_detokenizer.xml openvino_tokenizer.bin openvino_tokenizer.xml

and we think this is cleaner to have it in a directory, as also done for SD models (for example : https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/tokenizer)

Yes, but the original tokenizer also has 3-5 files for fast and legacy versions in the root:
tokenizer.json
special_tokens_map.json
added_tokens.json (not always)
tokenizer_config.json
tokenizer.model (for sentencepiece-based tokenizers)

When there is tokenizer folder in the original model repo, we already using it for openvino tokenizer as well:

sd/
├── feature_extractor
│   └── preprocessor_config.json
├── model_index.json
├── scheduler
│   └── scheduler_config.json
├── text_encoder
│   ├── config.json
│   ├── openvino_model.bin
│   └── openvino_model.xml
├── tokenizer
│   ├── merges.txt
│   ├── openvino_detokenizer.bin
│   ├── openvino_detokenizer.xml
│   ├── openvino_tokenizer.bin
│   ├── openvino_tokenizer.xml
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── unet
│   ├── config.json
│   ├── openvino_model.bin
│   └── openvino_model.xml
├── vae_decoder

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@echarlaix echarlaix closed this May 15, 2024
@echarlaix echarlaix reopened this May 16, 2024
@echarlaix echarlaix closed this May 16, 2024
@echarlaix echarlaix deleted the ov-tokenizers branch May 29, 2024 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants