Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add openvino tokenizers directory #707

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion optimum/exporters/openvino/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -362,6 +362,7 @@ class StoreAttr(object):

# hide openvino import when using other exporters
from optimum.exporters.openvino.convert import export_tokenizer
from optimum.intel.openvino.utils import OV_TOKENIZER_FOLDER

if convert_tokenizer and is_openvino_tokenizers_available():
if library_name != "diffusers":
Expand All @@ -372,7 +373,7 @@ class StoreAttr(object):

if tokenizer is not None:
try:
export_tokenizer(tokenizer, output)
export_tokenizer(tokenizer, output / OV_TOKENIZER_FOLDER)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, don't do it, it looks very weird <root>/openvino_tokenizers/openvino_tokenizers.xml

Why after exporting from native HF to OpenVINO models / files, the directory layout should be changed?
We want to keep directory layout as is, but just replace original model files with OpenVINO exported ones.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already discussed when the openvino tokenizer export was made default and was part of the reason why this modifications was accepted.

Currently for each tokenizer the following files are being added :
openvino_detokenizer.bin
openvino_detokenizer.xml
openvino_tokenizer.bin
openvino_tokenizer.xml

and we think this is cleaner to have it in a directory, as also done for SD models (for example : https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/tokenizer)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already discussed here and here

Copy link
Collaborator Author

@echarlaix echarlaix May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will merge it once the tests are passing as it's been a while this was supposed to be added, I'm OK to remove it in a following PR but we would need to also remove the tokenizer export being default as well

Copy link
Contributor

@apaniukov apaniukov May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was already discussed when the openvino tokenizer export was made default and was part of the reason why this modifications was accepted.

Currently for each tokenizer the following files are being added : openvino_detokenizer.bin openvino_detokenizer.xml openvino_tokenizer.bin openvino_tokenizer.xml

and we think this is cleaner to have it in a directory, as also done for SD models (for example : https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/tokenizer)

Yes, but the original tokenizer also has 3-5 files for fast and legacy versions in the root:
tokenizer.json
special_tokens_map.json
added_tokens.json (not always)
tokenizer_config.json
tokenizer.model (for sentencepiece-based tokenizers)

When there is tokenizer folder in the original model repo, we already using it for openvino tokenizer as well:

sd/
├── feature_extractor
│   └── preprocessor_config.json
├── model_index.json
├── scheduler
│   └── scheduler_config.json
├── text_encoder
│   ├── config.json
│   ├── openvino_model.bin
│   └── openvino_model.xml
├── tokenizer
│   ├── merges.txt
│   ├── openvino_detokenizer.bin
│   ├── openvino_detokenizer.xml
│   ├── openvino_tokenizer.bin
│   ├── openvino_tokenizer.xml
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── unet
│   ├── config.json
│   ├── openvino_model.bin
│   └── openvino_model.xml
├── vae_decoder

except Exception as exception:
logger.warning(
"Could not load tokenizer using specified model ID or path. OpenVINO tokenizer/detokenizer "
Expand Down
1 change: 1 addition & 0 deletions optimum/intel/openvino/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@
OV_DECODER_NAME = "openvino_decoder_model.xml"
OV_DECODER_WITH_PAST_NAME = "openvino_decoder_with_past_model.xml"

OV_TOKENIZER_FOLDER = "openvino_tokenizer"
OV_TOKENIZER_NAME = "openvino_tokenizer{}.xml"
OV_DETOKENIZER_NAME = "openvino_detokenizer{}.xml"

Expand Down
6 changes: 4 additions & 2 deletions tests/openvino/test_exporters_cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@
OVStableDiffusionPipeline,
OVStableDiffusionXLPipeline,
)
from optimum.intel.openvino.utils import _HEAD_TO_AUTOMODELS
from optimum.intel.openvino.utils import _HEAD_TO_AUTOMODELS, OV_TOKENIZER_FOLDER
from optimum.intel.utils.import_utils import is_openvino_tokenizers_available


Expand Down Expand Up @@ -140,7 +140,9 @@ def test_exporters_cli_tokenizers(self, task: str, model_type: str):
)
return

number_of_tokenizers = sum("tokenizer" in file for file in map(str, Path(tmpdir).rglob("*.xml")))
number_of_tokenizers = sum(
"tokenizer" in file for file in map(str, (Path(tmpdir) / OV_TOKENIZER_FOLDER).rglob("*.xml"))
)
self.assertEqual(self.EXPECTED_NUMBER_OF_TOKENIZER_MODELS[model_type], number_of_tokenizers, output)

if number_of_tokenizers == 1:
Expand Down
Loading