
Convert Tokenizers By Default #580

Merged 26 commits on Apr 18, 2024

Commits (26)
6d72fe0
Convert Tokenizers By Default
apaniukov Feb 29, 2024
ffcae81
Add Warning to Deprecated Option
apaniukov Feb 29, 2024
eadb210
Update OV Tokenizers Availability Check
apaniukov Mar 8, 2024
8a22002
Move openvino-tokenizers to openvino dependencies
apaniukov Mar 13, 2024
40f227d
Make Style
apaniukov Mar 13, 2024
d9e3a0f
Merge branch 'main' into del-convert-tokenizer-flag
apaniukov Mar 13, 2024
52e24b5
Change Imports to Absolute
apaniukov Mar 13, 2024
fd1887f
Check openvino-nightly compatibility
apaniukov Mar 14, 2024
92d42f4
Change model skip explanation
apaniukov Mar 14, 2024
feb12dd
Merge branch 'main' into del-convert-tokenizer-flag
apaniukov Mar 14, 2024
86a1eca
Update OV Tokenizers Availability Check
apaniukov Mar 22, 2024
9041fe5
Add Check for OpenVINO Nightly and Archive
apaniukov Mar 25, 2024
277225c
Merge branch 'main' into del-convert-tokenizer-flag
apaniukov Mar 28, 2024
3dcaf9e
Add linux distros compatibility message
apaniukov Mar 28, 2024
241d265
Address Review Comments
apaniukov Mar 28, 2024
4d3df41
Address Review Comments
apaniukov Mar 28, 2024
efd6638
Merge branch 'main' into del-convert-tokenizer-flag
apaniukov Apr 3, 2024
eb05594
Address Review Comments
apaniukov Apr 4, 2024
4ed432d
Fix Style
apaniukov Apr 4, 2024
3710884
Change Warnings to Debug Level
apaniukov Apr 17, 2024
a5ef7b1
Merge branch 'main' into del-convert-tokenizer-flag
apaniukov Apr 17, 2024
cb2b26f
Merge branch 'main' into del-convert-tokenizer-flag
apaniukov Apr 17, 2024
80d4c1d
Fix Tests Debug Message
apaniukov Apr 18, 2024
85c925a
Merge branch 'main' into del-convert-tokenizer-flag
echarlaix Apr 18, 2024
f4b3301
Fix Style
apaniukov Apr 18, 2024
934ea22
Fix Style
apaniukov Apr 18, 2024
12 changes: 10 additions & 2 deletions optimum/commands/export/openvino.py
@@ -115,10 +115,15 @@ def parse_args_openvino(parser: "ArgumentParser"):
"OpenVINO native inference code that expects kv-cache inputs and outputs in the model."
),
)
optional_group.add_argument(
"--disable-convert-tokenizer",
action="store_true",
help="Do not add converted tokenizer and detokenizer OpenVINO models.",
)
optional_group.add_argument(
"--convert-tokenizer",
action="store_true",
help="Add converted tokenizer and detokenizer with OpenVINO Tokenizers",
help="[Deprecated] Add converted tokenizer and detokenizer with OpenVINO Tokenizers.",
)


@@ -189,6 +194,9 @@ def run(self):
quantization_config["group_size"] = 128 if "128" in self.args.weight_format else 64
ov_config = OVConfig(quantization_config=quantization_config)

if self.args.convert_tokenizer:
logger.warning("`--convert-tokenizer` option is deprecated. Tokenizer will be converted by default.")

# TODO : add input shapes
main_export(
model_name_or_path=self.args.model,
@@ -200,6 +208,6 @@ def run(self):
pad_token_id=self.args.pad_token_id,
ov_config=ov_config,
stateful=not self.args.disable_stateful,
convert_tokenizer=self.args.convert_tokenizer,
convert_tokenizer=not self.args.disable_convert_tokenizer,
# **input_shapes,
)
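A minimal, self-contained sketch of how the two flags are intended to interact after this change (a hypothetical standalone parser, not the optimum-cli code): tokenizer conversion is on by default, --disable-convert-tokenizer turns it off, and --convert-tokenizer remains only as a deprecated no-op that triggers a warning.

import argparse
import logging

logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
parser.add_argument("--disable-convert-tokenizer", action="store_true")
parser.add_argument("--convert-tokenizer", action="store_true")

# Example invocation: the user explicitly opts out of tokenizer conversion.
args = parser.parse_args(["--disable-convert-tokenizer"])

if args.convert_tokenizer:
    logger.warning("`--convert-tokenizer` option is deprecated. Tokenizer will be converted by default.")

convert_tokenizer = not args.disable_convert_tokenizer
print(convert_tokenizer)  # False here; True when neither flag is passed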
11 changes: 2 additions & 9 deletions optimum/exporters/openvino/__main__.py
@@ -22,11 +22,10 @@
from optimum.exporters import TasksManager
from optimum.exporters.onnx.base import OnnxConfig
from optimum.exporters.onnx.constants import SDPA_ARCHS_ONNX_EXPORT_NOT_SUPPORTED
from optimum.exporters.openvino.convert import export_from_model, export_tokenizer
from optimum.intel.utils.import_utils import is_transformers_version
from optimum.utils.save_utils import maybe_load_preprocessors

from ...intel.utils.import_utils import is_openvino_tokenizers_available, is_transformers_version
from .convert import export_from_model, export_tokenizer


if TYPE_CHECKING:
from optimum.intel.openvino.configuration import OVConfig
@@ -179,12 +178,6 @@ def main_export(
f"The task could not be automatically inferred as this is available only for models hosted on the Hugging Face Hub. Please provide the argument --task with the relevant task from {', '.join(TasksManager.get_all_tasks())}. Detailed error: {e}"
)

if convert_tokenizer and not is_openvino_tokenizers_available():
logger.warning(
"`convert_tokenizer` requires openvino-tokenizers, please install it with `pip install optimum-intel[openvino-tokenizers]`"
)
convert_tokenizer = False

do_gptq_patching = False
custom_architecture = False
loading_kwargs = {}
46 changes: 34 additions & 12 deletions optimum/intel/utils/import_utils.py
@@ -94,21 +94,43 @@
_openvino_tokenizers_available = False

if _openvino_tokenizers_available and _openvino_tokenizers_version != "N/A":
_compatible_openvino_version = next(
(
requirement.split("==")[-1]
for requirement in importlib_metadata.requires("openvino-tokenizers")
if requirement.startswith("openvino==")
),
"",
)
_openvino_tokenizers_available = _compatible_openvino_version == ov_major_version
_is_ovt_dev_version = "dev" in _openvino_tokenizers_version
_ov_version = importlib_metadata.version("openvino")
_is_ov_dev_version = "dev" in _ov_version
if _is_ovt_dev_version:
_compatible_openvino_major_version, _, _dev_date = _openvino_tokenizers_version.rsplit(".", 2)
_compatible_ov_version = _compatible_openvino_major_version + "." + _dev_date
_compatible_ovt_version = _ov_version.replace("dev", "0.dev")
else:
_compatible_ov_version = _openvino_tokenizers_version.rsplit(".", 1)[0]
_compatible_ovt_version = _ov_version + ".0"

_openvino_tokenizers_available = _ov_version == _compatible_ov_version

if not _openvino_tokenizers_available:
_update_ov_command = (
f"pip install {'--pre' if _is_ovt_dev_version else ''} -U openvino=={_compatible_ov_version} "
+ (
"--extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly"
if _is_ovt_dev_version
else ""
)
).strip()
_update_ovt_command = (
f"pip install {'--pre' if _is_ov_dev_version else ''} -U openvino-tokenizers=={_compatible_ovt_version} "
+ (
"--extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly"
if _is_ov_dev_version
else ""
)
).strip()
logger.warning(
"OpenVINO Tokenizer version is not compatible with OpenVINO version. "
f"Installed OpenVINO version: {ov_major_version},"
f"OpenVINO Tokenizers requires {_compatible_openvino_version}. "
f"OpenVINO Tokenizers models will not be added during export."
f"Installed OpenVINO version: {_ov_version}, "
f"OpenVINO Tokenizers requires {_compatible_ov_version}. "
"OpenVINO Tokenizers models will not be added during export. "
f"Update OpenVINO with \n{_update_ov_command}\n"
f"Or update OpenVINO Tokenizers with \n{_update_ovt_command}"
)

_nncf_available = importlib.util.find_spec("nncf") is not None
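A small illustration of the compatibility mapping implemented in the hunk above, using hypothetical version strings: a dev openvino-tokenizers build carries an extra `.0` component relative to the matching OpenVINO dev build, while for release builds only the trailing component is dropped.

# Hypothetical version strings; the mapping mirrors the branches in the check above.
ovt_version = "2024.1.0.0.dev20240325"  # assumed dev openvino-tokenizers version
if "dev" in ovt_version:
    major, _, dev_date = ovt_version.rsplit(".", 2)
    compatible_ov_version = major + "." + dev_date  # "2024.1.0.dev20240325"
else:
    compatible_ov_version = ovt_version.rsplit(".", 1)[0]

ov_version = "2024.1.0.dev20240325"  # assumed dev OpenVINO version
compatible_ovt_version = ov_version.replace("dev", "0.dev")  # "2024.1.0.0.dev20240325"

print(compatible_ov_version, compatible_ovt_version)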
3 changes: 1 addition & 2 deletions setup.py
@@ -41,8 +41,7 @@

EXTRAS_REQUIRE = {
"neural-compressor": ["neural-compressor>=2.2.0", "onnxruntime<1.15.0", "accelerate"],
"openvino": ["openvino>=2023.3", "nncf>=2.8.1"],
"openvino-tokenizers": ["openvino-tokenizers[transformers]"],
"openvino": ["openvino>=2023.3", "nncf>=2.8.1", "openvino-tokenizers[transformers]"],
"nncf": ["nncf>=2.8.1"],
"ipex": ["intel-extension-for-pytorch"],
"diffusers": ["diffusers"],
2 changes: 1 addition & 1 deletion tests/openvino/test_exporters_cli.py
@@ -122,7 +122,7 @@ def test_exporters_cli(self, task: str, model_type: str):
def test_exporters_cli_tokenizers(self, task: str, model_type: str):
Collaborator:

We should remove:

- @unittest.skipIf(not is_openvino_tokenizers_available(), reason="OpenVINO Tokenizers not available")

Contributor Author:

Removed it and added logic checks for the "Test openvino-nightly" stage, where the libraries are not compatible.
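A hedged sketch (not the PR's actual test code) of the kind of in-test check described here: run the CLI export unconditionally, then only assert that a tokenizer IR was produced when openvino-tokenizers is available and compatible with the installed OpenVINO. The class name and the exported file name are assumptions.

import unittest
from pathlib import Path

from optimum.intel.utils.import_utils import is_openvino_tokenizers_available


class ExportCLITokenizerCheck(unittest.TestCase):  # hypothetical helper, not the real test class
    def assert_tokenizer_artifacts(self, output_dir: str):
        # "openvino_tokenizer.xml" is the assumed name of the exported tokenizer IR.
        exported_xml = [path.name for path in Path(output_dir).glob("*.xml")]
        if is_openvino_tokenizers_available():
            self.assertIn("openvino_tokenizer.xml", exported_xml)
        else:
            # On the nightly stage the two packages may be incompatible, so no tokenizer IR is expected.
            self.assertNotIn("openvino_tokenizer.xml", exported_xml)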

with TemporaryDirectory() as tmpdir:
output = subprocess.check_output(
Collaborator:

We should also check that no error was raised here.

Collaborator:

With

optimum-cli export openvino --model hf-internal-testing/tiny-random-t5 --task text2text-generation ov_model

I'm getting the following error:

OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: [Errno 2] No such file or directory: '/tmp/tmprj8zsg44/spiece.model'

Collaborator:

I would like to have this fixed before making the tokenizer export the default / merging this PR.

Contributor Author:

It is not a bug: we don't support the Unigram model from tokenizer.json yet, only from a sentencepiece model file. If you use a repository that has such a file, the tokenizer will be converted.

I added one more check for the vocab file here: openvinotoolkit/openvino_tokenizers#116
This will transform

OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: [Errno 2] No such file or directory: '/tmp/tmprj8zsg44/spiece.model'

into:

OpenVINO Tokenizer export for T5TokenizerFast is not supported. Exception: Cannot convert tokenizer of this type without `.model` file.

The issue is that the original tokenizer has info about the .model file in the tokenizer object, but the file is not actually present in the repository (screenshot omitted).

I would argue that this is an edge case of a test repo and most tokenizers with info about the .model file have it, so it should not block the merge.
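A hedged illustration of the limitation described above: a Unigram/T5-style tokenizer can currently be converted only when the sentencepiece `.model` file (e.g. spiece.model) is actually present in the repository; the repository id below is hypothetical.

from huggingface_hub import list_repo_files

repo_id = "some-org/some-t5-model"  # hypothetical repository
repo_files = list_repo_files(repo_id)
has_sentencepiece_model = any(name.endswith(".model") for name in repo_files)
print("sentencepiece .model file present:", has_sentencepiece_model)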

Collaborator:

I think it's reasonable for not all cases to be supported, but in those cases we should disable the tokenizer export and not throw an error, as it won't be expected by users and could be a bit confusing in my opinion.

Collaborator:

Can we either disable this warning or disable the export for unsupported cases?

Contributor Author:

This warning covers any exception that appears during conversion. After seeing this message, the user can create an issue requesting support for a particular model/tokenizer.
I also prefer to tell the user that the tokenizer was not converted, rather than silently omitting it, so the lack of a (de)tokenizer model won't be a surprise.

Collaborator:

I'd agree if the export of the tokenizer were an explicit choice of the user (the current --convert-tokenizer), but as this PR makes it the default, I think such a warning can be an issue, as it makes it look like the export failed.

Contributor Author:

I changed the log level to debug, so the message won't be visible by default.
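A minimal sketch of the pattern this refers to (not the exact PR code): attempt the tokenizer conversion and, on any exception, log at debug level so the default export output stays clean; `convert_fn` is a hypothetical stand-in for the real export helper.

import logging

logger = logging.getLogger(__name__)


def maybe_export_tokenizer(convert_fn, tokenizer, output_dir):
    # convert_fn is a placeholder for the actual tokenizer export function.
    try:
        convert_fn(tokenizer, output_dir)
    except Exception as exception:
        logger.debug(
            f"OpenVINO Tokenizer export for {type(tokenizer).__name__} is not supported. "
            f"Exception: {exception}"
        )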

Collaborator:

Thanks @apaniukov

f"optimum-cli export openvino --model {MODEL_NAMES[model_type]} --convert-tokenizer --task {task} {tmpdir}",
f"optimum-cli export openvino --model {MODEL_NAMES[model_type]} --task {task} {tmpdir}",
shell=True,
stderr=subprocess.STDOUT,
).decode()