[OV] Move data-driven quantization after model export for text-generation models #721
Conversation
optimum/commands/export/openvino.py
Outdated
# TODO: set save_directory=self.args.output once OV is updated to 2024.3
quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))
with tempfile.TemporaryDirectory() as temp_dir:
    import shutil

    model.save_pretrained(temp_dir)
    ov_config.save_pretrained(self.args.output)
    shutil.copy(f"{temp_dir}/openvino_model.xml", f"{self.args.output}/openvino_model.xml")
    shutil.copy(f"{temp_dir}/openvino_model.bin", f"{self.args.output}/openvino_model.bin")
Had to add this workaround because OpenVINO does not currently support saving into the same location where the model is loaded from (ticket 110054). This is expected to be fixed in OV 2024.3.
@eaidova, please take a look
Maybe we can introduce a model_name parameter for the from_pretrained/save_pretrained methods? That would allow having both models in the same dir (at the same time it may be useful for loading a model if the IR was saved by different tools or renamed). Or we can try disabling mmap via ov_config for now (it should help with saving to the same location).
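For illustration, a rough sketch of the mmap idea (assuming "ENABLE_MMAP" is the relevant OpenVINO property and that optimum-intel forwards entries from ov_config when the model is read; the path is a placeholder):

from optimum.intel import OVModelForCausalLM

# Load with memory mapping disabled so the weights are fully read into RAM,
# which would then allow the IR files on disk to be overwritten in place.
model = OVModelForCausalLM.from_pretrained(
    "./exported_model",                    # placeholder path to the exported IR
    ov_config={"ENABLE_MMAP": "NO"},       # assumed to reach the model-reading step
)
model.save_pretrained("./exported_model")  # save back to the same location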
@eaidova thank you for your suggestion! I've replaced saving to a temporary directory with disabling mmap.
@eaidova thank you for your suggestion! I've replaced saving to a temporary directory with disabling mmap.
For some reason, when doing it this way, I observe that a significant amount of additional memory is allocated. The amount roughly equals the model size, which is rather significant. I guess I'll revert these changes for now.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@AlexKoff88 please take a look
optimum/commands/export/openvino.py
Outdated
import shutil

model.save_pretrained(temp_dir)
ov_config.save_pretrained(self.args.output)
@nikita-savelyevv, does this workaround with tmp folder mean that we cannot save the model at the same path but can copy files there? Looks a bit strange.
Yes, I had trouble saving to the same location, but copying to that location works fine.
…or OVModelForCausalLM
…ndling for OVModelForCausalLM" This reverts commit bcc4665.
…model if compression fails
optimum/commands/export/openvino.py
Outdated
if quantize_after_export:
    try:
        from optimum.intel import OVModelForCausalLM, OVQuantizer

        ov_config.dtype = original_dtype_value
        model = OVModelForCausalLM.from_pretrained(
            self.args.output, trust_remote_code=self.args.trust_remote_code
        )
        quantizer = OVQuantizer(model)
        quantization_config.tokenizer = quantization_config.tokenizer or str(self.args.output)
        # TODO: set save_directory=self.args.output once OV is updated to 2024.3
        quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))
        with tempfile.TemporaryDirectory() as temp_dir:
            model.save_pretrained(temp_dir)
            ov_config.save_pretrained(self.args.output)
            shutil.copy(f"{temp_dir}/openvino_model.xml", f"{self.args.output}/openvino_model.xml")
            shutil.copy(f"{temp_dir}/openvino_model.bin", f"{self.args.output}/openvino_model.bin")
    except Exception as e:
        # Delete non-compressed model if compression failed for some reason
        shutil.rmtree(str(self.args.output))
        raise e
Why not export + apply quantization using OVModelForCausalLM directly (and not call main_export for this specific case)?
If possible then it could actually make sense to do this for all models (as it's already the case for SD models)
Why not export + apply quantization using OVModelForCausalLM directly (and not call main_export for this specific case)?

Thanks for the suggestion! This is indeed better.

If possible then it could actually make sense to do this for all models (as it's already the case for SD models)

It would be more convenient from the code-maintenance side. But compared to calling main_export() with a provided quantization_config, when we export and quantize through .from_pretrained() there is an additional model-saving step to a temporary directory, which leads to additional overhead, especially for large models. I think we should avoid this if possible.
Looks good, thanks @nikita-savelyevv
optimum/commands/export/openvino.py
Outdated
tokenizer = None
try:
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        self.args.model, trust_remote_code=self.args.trust_remote_code
    )
    tokenizer.save_pretrained(self.args.output)
except Exception:
    logger.warning("Could not save tokenizer")

if tokenizer and not self.args.disable_convert_tokenizer:
    from ...exporters.openvino.convert import export_tokenizer

    export_tokenizer(tokenizer, self.args.output / "tokenizer")
Had to add this logic because otherwise the tokenizer ends up being saved to a temporary directory.
To simplify the code a bit, what do you think about directly calling maybe_load_preprocessors
preprocessors = maybe_load_preprocessors(
and to save the tokenizer we can also use maybe_save_preprocessors
maybe_save_preprocessors(model_name_or_path, output, trust_remote_code=trust_remote_code)
Also, we could create a function out of the following block:
optimum-intel/optimum/exporters/openvino/__main__.py
Lines 364 to 390 in 8dd2a89
from optimum.exporters.openvino.convert import export_tokenizer

if convert_tokenizer and is_openvino_tokenizers_available():
    if library_name != "diffusers":
        tokenizer = next(
            (preprocessor for preprocessor in preprocessors if isinstance(preprocessor, PreTrainedTokenizerBase)),
            None,
        )
        if tokenizer is not None:
            try:
                export_tokenizer(tokenizer, output)
            except Exception as exception:
                logger.warning(
                    "Could not load tokenizer using specified model ID or path. OpenVINO tokenizer/detokenizer "
                    f"models won't be generated. Exception: {exception}"
                )
    else:
        tokenizer = getattr(model, "tokenizer", None)
        if tokenizer is not None:
            export_tokenizer(tokenizer, output / "tokenizer")
        tokenizer_2 = getattr(model, "tokenizer_2", None)
        if tokenizer_2 is not None:
            export_tokenizer(tokenizer_2, output / "tokenizer_2")
elif convert_tokenizer and not is_openvino_tokenizers_available():
    logger.warning("Tokenizer won't be converted.")
Thanks! Done
optimum/commands/export/openvino.py
Outdated
@@ -218,6 +249,10 @@ def run(self):
    "sym": self.args.sym or False,
    "group_size": -1 if is_int8 else self.args.group_size,
    "all_layers": None if is_int8 else self.args.all_layers,
    "dataset": self.args.dataset,
    "num_samples": self.args.num_samples,
    "quant_method": "awq" if self.args.awq else None,
Not sure if it should be QuantizationMethod.AWQ instead of "awq", or if the configuration takes care of this:
awq=config.quant_method == QuantizationMethod.AWQ or None,
I hesitated to do it this way because it would require introducing a dependency on transformers in this file in order to import QuantizationMethod. But now I see that transformers is a general requirement of optimum, so it should be fine I guess.
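A minimal sketch of the resulting change for illustration; the use_awq flag stands in for self.args.awq:

from transformers.utils.quantization_config import QuantizationMethod

use_awq = True
quant_method = QuantizationMethod.AWQ if use_awq else None

# QuantizationMethod is a string enum, so the enum member still compares equal to "awq".
assert quant_method == "awq"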
Looks great, thanks a lot @nikita-savelyevv
What does this PR do?
In order to apply data-driven weights compression, an instance of OVModelForCausalLM is required. However, it is not available during quantization applied at model export (here). That's why this PR adds logic so that this case is handled separately, after the model is exported. This results in some save/load overhead, but compared to the runtime of data-driven weights compression it should be negligible. Worth noting, data-free compression is still applied during export, resulting in no additional overhead.
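For illustration, a minimal sketch of the flow described above, assuming a dataset in the quantization config is what marks compression as data-driven; names mirror the diff shown earlier in this conversation, and the helper itself is illustrative rather than the exact implementation:

from optimum.exporters.openvino import main_export
from optimum.intel import OVConfig, OVModelForCausalLM, OVQuantizer


def export_with_optional_data_driven_compression(model_id, output_dir, quantization_config):
    if getattr(quantization_config, "dataset", None) is None:
        # Data-free compression is applied directly inside the export, no extra overhead.
        main_export(model_id, output_dir, task="text-generation-with-past",
                    ov_config=OVConfig(quantization_config=quantization_config))
        return
    # Data-driven compression: export an uncompressed model first...
    main_export(model_id, output_dir, task="text-generation-with-past")
    # ...then reload it as an OVModelForCausalLM and compress with the calibration dataset.
    model = OVModelForCausalLM.from_pretrained(output_dir)
    quantizer = OVQuantizer(model)
    # The PR itself works around an OpenVINO limitation here (saving to the load location)
    # by saving to a temporary directory and copying the IR files back; see the diff above.
    quantizer.quantize(
        ov_config=OVConfig(quantization_config=quantization_config),
        save_directory=output_dir,
    )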
Before submitting