
[OV] Move data-driven quantization after model export for text-generation models #721

Merged: 29 commits into huggingface:main on Jun 6, 2024

Conversation

nikita-savelyevv (Collaborator):

What does this PR do?

In order to apply data-driven weights compression, an instance of OVModelForCausalLM is required. However, it is not available during the quantization that is applied at model export time (here).

That's why this PR adds logic so that this case is handled separately, after the model is exported. This introduces some save/load overhead, but compared to the runtime of data-driven weights compression it should be negligible. Worth noting: data-free compression is still applied during export, resulting in no additional overhead.
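
For illustration, a minimal sketch of the resulting two-step flow (the model ID, output directories, and the 4-bit/dataset settings are placeholder assumptions, not taken from this PR):

from optimum.intel import OVConfig, OVModelForCausalLM, OVQuantizer, OVWeightQuantizationConfig

# Step 1: data-free export to OpenVINO IR (this part still happens during model export).
model = OVModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", export=True)
model.save_pretrained("llama-2-7b-ov")

# Step 2: data-driven weight compression needs the exported OVModelForCausalLM instance,
# so it runs after export on the reloaded model.
model = OVModelForCausalLM.from_pretrained("llama-2-7b-ov")
quantization_config = OVWeightQuantizationConfig(bits=4, dataset="wikitext2")  # assumed settings
quantizer = OVQuantizer(model)
quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))
model.save_pretrained("llama-2-7b-ov-int4")  # saved to a fresh directory; see the save/load discussion below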

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Comment on lines 358 to 366
# TODO: set save_directory=self.args.output once OV is updated to 2024.3
quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))
with tempfile.TemporaryDirectory() as temp_dir:
    import shutil

    model.save_pretrained(temp_dir)
    ov_config.save_pretrained(self.args.output)
    shutil.copy(f"{temp_dir}/openvino_model.xml", f"{self.args.output}/openvino_model.xml")
    shutil.copy(f"{temp_dir}/openvino_model.bin", f"{self.args.output}/openvino_model.bin")
Collaborator Author:

Had to add this workaround because OpenVINO does not currently support saving into the same location where the model is loaded from (ticket 110054). This is expected to be fixed in OV 2024.3.

Collaborator:

@eaidova, please take a look

Collaborator:

Maybe we can introduce a model_name parameter for the from_pretrained/save_pretrained methods? That would allow having both models in the same directory (at the same time it may be useful for loading a model whose IR was saved by different tools or renamed). Or we can try disabling mmap via ov_config for now (it should help with saving to the same location).
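
For context, a rough sketch of the mmap idea using the plain openvino API; the ENABLE_MMAP property name and its effect on allowing files at the load path to be overwritten are assumptions here, not something verified in this PR:

import openvino as ov

core = ov.Core()
# Assumed: disabling memory-mapping of IR weights so the weight file is not kept
# mapped from disk, which should make it possible to save back to the same location.
core.set_property({"ENABLE_MMAP": "NO"})

model = core.read_model("exported_model/openvino_model.xml")
# ... modify / compress the model in memory ...
ov.save_model(model, "exported_model/openvino_model.xml")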

Collaborator Author:

@eaidova thank you for your suggestion! I've replaced saving to a temporary directory with disabling mmap.

Collaborator Author:

@eaidova thank you for your suggestion! I've replaced saving to a temporary directory with disabling mmap.

For some reason, when doing it this way, I observe that a significant amount of additional memory is allocated. The amount roughly equals the model size, which is rather significant. I guess I'll revert these changes for now.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-savelyevv (Collaborator Author):

@AlexKoff88 please take a look

import shutil

model.save_pretrained(temp_dir)
ov_config.save_pretrained(self.args.output)
Collaborator:

@nikita-savelyevv, does this workaround with the tmp folder mean that we cannot save the model to the same path but can copy files there? Looks a bit strange.

Collaborator Author:

Yes, I had trouble saving to the same location, but copying to that location works fine.

Comment on lines 361 to 381
if quantize_after_export:
    try:
        from optimum.intel import OVModelForCausalLM, OVQuantizer

        ov_config.dtype = original_dtype_value
        model = OVModelForCausalLM.from_pretrained(
            self.args.output, trust_remote_code=self.args.trust_remote_code
        )
        quantizer = OVQuantizer(model)
        quantization_config.tokenizer = quantization_config.tokenizer or str(self.args.output)
        # TODO: set save_directory=self.args.output once OV is updated to 2024.3
        quantizer.quantize(ov_config=OVConfig(quantization_config=quantization_config))
        with tempfile.TemporaryDirectory() as temp_dir:
            model.save_pretrained(temp_dir)
            ov_config.save_pretrained(self.args.output)
            shutil.copy(f"{temp_dir}/openvino_model.xml", f"{self.args.output}/openvino_model.xml")
            shutil.copy(f"{temp_dir}/openvino_model.bin", f"{self.args.output}/openvino_model.bin")
    except Exception as e:
        # Delete non-compressed model if compression failed for some reason
        shutil.rmtree(str(self.args.output))
        raise e
Collaborator:

Why not export + apply quantization using OVModelForCausalLM directly (and not call main_export for this specific case)?

Collaborator:

If possible, it could actually make sense to do this for all models (as is already the case for SD models).

Collaborator Author:

Why not export + apply quantization using OVModelForCausalLM directly (and not call main_export for this specific case)?

Thanks for the suggestion! This is indeed better.

If possible, it could actually make sense to do this for all models (as is already the case for SD models).

It would be more convenient from a code maintenance perspective. But compared to calling main_export() with the provided quantization_config, exporting and quantizing through .from_pretrained() adds an extra step of saving the model to a temporary directory, which leads to additional overhead, especially for large models. I think we should avoid this if possible.
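
A rough sketch of the two flows being compared; the exact keyword arguments are assumptions, the point is only where the extra save to a temporary directory happens:

from optimum.exporters.openvino import main_export
from optimum.intel import OVConfig, OVModelForCausalLM

# (a) Export and quantize in one pass via main_export: weights are compressed while the
# IR is written, with no extra round-trip to disk.
main_export(
    model_name_or_path="meta-llama/Llama-2-7b-hf",  # placeholder model ID
    output="llama-2-7b-ov-int4",
    ov_config=OVConfig(quantization_config=quantization_config),  # assumed kwarg
)

# (b) Export and quantize through from_pretrained: the conversion is staged through a
# temporary directory before the model is loaded back, which is the additional overhead
# mentioned above for large models.
model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", export=True, quantization_config=quantization_config
)
model.save_pretrained("llama-2-7b-ov-int4")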

@echarlaix (Collaborator) left a comment:

Looks good, thanks @nikita-savelyevv

@nikita-savelyevv marked this pull request as draft June 4, 2024 12:00
@nikita-savelyevv marked this pull request as ready for review June 4, 2024 16:26
Comment on lines 335 to 349
tokenizer = None
try:
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        self.args.model, trust_remote_code=self.args.trust_remote_code
    )
    tokenizer.save_pretrained(self.args.output)
except Exception:
    logger.warning("Could not save tokenizer")

if tokenizer and not self.args.disable_convert_tokenizer:
    from ...exporters.openvino.convert import export_tokenizer

    export_tokenizer(tokenizer, self.args.output / "tokenizer")
Collaborator Author:

Had to add this logic because otherwise the tokenizer ends up being saved to a temporary directory.

Collaborator:

To simplify the code a bit, what do you think about calling maybe_load_preprocessors directly:

preprocessors = maybe_load_preprocessors(

And to save the tokenizer we can also use maybe_save_preprocessors:

maybe_save_preprocessors(model_name_or_path, output, trust_remote_code=trust_remote_code)

Also, we could create a function:

from optimum.exporters.openvino.convert import export_tokenizer

if convert_tokenizer and is_openvino_tokenizers_available():
    if library_name != "diffusers":
        tokenizer = next(
            (preprocessor for preprocessor in preprocessors if isinstance(preprocessor, PreTrainedTokenizerBase)),
            None,
        )

        if tokenizer is not None:
            try:
                export_tokenizer(tokenizer, output)
            except Exception as exception:
                logger.warning(
                    "Could not load tokenizer using specified model ID or path. OpenVINO tokenizer/detokenizer "
                    f"models won't be generated. Exception: {exception}"
                )
    else:
        tokenizer = getattr(model, "tokenizer", None)
        if tokenizer is not None:
            export_tokenizer(tokenizer, output / "tokenizer")

        tokenizer_2 = getattr(model, "tokenizer_2", None)
        if tokenizer_2 is not None:
            export_tokenizer(tokenizer_2, output / "tokenizer_2")
elif convert_tokenizer and not is_openvino_tokenizers_available():
    logger.warning("Tokenizer won't be converted.")
that could be re-used here

Collaborator Author:

Thanks! Done

@@ -218,6 +249,10 @@ def run(self):
"sym": self.args.sym or False,
"group_size": -1 if is_int8 else self.args.group_size,
"all_layers": None if is_int8 else self.args.all_layers,
"dataset": self.args.dataset,
"num_samples": self.args.num_samples,
"quant_method": "awq" if self.args.awq else None,
Collaborator:

Not sure if it should be QuantizationMethod.AWQ instead of "awq", or if the configuration takes care of this:

awq=config.quant_method == QuantizationMethod.AWQ or None,

@nikita-savelyevv (Collaborator Author), Jun 5, 2024:

I hesitated to do it this way because it would require introducing a dependency on transformers in this file in order to import QuantizationMethod. But now I see that transformers is a general requirement of optimum, so it should be fine.
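
A small sketch of what that change would look like in the config dict built by the CLI (the import path is the one commonly used alongside transformers; self.args is the existing argument namespace):

from transformers.utils.quantization_config import QuantizationMethod

quantization_config = {
    # ... other fields built from the CLI arguments ...
    "quant_method": QuantizationMethod.AWQ if self.args.awq else None,
}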

@echarlaix (Collaborator) left a comment:

Looks great, thanks a lot @nikita-savelyevv

@echarlaix merged commit 6888c0a into huggingface:main on Jun 6, 2024
13 checks passed