
Commit c1bd7f7

Merge branch 'huggingface:main' into varlen
2 parents: 7e20b86 + 87c431c

22 files changed: +1308 -239 lines

.github/workflows/test_openvino.yml (+5)

@@ -50,6 +50,11 @@ jobs:
         name: Install specific dependencies and versions required for older transformers
         run: |
           pip install transformers==${{ matrix.transformers-version }} accelerate==0.* peft==0.13.* diffusers==0.30.* transformers_stream_generator
+
+      - if: ${{ matrix.transformers-version == 'latest' && matrix.test-pattern == '*modeling*'}}
+        name: Install auto-gptq, autoawq
+        run: |
+          pip install auto-gptq autoawq --extra-index-url https://download.pytorch.org/whl/cpu
 
       - if: ${{ matrix.test-pattern == '*modeling*' }}
         name: Uninstall NNCF

.github/workflows/test_openvino_full.yml (+5)

@@ -78,6 +78,11 @@ jobs:
        if: ${{ matrix.transformers-version != 'latest' }}
        run: pip install transformers==${{ matrix.transformers-version }}
 
+      - if: ${{ matrix.transformers-version == 'latest' && matrix.os != 'windows-2019' }}
+        name: Install auto-gptq, autoawq
+        run: |
+          pip install auto-gptq autoawq --extra-index-url https://download.pytorch.org/whl/cpu
+
      - name: Pip freeze
        run: pip freeze

.github/workflows/test_openvino_slow.yml (+5)

@@ -49,6 +49,11 @@ jobs:
        name: Install specific dependencies and versions required for older transformers
        run: pip install transformers==${{ matrix.transformers-version }} accelerate==0.* peft==0.13.* diffusers==0.30.* transformers_stream_generator
 
+      - if: ${{ matrix.transformers-version == 'latest' && matrix.os != 'windows-2019' }}
+        name: Install auto-gptq, autoawq
+        run: |
+          pip install auto-gptq autoawq --extra-index-url https://download.pytorch.org/whl/cpu
+
      - name: Pip freeze
        run: pip freeze

docs/source/openvino/export.mdx (+17 -3)

@@ -31,13 +31,14 @@ Check out the help for more options:
 
 ```text
 usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code]
-                                   [--weight-format {fp32,fp16,int8,int4,mxfp4,nf4}]
+                                   [--weight-format {fp32,fp16,int8,int4,mxfp4,nf4}] [--quant-mode {int8}]
                                    [--library {transformers,diffusers,timm,sentence_transformers,open_clip}]
                                    [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
                                    [--group-size GROUP_SIZE] [--backup-precision {none,int8_sym,int8_asym}]
                                    [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--gptq]
                                    [--lora-correction] [--sensitivity-metric SENSITIVITY_METRIC]
                                    [--num-samples NUM_SAMPLES] [--disable-stateful] [--disable-convert-tokenizer]
+                                   [--smooth-quant-alpha SMOOTH_QUANT_ALPHA]
                                    output
 
 optional arguments:

@@ -66,6 +67,10 @@ Optional arguments:
                         on your local machine arbitrary code present in the model repository.
   --weight-format {fp32,fp16,int8,int4,mxfp4,nf4}
                         The weight format of the exported model.
+  --quant-mode {int8}
+                        Quantization precision mode. This is used for applying full model quantization including
+                        activations. The only currently supported choice is 'int8' for int8 quantization of both
+                        weights and activations.
   --library {transformers,diffusers,timm,sentence_transformers,open_clip}
                         The library used to load the model before export. If not provided, will attempt to infer the
                         local checkpoint's library

@@ -102,8 +107,8 @@ Optional arguments:
                         weight compression is applied, they are compressed to INT8.
   --awq                 Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs, but
                         requires additional time for tuning weights on a calibration dataset. To run AWQ, please also
-                        provide a dataset argument. Note: it is possible that there will be no matching patterns in the
-                        model to apply AWQ, in such case it will be skipped.
+                        provide a dataset argument. Note: it is possible that there will be no matching patterns in
+                        the model to apply AWQ, in such case it will be skipped.
   --scale-estimation    Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between
                         the original and compressed layers. Providing a dataset is required to run scale estimation.
                         Please note, that applying scale estimation takes additional memory and time.

@@ -128,6 +133,9 @@ Optional arguments:
                         OpenVINO native inference code that expects KV-cache inputs and outputs in the model.
   --disable-convert-tokenizer
                         Do not add converted tokenizer and detokenizer OpenVINO models.
+  --smooth-quant-alpha SMOOTH_QUANT_ALPHA
+                        SmoothQuant alpha parameter that improves the distribution of activations before MatMul layers
+                        and reduces quantization error. Valid only when activations quantization is enabled.
 ```
 
 You can also apply fp16, 8-bit or 4-bit weight-only quantization on the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to respectively `fp16`, `int8` or `int4`.

@@ -158,6 +166,12 @@ Models larger than 1 billion parameters are exported to the OpenVINO format with
 </Tip>
 
 
+Besides weight-only quantization, you can also apply full model quantization including activations by setting `--quant-mode` to `int8`. This will quantize both weights and activations of Linear, Convolutional and some other layers to int8. Currently this is only supported for speech-to-text models. Please see example below.
+
+```bash
+optimum-cli export openvino -m openai/whisper-large-v3-turbo --quant-mode int8 --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-large-v3-turbo
+```
+
 ### Decoder models
 
 For models with a decoder, we enable the re-use of past keys and values by default. This allows to avoid recomputing the same intermediate activations at each generation step. To export the model without, you will need to remove the `-with-past` suffix when specifying the task.
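
Note: for readers who prefer the Python API over the CLI example above, the sketch below mirrors what the new `--quant-mode int8` path in `optimum/commands/export/openvino.py` assembles (the same quantization dictionary wrapped in an `OVConfig` and applied through `OVModelForSpeechSeq2Seq`). It is a minimal sketch based on this commit's CLI code; passing the config to `from_pretrained` this way is an assumption, and the accepted keyword arguments may differ between optimum-intel releases.

```python
# Sketch of a Python-API equivalent of the docs example:
#   optimum-cli export openvino -m openai/whisper-large-v3-turbo --quant-mode int8 \
#       --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-large-v3-turbo
from optimum.intel import OVConfig, OVModelForSpeechSeq2Seq

# Same dictionary the CLI builds when --quant-mode int8 is passed
quantization_config = {
    "weight_format": "int8",
    "activation_format": "int8",
    "bits": 8,
    "sym": False,
    "dataset": "librispeech",
    "num_samples": 32,
    "smooth_quant_alpha": 0.9,
}
ov_config = OVConfig(quantization_config=quantization_config)

# Export and quantize in one step (assumption: from_pretrained accepts the
# parsed quantization_config object, as the CLI forwards it internally)
model = OVModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3-turbo",
    export=True,
    quantization_config=ov_config.quantization_config,
)
model.save_pretrained("whisper-large-v3-turbo")
```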

optimum/commands/export/openvino.py (+66 -5)

@@ -75,6 +75,16 @@ def parse_args_openvino(parser: "ArgumentParser"):
         default=None,
         help="The weight format of the exported model.",
     )
+    optional_group.add_argument(
+        "--quant-mode",
+        type=str,
+        choices=["int8"],
+        default=None,
+        help=(
+            "Quantization precision mode. This is used for applying full model quantization including activations. "
+            "The only currently supported choice is 'int8' for int8 quantization of both weights and activations."
+        ),
+    )
     optional_group.add_argument(
         "--library",
         type=str,

@@ -228,6 +238,15 @@ def parse_args_openvino(parser: "ArgumentParser"):
         action="store_true",
         help="Do not add converted tokenizer and detokenizer OpenVINO models.",
     )
+    optional_group.add_argument(
+        "--smooth-quant-alpha",
+        type=float,
+        default=None,
+        help=(
+            "SmoothQuant alpha parameter that improves the distribution of activations before MatMul layers and "
+            "reduces quantization error. Valid only when activations quantization is enabled."
+        ),
+    )
 
 
 def no_compression_parameter_provided(args):

@@ -252,6 +271,20 @@ def no_compression_parameter_provided(args):
     )
 
 
+def no_quantization_parameter_provided(args):
+    return all(
+        (
+            it is None
+            for it in (
+                args.sym,
+                args.dataset,
+                args.num_samples,
+                args.smooth_quant_alpha,
+            )
+        )
+    )
+
+
 class OVExportCommand(BaseOptimumCLICommand):
     COMMAND = CommandInfo(name="openvino", help="Export PyTorch models to OpenVINO IR.")
 

@@ -291,16 +324,21 @@ def run(self):
         else:
             library_name = self.args.library
 
-        if self.args.weight_format is None:
+        if self.args.weight_format is None and self.args.quant_mode is None:
             ov_config = None
             if not no_compression_parameter_provided(self.args):
                 raise ValueError(
                     "Some compression parameters are provided, but the weight format is not specified. "
                     "Please provide it with --weight-format argument."
                 )
+            if not no_quantization_parameter_provided(self.args):
+                raise ValueError(
+                    "Some quantization parameters are provided, but the quantization mode is not specified. "
+                    "Please provide it with --quant-mode argument."
+                )
         elif self.args.weight_format in {"fp16", "fp32"}:
             ov_config = OVConfig(dtype=self.args.weight_format)
-        else:
+        elif self.args.weight_format is not None:
             # For int4 quantization if no parameter is provided, then use the default config if exists
             if no_compression_parameter_provided(self.args) and self.args.weight_format == "int4":
                 quantization_config = get_default_int4_config(self.args.model)

@@ -326,6 +364,21 @@ def run(self):
             if quantization_config.get("dataset", None) is not None:
                 quantization_config["trust_remote_code"] = self.args.trust_remote_code
             ov_config = OVConfig(quantization_config=quantization_config)
+        else:
+            if self.args.quant_mode != "int8":
+                raise ValueError("Only 'int8' quantization mode is currently supported.")
+
+            quantization_config = {
+                "weight_format": self.args.quant_mode,
+                "activation_format": self.args.quant_mode,
+                "bits": 8,
+                "sym": self.args.sym or False,
+                "dataset": self.args.dataset,
+                "num_samples": self.args.num_samples,
+                "smooth_quant_alpha": self.args.smooth_quant_alpha,
+                "trust_remote_code": self.args.trust_remote_code,
+            }
+            ov_config = OVConfig(quantization_config=quantization_config)
 
         quantization_config = ov_config.quantization_config if ov_config else None
         quantize_with_dataset = quantization_config and getattr(quantization_config, "dataset", None) is not None

@@ -368,17 +421,25 @@ def run(self):
             model.save_pretrained(self.args.output)
             if not self.args.disable_convert_tokenizer:
                 maybe_convert_tokenizers(library_name, self.args.output, model, task=task)
-        elif (task.startswith("text-generation") or task == "image-text-to-text") and quantize_with_dataset:
+        elif (
+            quantize_with_dataset
+            and (task.startswith("text-generation") or task == "automatic-speech-recognition")
+            or (task == "image-text-to-text" and quantization_config is not None)
+        ):
             if task.startswith("text-generation"):
                 from optimum.intel import OVModelForCausalLM
 
                 model_cls = OVModelForCausalLM
-            else:
+            elif task == "image-text-to-text":
                 from optimum.intel import OVModelForVisualCausalLM
 
                 model_cls = OVModelForVisualCausalLM
+            else:
+                from optimum.intel import OVModelForSpeechSeq2Seq
+
+                model_cls = OVModelForSpeechSeq2Seq
 
-            # To quantize a model with a dataset, an instance of a model class is required
+            # In this case, to apply quantization an instance of a model class is required
             model = model_cls.from_pretrained(
                 self.args.model,
                 export=True,
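
As a quick end-to-end check of the new `automatic-speech-recognition` branch, the exported folder can be loaded back with `OVModelForSpeechSeq2Seq` (the class this commit wires into the CLI) and run through a standard transformers pipeline. A minimal sketch: the output path follows the docs example above, and it assumes the processor/tokenizer files were saved next to the model, as the CLI export normally does.

```python
from transformers import AutoProcessor, pipeline
from optimum.intel import OVModelForSpeechSeq2Seq

# Directory produced by the docs example: the int8-quantized OpenVINO Whisper export
model_dir = "./whisper-large-v3-turbo"

model = OVModelForSpeechSeq2Seq.from_pretrained(model_dir)
processor = AutoProcessor.from_pretrained(model_dir)

# OV* models plug into the regular transformers ASR pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
print(asr("sample.wav")["text"])  # "sample.wav" is a placeholder audio file
```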
