By default, the quantization scheme is [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization); to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization), you can add `sym=True`.
For 4-bit quantization you can also specify the following arguments in the quantization configuration:

* The `group_size` parameter defines the group size to use for quantization; `-1` results in per-column quantization.
* The `ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, 90% of the layers will be quantized to `int4` while the remaining 10% will be quantized to `int8`.

Smaller `group_size` and `ratio` values usually improve accuracy at the expense of model size and inference latency.
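For illustration, here is a minimal sketch of a 4-bit configuration combining these options (the exact values are placeholders, not recommendations):

```python
from optimum.intel import OVWeightQuantizationConfig

# Illustrative 4-bit setup: symmetric scheme, groups of 128 columns,
# and 80% of the layers in int4 (the remaining 20% fall back to int8)
quantization_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128, ratio=0.8)
```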
### Static quantization
When applying post-training static quantization, both the weights and the activations are quantized.
To apply quantization on the activations, an additional calibration step is needed, which consists of feeding a `calibration_dataset` to the network in order to estimate the activation quantization parameters.
Here is how to apply static quantization on a fine-tuned DistilBERT given your own `calibration_dataset`:
```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForSequenceClassification, OVQuantizer
```
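Building on these imports, a sketch of the full workflow could look as follows (the GLUE/SST-2 calibration data, the number of samples and the save directory are illustrative choices):

```python
from functools import partial

from transformers import AutoTokenizer
from optimum.intel import OVModelForSequenceClassification, OVQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Build a small calibration set used to estimate the activation quantization parameters
quantizer = OVQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)

# Apply post-training static quantization and save the result
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory="distilbert_static_quantized")
```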
The `quantize()` method applies post-training static quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The quantized model can then be run on any target Intel device.
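The quantized model can then be reloaded and used like any other OpenVINO model, for instance (reusing the illustrative save directory from the sketch above):

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

save_dir = "distilbert_static_quantized"  # illustrative directory from the sketch above
model = OVModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("OpenVINO inference on Intel hardware is fast."))
```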
### Hybrid quantization

Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion (SD) models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of the activations is comparable to the weights.
The U-Net component takes up most of the overall execution time of the pipeline. Thus, optimizing just this one component can bring substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial accuracy degradation.
Therefore, the proposal is to apply quantization in *hybrid mode* for the U-Net model and weight-only quantization for the rest of the pipeline components:
* U-Net: quantization applied on both the weights and activations
* The text encoder, VAE encoder / decoder: quantization applied on the weights only
The hybrid mode involves the quantization of the weights in MatMul and Embedding layers, and of the activations of other layers, which helps preserve accuracy after optimization while reducing the model size.
The `quantization_config` is used to define the optimization parameters for the SD pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. If the dataset is not defined, weight-only quantization will be applied to all components.
```python
from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

# Passing a `dataset` in the weight quantization config enables hybrid quantization;
# the model id and dataset name below are illustrative, not prescriptive.
model = OVStableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
)
```
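Once exported, the hybrid-quantized pipeline can be called like the original one; continuing from the example above (the prompt and step count are arbitrary):

```python
# Generate an image with the hybrid-quantized pipeline
image = model("a photo of an astronaut riding a horse on the moon", num_inference_steps=20).images[0]
image.save("astronaut.png")
```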
<Tip warning={true}>
`load_in_8bit` is enabled by default for models larger than 1 billion parameters.
</Tip>
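The behaviour can also be controlled explicitly; a minimal sketch, assuming a causal language model:

```python
from optimum.intel import OVModelForCausalLM

# Explicitly request 8-bit weight quantization (pass load_in_8bit=False to disable it)
model = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=True)
```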
For 4-bit weight quantization, you can use the `quantization_config` to specify the optimization parameters, for example:
```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
```
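For instance, a minimal sketch of loading a causal language model with such a configuration (the model id and the parameter values are placeholders):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "gpt2"  # placeholder: any causal LM supported by optimum-intel works the same way
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8),
)
```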
For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/CompressWeights.md).
## Training-time
Apart from optimizing a model after training, like the post-training quantization above, `optimum.openvino` also provides optimization methods during training, namely Quantization-Aware Training (QAT) and Joint Pruning, Quantization and Distillation (JPQD).
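As a rough sketch of how QAT might be wired in with `OVTrainer` (a drop-in replacement for `transformers.Trainer`; the model, dataset and training arguments below are placeholders, and the default `OVConfig` is assumed to enable 8-bit quantization-aware training):

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, default_data_collator
from optimum.intel import OVConfig, OVTrainer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder model
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Small placeholder dataset, tokenized for the model
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda e: tokenizer(e["sentence"], padding="max_length", max_length=128, truncation=True), batched=True
)

trainer = OVTrainer(
    model=model,
    args=TrainingArguments(output_dir="qat_model", num_train_epochs=1, do_train=True),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"].select(range(100)),
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    ov_config=OVConfig(),        # default config, assumed to enable 8-bit quantization-aware training
    task="text-classification",
)
trainer.train()
trainer.save_model()             # saves the quantized model
```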
    when running `transformers-cli login` (stored in `~/.huggingface`).
model_kwargs (`Optional[Dict[str, Any]]`, defaults to `None`):
    Experimental usage: keyword arguments to pass to the model during the export. This argument should be used along the `custom_export_configs` argument in case, for example, the model inputs/outputs are changed (for example, if `model_kwargs={"output_attentions": True}` is passed).
custom_export_configs (`Optional[Dict[str, OnnxConfig]]`, defaults to `None`):
    Experimental usage: override the default export config used for the given model. This argument may be useful for advanced users that desire a finer-grained control on the export. An example is available [here](https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model).
fn_get_submodels (`Optional[Callable]`, defaults to `None`):
    Experimental usage: Override the default submodels that are used at the export. This is especially useful when exporting a custom architecture that needs to split the ONNX (e.g. encoder-decoder). If unspecified with custom models, optimum will try to use the default submodels used for the given task, with no guarantee of success.
Example usage:
```python
>>> from optimum.exporters.openvino import main_export
```
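A hedged sketch of a fuller invocation, showing how the experimental `model_kwargs` argument documented above might be passed (the model id, output directory and task are placeholders):

```python
from optimum.exporters.openvino import main_export

# Hypothetical invocation: export a causal LM and request attention outputs during the export
main_export(
    "gpt2",                      # placeholder model id
    output="ov_export",
    task="text-generation",
    model_kwargs={"output_attentions": True},
)
```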
f"Asked to export a {model_type} model for the task {task}{autodetected_message}, but the Optimum OpenVINO exporter only supports the tasks {', '.join(model_tasks.keys())} for {model_type}. Please use a supported task. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the task {task} to be supported in the ONNX export for {model_type}."