By default, the quantization scheme is [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization); to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization), you can add `sym=True`.
For 4-bit quantization, you can also specify the following arguments in the quantization configuration:
* The `group_size` parameter defines the group size to use for quantization; `-1` results in per-column quantization.
* The `ratio` parameter controls the ratio between 4-bit and 8-bit quantization: if set to 0.9, 90% of the layers will be quantized to `int4` while the remaining 10% will be quantized to `int8`.
Smaller `group_size` and `ratio` values usually improve accuracy at the cost of model size and inference latency.
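As a rough sketch of how these options fit together, the snippet below builds a 4-bit weight quantization configuration and applies it when loading a causal language model. The `OVModelForCausalLM` and `OVWeightQuantizationConfig` classes mirror the usage shown later in this document; the model identifier and the exact `group_size` and `ratio` values are purely illustrative.

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Symmetric 4-bit weight quantization with groups of 128 columns,
# quantizing 80% of the layers to int4 and the remaining 20% to int8
quantization_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128, ratio=0.8)

model = OVModelForCausalLM.from_pretrained(
    "gpt2",  # illustrative model identifier
    export=True,
    quantization_config=quantization_config,
)
```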
### Static quantization
When applying post-training static quantization, both the weights and the activations are quantized.
To apply quantization on the activations, an additional calibration step is needed, which consists of feeding a `calibration_dataset` to the network in order to estimate the activation quantization parameters.
Here is how to apply static quantization on a fine-tuned DistilBERT given your own `calibration_dataset`:
```python
from transformers import AutoTokenizer
from optimum.intel import OVQuantizer, OVModelForSequenceClassification
```
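The block above only shows the imports; a minimal sketch of the remaining steps could look as follows. The model identifier and the `save_dir` value are illustrative assumptions, `calibration_dataset` is the dataset you provide, and the `OVQuantizer` / `quantize()` calls follow the description in the next paragraph.

```python
model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
save_dir = "ptq_model"  # directory where the quantized model will be saved

# calibration_dataset is assumed to be a dataset of tokenized samples you prepared
quantizer = OVQuantizer.from_pretrained(model)
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory=save_dir)
tokenizer.save_pretrained(save_dir)
```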
The `quantize()` method applies post-training static quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
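For example, assuming the quantized model was saved to the `save_dir` used in the sketch above, it can be reloaded directly from the exported IR files:

```python
from optimum.intel import OVModelForSequenceClassification

# Loads the exported OpenVINO IR (XML topology + binary weights) produced by quantize()
loaded_model = OVModelForSequenceClassification.from_pretrained("ptq_model")
```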
### Hybrid quantization

Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion (SD) models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of the activations is comparable to that of the weights.
The U-Net component takes up most of the overall execution time of the pipeline. Thus, optimizing just this one component can bring substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial accuracy degradation.
Therefore, the proposal is to apply quantization in *hybrid mode* for the U-Net model and weight-only quantization for the rest of the pipeline components:
* U-Net: quantization applied to both the weights and activations
* Text encoder, VAE encoder / decoder: quantization applied to the weights only
The hybrid mode quantizes the weights of MatMul and Embedding layers and the activations of the other layers, which helps preserve accuracy after optimization while reducing the model size.
The `quantization_config` is used to define the optimization parameters for the SD pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. If the dataset is not defined, weight-only quantization will be applied to all components.
```python
from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig
```
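As a sketch of what this could look like, the snippet below loads a Stable Diffusion pipeline with a quantization dataset set in the configuration, which is what enables the hybrid mode described above. The model identifier and the `conceptual_captions` dataset name are illustrative assumptions.

```python
model_id = "runwayml/stable-diffusion-v1-5"  # illustrative model identifier

# Passing a dataset in the quantization_config triggers hybrid quantization of the U-Net
pipeline = OVStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
)
```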
For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/CompressWeights.md).
## Training-time optimization
Apart from optimizing a model after training, such as the post-training quantization above, `optimum.openvino` also provides optimization methods during training, namely Quantization-Aware Training (QAT) and Joint Pruning, Quantization and Distillation (JPQD).