Note: GPTQ and LoRA Correction algorithms can't be applied simultaneously.
## Static quantization
When applying post-training static quantization, both the weights and the activations are quantized.
To quantize the activations, an additional calibration step is needed, which consists in feeding a `calibration_dataset` to the network in order to estimate the quantization parameters of the activations.
The `quantize()` method applies post-training static quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented by two files: an XML file describing the network topology and a binary file containing the weights. The resulting model can be run on any target Intel device.
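
As an illustration, a static quantization run with `OVQuantizer` typically looks like the sketch below. The model id, dataset, sample count and save directory are illustrative, and the exact argument names may differ slightly across `optimum-intel` versions.

```python
from functools import partial

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel import OVQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_function(examples, tokenizer):
    # Tokenize the raw text so the calibration samples match the model inputs
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

quantizer = OVQuantizer.from_pretrained(model)
# Build a small calibration dataset used to estimate the activation quantization parameters
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)
# Apply static quantization and export the result to OpenVINO IR
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory="ov_static_quantized")
tokenizer.save_pretrained("ov_static_quantized")
```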
### Speech-to-text Models Quantization
The speech-to-text Whisper model can be quantized without preparing a custom calibration dataset. See the example below.
With this approach, the encoder, decoder and decoder-with-past models of the Whisper pipeline are fully quantized, including activations.
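
A minimal sketch of what this can look like, assuming the `OVModelForSpeechSeq2Seq` class and `OVQuantizationConfig` available in recent `optimum-intel` releases; the model id and number of calibration samples are illustrative.

```python
from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig

model_id = "openai/whisper-tiny"
# Export to OpenVINO IR and quantize in one call; a built-in calibration
# dataset is used internally, so no custom dataset has to be prepared.
ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVQuantizationConfig(num_samples=10),
)
ov_model.save_pretrained("whisper-tiny-int8")
```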
## Hybrid quantization
Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion (SD) models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of activations is comparable to weights.
The U-Net component takes up most of the overall execution time of the pipeline. Thus, optimizing just this one component can bring substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial accuracy degradation.
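
As a sketch, hybrid quantization can be requested when loading the pipeline by combining an 8-bit weight quantization config with a calibration dataset, along the lines below; the model id, dataset name and sample count are illustrative and should be adapted to your setup.

```python
from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

model_id = "runwayml/stable-diffusion-v1-5"
# Combining an 8-bit weight quantization config with a calibration dataset
# enables hybrid quantization: weights of the whole pipeline are compressed,
# while U-Net activations are calibrated and quantized on the given dataset.
model = OVStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions", num_samples=200),
)
```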
For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/post_training_compression/weights_compression/Usage.md).
## Training-time
Apart from optimizing a model after training, such as the post-training quantization described above, `optimum.intel` also provides optimization methods applied during training, namely Quantization-Aware Training (QAT) and Joint Pruning, Quantization and Distillation (JPQD).
<Tip warning={true}>
Training-time optimization methods are deprecated and will be removed in optimum-intel v1.22.0.
</Tip>
### Quantization-Aware Training (QAT)
QAT simulates the effects of quantization during training in order to alleviate its impact on the model's accuracy. It is recommended when post-training quantization results in high accuracy degradation. Here is an example of how to fine-tune a DistilBERT model on the sst-2 task while applying quantization-aware training (QAT).
```diff
import evaluate
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    default_data_collator,
)
from datasets import load_dataset
- from transformers import Trainer
+ from optimum.intel import OVConfig, OVTrainer, OVModelForSequenceClassification

# ... (dataset preparation, OVConfig and OVTrainer setup, and trainer.train() not shown)

# Export the quantized model to OpenVINO IR format and save it
trainer.save_model()

# Load the resulting quantized model
- model = AutoModelForSequenceClassification.from_pretrained(save_dir)
+ model = OVModelForSequenceClassification.from_pretrained(save_dir)
```
### Joint Pruning, Quantization and Distillation (JPQD)
Other than quantization, compression methods like pruning and distillation are commonly used to further improve task performance and efficiency. Structured pruning slims a model for lower computational demands, while distillation leverages the knowledge of a teacher model, usually a larger one, to improve model prediction. Combining these methods with quantization can result in an optimized model with significant efficiency improvements while enjoying good task accuracy retention. In `optimum.intel`, `OVTrainer` provides the capability to jointly prune, quantize and distill a model during training. Following is an example of how to perform the optimization on BERT-base for the sst-2 task.
First, we create a config dictionary to specify the target algorithms. Since `optimum.intel` relies on NNCF as the backend, the config format follows the NNCF specification (see [here](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms)). In the example config below, we specify pruning and quantization in a list of compression algorithms with their hyperparameters. The pruning method closely resembles the work of [Lagunas et al., 2021, Block Pruning For Faster Transformers](https://arxiv.org/pdf/2109.04838.pdf), whereas the quantization refers to QAT. With this configuration, the model under optimization is initialized with pruning and quantization operators at the beginning of training.
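
The original example config is not reproduced here; a hedged sketch of what such a dictionary can look like is given below. The field names follow the NNCF movement-sparsity and quantization schemas, but the values are illustrative and the precise format should be checked against the NNCF documentation linked above.

```python
# Illustrative JPQD compression config (values are examples, not recommendations)
compression_config = [
    {
        "algorithm": "movement_sparsity",
        "params": {
            "warmup_start_epoch": 1,
            "warmup_end_epoch": 2,
            "importance_regularization_factor": 0.01,
            "enable_structured_masking": True,
        },
        "sparse_structure_by_scopes": [
            {"mode": "block", "sparse_factors": [32, 32], "target_scopes": "{re}.*BertAttention.*"},
            {"mode": "per_dim", "axis": 0, "target_scopes": "{re}.*BertIntermediate.*"},
            {"mode": "per_dim", "axis": 1, "target_scopes": "{re}.*BertOutput.*"},
        ],
        "ignored_scopes": ["{re}.*NNCFEmbedding", "{re}.*pooler.*", "{re}.*LayerNorm.*"],
    },
    {
        "algorithm": "quantization",
        "weights": {"mode": "symmetric"},
        "activations": {"mode": "symmetric"},
    },
]
```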
> Known limitation: Current structured pruning with movement sparsity only supports the *BERT, Wav2vec2 and Swin* families of models. See [here](https://github.com/openvinotoolkit/nncf/blob/develop/nncf/experimental/torch/sparsity/movement/MovementSparsity.md) for more information.
Once the config is ready, we can develop the training pipeline like the snippet below. Since we are customizing joint compression with the config above, notice that `OVConfig` is initialized with the config dictionary (JSON parsing to a Python dictionary is skipped for brevity). As for distillation, users are required to load the teacher model; it works just like normal model loading with the transformers API. `OVTrainingArguments` extends transformers' `TrainingArguments` with distillation hyperparameters, i.e. distillation weight and temperature, for ease of use. The snippet below shows how we load a teacher model and create training arguments with `OVTrainingArguments`. Subsequently, the teacher model, together with the instantiated `OVConfig` and `OVTrainingArguments`, is fed to `OVTrainer`. Voila! That is all we need; the rest of the pipeline is identical to native transformers training.
```diff
- from transformers import Trainer, TrainingArguments
+ from optimum.intel import OVConfig, OVTrainer, OVTrainingArguments

# ... (teacher model loading, OVConfig and OVTrainer setup not shown)

# Train the model like usual, internally the training is applied with pruning, quantization and distillation
train_result = trainer.train()
metrics = trainer.evaluate()
# Export the quantized model to OpenVINO IR format and save it
trainer.save_model()
```
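
The part of the snippet elided above, loading the teacher and constructing the trainer, typically looks like the following sketch. The identifiers `teacher_model_id`, `save_dir`, `train_dataset`, `eval_dataset` and `compute_metrics` are assumed to be defined as in a standard transformers fine-tuning script, and the exact `OVTrainer`/`OVTrainingArguments` signatures should be checked against the optimum-intel examples linked below.

```python
from transformers import AutoModelForSequenceClassification

# Load the teacher model used for distillation (a fine-tuned model for the same task)
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_model_id)

# Wrap the NNCF compression dictionary shown earlier
ov_config = OVConfig(compression=compression_config)

trainer = OVTrainer(
    model=model,
    teacher_model=teacher_model,
    args=OVTrainingArguments(
        save_dir,
        num_train_epochs=1.0,
        do_train=True,
        do_eval=True,
        # Distillation hyperparameters added on top of TrainingArguments
        distillation_temperature=3,
        distillation_weight=0.9,
    ),
    ov_config=ov_config,
    task="text-classification",
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
)
```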
For more details on movement sparsity and how to configure it, see the NNCF documentation [here](https://github.com/openvinotoolkit/nncf/blob/develop/nncf/experimental/torch/sparsity/movement/MovementSparsity.md).
For more on the algorithms available in NNCF, see the documentation [here](https://github.com/openvinotoolkit/nncf/tree/develop/docs/usage/training_time_compression/other_algorithms).
For complete JPQD scripts, please refer to examples provided [here](https://github.com/huggingface/optimum-intel/tree/main/examples/openvino).
Quantization-Aware Training (QAT) and knowledge distillation can also be combined in order to optimize Stable Diffusion models while maintaining accuracy. For more details, take a look at this [blog post](https://huggingface.co/blog/train-optimize-sd-intel).
## Inference with Transformers pipeline
After applying quantization on our model, we can then easily load it with our `OVModelFor<Task>` classes and perform inference with OpenVINO Runtime using the Transformers [pipelines](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines).
```python
from transformers import pipeline
from optimum.intel import OVModelForSequenceClassification

# Load the quantized model saved earlier (the path is illustrative)
model = OVModelForSequenceClassification.from_pretrained("ov_static_quantized")
# OVModel classes plug directly into the Transformers pipeline API
cls_pipe = pipeline("text-classification", model=model, tokenizer="ov_static_quantized")
results = cls_pipe("He's a dreadful magician.")
```