Commit 5342505

Update docs
1 parent a2e1e73 commit 5342505

1 file changed, +209 -0 lines changed

docs/source/openvino/optimization.mdx

<!---
Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Optimization

🤗 Optimum Intel provides an `openvino` package that enables you to apply a variety of model quantization methods on many models hosted on the 🤗 hub using the [NNCF](https://docs.openvino.ai/2024/openvino-workflow/model-optimization.html) framework.

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and/or the activations with lower-precision data types such as 8-bit or 4-bit.

## Weight-only quantization

Quantization can be applied to a model's Linear, Convolutional and Embedding layers, enabling the loading of large models on memory-limited devices. For example, with 8-bit quantization the resulting model is 4x smaller than its fp32 counterpart. With 4-bit quantization the memory reduction could theoretically reach 8x, but is closer to 6x in practice.

### 8-bit

For 8-bit weight quantization, set `quantization_config` to `OVWeightQuantizationConfig(bits=8)` to load your model's weights in 8-bit:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "helenai/gpt2-ov"
quantization_config = OVWeightQuantizationConfig(bits=8)
model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

# Save the int8 model, which will be 4x smaller than its fp32 counterpart
model.save_pretrained(saving_directory)
```

Weights of language models inside vision-language pipelines can be quantized in a similar way:
```python
from optimum.intel import OVModelForVisualCausalLM

model = OVModelForVisualCausalLM.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config
)
```

<Tip warning={true}>

If `quantization_config` is not provided, the model will be exported in 8 bits by default when it has more than 1 billion parameters. You can disable this with `load_in_8bit=False`.

</Tip>
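
For instance, a minimal sketch of opting out of this default when exporting a model (reusing `model_id` from the snippets above; the flag only matters for models above the 1 billion parameter threshold):

```python
from optimum.intel import OVModelForCausalLM

# Export without the default 8-bit weight compression that is applied
# to models with more than 1 billion parameters
model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=False)
```
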
### 4-bit

4-bit weight quantization can be achieved in a similar way:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(bits=4)
model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
```

Or for vision-language pipelines:
```python
model = OVModelForVisualCausalLM.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config
)
```

You can tune the quantization parameters to achieve a better performance/accuracy trade-off as follows:

```python
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    ratio=0.8,
    quant_method="awq",
    dataset="wikitext2"
)
```

By default, the quantization scheme is [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization); to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization) you can add `sym=True`.

For 4-bit quantization you can also specify the following arguments in the quantization configuration:
* The `group_size` parameter defines the group size to use for quantization; `-1` results in per-column quantization.
* The `ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.

Smaller `group_size` and `ratio` values usually improve accuracy at the cost of model size and inference latency.
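
For example, a minimal sketch combining these parameters (the values here are illustrative, not recommendations):

```python
from optimum.intel import OVWeightQuantizationConfig

# Group-wise symmetric int4 for 90% of the layers, int8 for the remaining 10%
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=True,
    group_size=128,
    ratio=0.9,
)
```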

The quality of a 4-bit weight-compressed model can be further improved by employing one of the following data-dependent methods:
* **AWQ** (Activation-aware Weight Quantization) is an algorithm that tunes model weights for more accurate 4-bit compression. It slightly improves the generation quality of compressed LLMs, but requires significant additional time and memory for tuning weights on a calibration dataset. Note that if the model contains no matching patterns to apply AWQ to, the algorithm will be skipped.
* **Scale Estimation** is a method that tunes quantization scales to minimize the `L2` error between the original and compressed layers. Providing a dataset is required to run scale estimation. Using this method also incurs additional time and memory overhead.
* **GPTQ** optimizes compressed weights in a layer-wise fashion to minimize the difference between the activations of a compressed layer and the original one.
* **LoRA Correction** mitigates the quantization noise introduced during weight compression by leveraging low-rank adaptation.

Data-aware algorithms can be applied together or separately. To do so, provide the corresponding arguments to the 4-bit `OVWeightQuantizationConfig` together with a dataset. For example:
```python
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    ratio=0.8,
    quant_method="awq",
    scale_estimation=True,
    gptq=True,
    dataset="wikitext2"
)
```

Note: GPTQ and LoRA Correction algorithms can't be applied simultaneously.
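
For instance, a minimal sketch enabling LoRA Correction instead of GPTQ (the `lora_correction` flag name is an assumption here; check the `OVWeightQuantizationConfig` reference for your installed version):

```python
from optimum.intel import OVWeightQuantizationConfig

# LoRA Correction also requires a calibration dataset; GPTQ is left disabled
# since the two algorithms can't be combined
quantization_config = OVWeightQuantizationConfig(
    bits=4,
    quant_method="awq",
    scale_estimation=True,
    lora_correction=True,  # assumed flag name
    dataset="wikitext2",
)
```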
## Static quantization

When applying post-training static quantization, both the weights and the activations are quantized.
To quantize the activations, an additional calibration step is needed, which consists of feeding a `calibration_dataset` to the network in order to estimate the activation quantization parameters.

Here is how to apply static quantization on a fine-tuned DistilBERT given your own `calibration_dataset`:

```python
from transformers import AutoTokenizer
from optimum.intel import OVQuantizer, OVModelForSequenceClassification, OVConfig, OVQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The directory where the quantized model will be saved
save_dir = "ptq_model"

quantizer = OVQuantizer.from_pretrained(model)

# Apply static quantization and export the resulting quantized model to OpenVINO IR format
ov_config = OVConfig(quantization_config=OVQuantizationConfig())
quantizer.quantize(ov_config=ov_config, calibration_dataset=calibration_dataset, save_directory=save_dir)
# Save the tokenizer
tokenizer.save_pretrained(save_dir)
```

The calibration dataset can also be created easily using your `OVQuantizer`:

```python
from functools import partial

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Create the calibration dataset used to perform static quantization
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)
```

The `quantize()` method applies post-training static quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.

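As a quick usage sketch (not part of the original example), the saved IR can be reloaded with the same `OVModelForSequenceClassification` class and used with a standard `transformers` pipeline:

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

# Reload the quantized IR saved in `save_dir` ("ptq_model" above)
model = OVModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)

cls_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(cls_pipe("He's a dreadful magician."))
```
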
### Speech-to-text Models Quantization

The speech-to-text Whisper model can be quantized without the need to prepare a custom calibration dataset. See the example below.

```python
from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig

model_id = "openai/whisper-tiny"
ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    quantization_config=OVQuantizationConfig(
        num_samples=10,
        dataset="librispeech",
        processor=model_id,
        matmul_sq_alpha=0.95,
    )
)
```

With this, the encoder, decoder and decoder-with-past models of the Whisper pipeline will be fully quantized, including activations.

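A brief sanity-check sketch (not part of the original example) of running the quantized model, assuming `ov_model` and `model_id` from the snippet above:

```python
import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(model_id)

# One second of silence as a stand-in; replace with a real 16 kHz waveform
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

predicted_ids = ov_model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```
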
## Hybrid quantization

Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion (SD) models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of the activations is comparable to that of the weights.
The U-Net component takes up most of the overall execution time of the pipeline. Thus, optimizing just this one component can bring substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial accuracy degradation.
Therefore, the proposal is to apply quantization in *hybrid mode* for the U-Net model and weight-only quantization for the rest of the pipeline components:
* U-Net: quantization applied on both the weights and activations
* The text encoder, VAE encoder / decoder: quantization applied on the weights

The hybrid mode involves the quantization of weights in MatMul and Embedding layers, and activations of other layers, preserving accuracy after optimization while reducing the model size.

The `quantization_config` is used to define the optimization parameters for the SD pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. If the dataset is not defined, weight-only quantization is applied to all components.

```python
from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

# `model_id` should point to a Stable Diffusion checkpoint on the Hub or on disk
model = OVStableDiffusionPipeline.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
)
```

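Once exported, the optimized pipeline can be called like a regular diffusers-style pipeline; for example (the prompt is illustrative):

```python
# Generate an image with the hybrid-quantized pipeline
image = model("sailing ship in storm by Rembrandt").images[0]
image.save("result.png")
```
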
For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/post_training_compression/weights_compression/Usage.md).
