
Commit 481f369

Merge branch 'huggingface:main' into ipex_readme

2 parents: fe01c57 + d2f9fdb

37 files changed, +3055 -1624 lines

.github/workflows/test_openvino.yml (+6)

@@ -36,3 +36,9 @@ jobs:
       - name: Test with Pytest
         run: |
           pytest tests/openvino/ --ignore test_modeling_basic
+      - name: Test openvino-nightly
+        run: |
+          pip uninstall -y openvino
+          pip install openvino-nightly
+          python -c "from optimum.intel import OVModelForCausalLM; OVModelForCausalLM.from_pretrained('hf-internal-testing/tiny-random-gpt2', export=True, compile=False)"
+          optimum-cli export openvino -m hf-internal-testing/tiny-random-gpt2 gpt2-ov
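For reference, the nightly sanity check added above boils down to the following standalone Python call; this is simply the workflow's `python -c` one-liner written out as a sketch, using the same tiny test model and flags:

```python
from optimum.intel import OVModelForCausalLM

# Export the tiny test model to the OpenVINO IR format without compiling it,
# mirroring the one-liner in the new "Test openvino-nightly" step.
model = OVModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-gpt2",
    export=True,
    compile=False,
)
```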
.github/workflows/test_openvino_examples.yml (new file, +46)

@@ -0,0 +1,46 @@
+name: OpenVINO - Examples Test
+
+on:
+  workflow_dispatch:
+  schedule:
+    - cron: 0 1 * * 1 # run weekly: every Monday at 1am
+  push:
+    paths:
+      - '.github/workflows/test_openvino_examples.yml'
+      - 'examples/openvino/*'
+  pull_request:
+    paths:
+      - '.github/workflows/test_openvino_examples.yml'
+      - 'examples/openvino/*'
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  build:
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.8", "3.10"]
+
+    runs-on: ubuntu-20.04
+
+    steps:
+      - uses: actions/checkout@v2
+      - name: Setup Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v2
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install dependencies
+        run: |
+          pip install optimum[openvino] jstyleson nncf pytest
+          pip install -r examples/openvino/audio-classification/requirements.txt
+          pip install -r examples/openvino/image-classification/requirements.txt
+          pip install -r examples/openvino/question-answering/requirements.txt
+          pip install -r examples/openvino/text-classification/requirements.txt
+
+      - name: Test examples
+        run: |
+          python -m pytest examples/openvino/test_examples.py

.github/workflows/test_openvino_notebooks.yml (+2)

@@ -49,5 +49,7 @@ jobs:

       - name: Test with Pytest
         run: |
+          sed -i 's/NUM_TRAIN_ITEMS = 600/NUM_TRAIN_ITEMS = 10/' notebooks/openvino/question_answering_quantization.ipynb
+          sed -i 's/# %pip install/%pip install/' notebooks/openvino/optimum_openvino_inference.ipynb
           python -m pytest --nbval-lax notebooks/openvino/optimum_openvino_inference.ipynb notebooks/openvino/question_answering_quantization.ipynb


Makefile (+2 -2)

@@ -22,11 +22,11 @@ REAL_CLONE_URL = $(if $(CLONE_URL),$(CLONE_URL),$(DEFAULT_CLONE_URL))
 # Run code quality checks
 style_check:
 	black --check .
-	ruff .
+	ruff check .

 style:
 	black .
-	ruff . --fix
+	ruff check . --fix

 # Run tests for the library
 test:

docs/source/inference.mdx (+6 -5)

@@ -99,21 +99,22 @@ tokenizer.save_pretrained(save_directory)

 ### Weight-only quantization

-You can also apply 8-bit or 4-bit weight quantization when exporting your model with the CLI by setting the `weight-format` argument to respectively `int8` or `int4`:
+You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:

 ```bash
 optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
 ```

-This will result in the exported model linear and embedding layers to be quantized to INT8 or INT4, the activations will be kept in floating point precision. This type of optimization allows reducing the footprint and latency of LLMs.
+This type of optimization allows to reduce the memory footprint and inference latency.

-By default the quantization scheme will be [assymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`.
+
+By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`.

 For INT4 quantization you can also specify the following arguments :
 * The `--group-size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization.
 * The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.

-Smaller `group_size` and `ratio` of usually improve accuracy at the sacrifice of the model size and inference latency.
+Smaller `group_size` and `ratio` values usually improve accuracy at the sacrifice of the model size and inference latency.

 You can also apply 8-bit quantization on your model's weight when loading your model by setting the `load_in_8bit=True` argument when calling the `from_pretrained()` method.

@@ -125,7 +126,7 @@ model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)

 <Tip warning={true}>

-`load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+`load_in_8bit` is enabled by default for the models larger than 1 billion parameters. You can disable it with `load_in_8bit=False`.

 </Tip>

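As a companion to the CLI flags documented above, the same INT4 settings can be expressed through the `OVWeightQuantizationConfig` Python API described elsewhere in this commit. A rough sketch with illustrative values (the `group_size` and `ratio` numbers are examples, not recommendations):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Illustrative values: sym=True mirrors --sym, group_size=128 mirrors --group-size,
# ratio=0.8 keeps ~80% of the layers in int4 and the remainder in int8.
quantization_config = OVWeightQuantizationConfig(bits=4, sym=True, group_size=128, ratio=0.8)

# Export the transformers checkpoint to OpenVINO and compress the weights on the fly.
model = OVModelForCausalLM.from_pretrained("gpt2", export=True, quantization_config=quantization_config)
model.save_pretrained("ov_model")
```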
docs/source/optimization_ov.mdx (+90 -42)

@@ -19,27 +19,95 @@ limitations under the License.
 🤗 Optimum Intel provides an `openvino` package that enables you to apply a variety of model compression methods such as quantization, pruning, on many models hosted on the 🤗 hub using the [NNCF](https://docs.openvino.ai/2022.1/docs_nncf_introduction.html) framework.


-## Post-training optimization
+## Post-training

-Post-training static quantization introduces an additional calibration step where data is fed through the network in order to compute the activations quantization parameters.
-Here is how to apply static quantization on a fine-tuned DistilBERT:
+Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and / or the activations with lower precision data types like 8-bit or 4-bit.
+
+### Weight-only quantization
+
+Quantization can be applied on the model's Linear, Convolutional and Embedding layers, enabling the loading of large models on memory-limited devices. For example, when applying 8-bit quantization, the resulting model will be x4 smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach x8, but is closer to x6 in practice.
+
+
+#### 8-bit
+
+For the 8-bit weight quantization you can set `load_in_8bit=True` to load your model's weights in 8-bit:

 ```python
-from functools import partial
-from transformers import AutoTokenizer
-from optimum.intel import OVConfig, OVQuantizer, OVModelForSequenceClassification,
+from optimum.intel import OVModelForCausalLM
+
+model_id = "helenai/gpt2-ov"
+model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
+
+# Saves the int8 model that will be x4 smaller than its fp32 counterpart
+model.save_pretrained(saving_directory)
+```
+
+<Tip warning={true}>
+
+`load_in_8bit` is enabled by default for the models larger than 1 billion parameters. You can disable it with `load_in_8bit=False`.
+
+</Tip>
+
+You can also provide a `quantization_config` instead to specify additional optimization parameters.
+
+#### 4-bit
+
+For the 4-bit weight quantization, you need a `quantization_config` to define the optimization parameters, for example:
+
+```python
+from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
+
+quantization_config = OVWeightQuantizationConfig(bits=4)
+model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
+```
+
+You can tune quantization parameters to achieve a better performance accuracy trade-off as follows:
+
+```python
+quantization_config = OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb")
+```
+
+By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `sym=True`.
+
+For 4-bit quantization you can also specify the following arguments in the quantization configuration :
+* The `group_size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization.
+* The `ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.
+
+Smaller `group_size` and `ratio` values usually improve accuracy at the sacrifice of the model size and inference latency.
+
+### Static quantization
+
+When applying post-training static quantization, both the weights and the activations are quantized.
+To apply quantization on the activations, an additional calibration step is needed which consists in feeding a `calibration_dataset` to the network in order to estimate the quantization activations parameters.
+
+Here is how to apply static quantization on a fine-tuned DistilBERT given your own `calibration_dataset`:
+
+```python
+from transformers import AutoTokenizer
+from optimum.intel import OVQuantizer, OVModelForSequenceClassification,

 model_id = "distilbert-base-uncased-finetuned-sst-2-english"
 model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 # The directory where the quantized model will be saved
 save_dir = "ptq_model"

+quantizer = OVQuantizer.from_pretrained(model)
+
+# Apply static quantization and export the resulting quantized model to OpenVINO IR format
+quantizer.quantize(calibration_dataset=calibration_dataset, save_directory=save_dir)
+# Save the tokenizer
+tokenizer.save_pretrained(save_dir)
+```
+
+The calibration dataset can also be created easily using your `OVQuantizer`:
+
+```python
+from functools import partial
+
 def preprocess_function(examples, tokenizer):
     return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

-# Instantiate our OVQuantizer using the desired configuration
-quantizer = OVQuantizer.from_pretrained(model)
 # Create the calibration dataset used to perform static quantization
 calibration_dataset = quantizer.get_calibration_dataset(
     "glue",
@@ -48,59 +116,39 @@ calibration_dataset = quantizer.get_calibration_dataset(
     num_samples=300,
     dataset_split="train",
 )
-# Apply static quantization and export the resulting quantized model to OpenVINO IR format
-quantizer.quantize(
-    calibration_dataset=calibration_dataset,
-    save_directory=save_dir,
-)
-# Save the tokenizer
-tokenizer.save_pretrained(save_dir)
 ```

-The `quantize()` method applies post-training static quantization and export the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.

-## Weight-only quantization
+The `quantize()` method applies post-training static quantization and export the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.

-You can optimize the performance of text-generation LLMs by quantizing weights to various precisions that provide different performance-accuracy trade-offs.

-```python
-from optimum.intel import OVModelForCausalLM
+### Hybrid quantization

-model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
-```
+Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion (SD) models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of activations is comparable to weights.
+The U-Net component takes up most of the overall execution time of the pipeline. Thus, optimizing just this one component can bring substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial accuracy degradation.
+Therefore, the proposal is to apply quantization in *hybrid mode* for the U-Net model and weight-only quantization for the rest of the pipeline components :
+* U-Net : quantization applied on both the weights and activations
+* The text encoder, VAE encoder / decoder : quantization applied on the weights

-<Tip warning={true}>
-
-`load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
-
-</Tip>
+The hybrid mode involves the quantization of weights in MatMul and Embedding layers, and activations of other layers, facilitating accuracy preservation post-optimization while reducing the model size.

-For the 4-bit weight quantization you can use the `quantization_config` to specify the optimization parameters, for example:
+The `quantization_config` is utilized to define optimization parameters for optimizing the SD pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. If the dataset is not defined, weight-only quantization will be applied on all components.

 ```python
-from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
+from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

-model = OVModelForCausalLM.from_pretrained(
+model = OVStableDiffusionPipeline.from_pretrained(
     model_id,
-    quantization_config=OVWeightQuantizationConfig(bits=4),
+    export=True,
+    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
 )
 ```

-You can tune quantization parameters to achieve a better performance accuracy trade-off as follows:
-
-```python
-from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
-
-model = OVModelForCausalLM.from_pretrained(
-    model_id,
-    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
-)
-```

 For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/CompressWeights.md).


-## Training-time optimization
+## Training-time

 Apart from optimizing a model after training like post-training quantization above, `optimum.openvino` also provides optimization methods during training, namely Quantization-Aware Training (QAT) and Joint Pruning, Quantization and Distillation (JPQD).

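Pieced together, the two static-quantization snippets added above correspond to roughly the following end-to-end flow. This is a sketch: the `dataset_config_name` and `preprocess_function` arguments of `get_calibration_dataset()` are elided in the hunk and are assumed here from the usual optimum-intel SST-2 example.

```python
from functools import partial

from transformers import AutoTokenizer
from optimum.intel import OVModelForSequenceClassification, OVQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
save_dir = "ptq_model"

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

quantizer = OVQuantizer.from_pretrained(model)

# Build the calibration dataset from GLUE / SST-2 (assumed config, as in the optimum-intel example)
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)

# Quantize, export to the OpenVINO IR format, and save the tokenizer alongside
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory=save_dir)
tokenizer.save_pretrained(save_dir)
```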
examples/openvino/image-classification/run_image_classification.py (+8 -10)

@@ -151,12 +151,12 @@ class ModelArguments:
         metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
     )
     feature_extractor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."})
-    use_auth_token: bool = field(
-        default=False,
+    token: str = field(
+        default=None,
         metadata={
             "help": (
-                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
-                "with private models)."
+                "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
+                "generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
             )
         },
     )
@@ -239,8 +239,7 @@ def main():
             data_args.dataset_name,
             data_args.dataset_config_name,
             cache_dir=model_args.cache_dir,
-            task="image-classification",
-            use_auth_token=True if model_args.use_auth_token else None,
+            token=model_args.token,
         )
     else:
         data_files = {}
@@ -252,7 +251,6 @@ def main():
             "imagefolder",
             data_files=data_files,
             cache_dir=model_args.cache_dir,
-            task="image-classification",
         )

     # If we don't have a validation split, split off a percentage of train as validation.
@@ -287,15 +285,15 @@ def compute_metrics(p):
         finetuning_task="image-classification",
         cache_dir=model_args.cache_dir,
         revision=model_args.model_revision,
-        use_auth_token=True if model_args.use_auth_token else None,
+        token=model_args.token,
     )
     model = AutoModelForImageClassification.from_pretrained(
         model_args.model_name_or_path,
         from_tf=bool(".ckpt" in model_args.model_name_or_path),
         config=config,
         cache_dir=model_args.cache_dir,
         revision=model_args.model_revision,
-        use_auth_token=True if model_args.use_auth_token else None,
+        token=model_args.token,
         ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
     )

@@ -311,7 +309,7 @@ def compute_metrics(p):
         model_args.feature_extractor_name or model_args.model_name_or_path,
         cache_dir=model_args.cache_dir,
         revision=model_args.model_revision,
-        use_auth_token=True if model_args.use_auth_token else None,
+        token=model_args.token,
     )

     # Define torchvision transforms to be applied to each image.
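The recurring change in this file swaps the deprecated boolean `use_auth_token` flag for a `token` string that is forwarded directly to the hub-loading calls. A minimal sketch of the new pattern, assuming a hypothetical checkpoint name:

```python
from transformers import AutoConfig, AutoModelForImageClassification

model_name_or_path = "google/vit-base-patch16-224-in21k"  # hypothetical checkpoint for illustration
hf_token = None  # a token string, or None to fall back to the token cached by `huggingface-cli login`

# Old style (removed by this commit): use_auth_token=True if model_args.use_auth_token else None
# New style: pass the token string straight through.
config = AutoConfig.from_pretrained(model_name_or_path, finetuning_task="image-classification", token=hf_token)
model = AutoModelForImageClassification.from_pretrained(model_name_or_path, config=config, token=hf_token)
```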
