
Commit d9e3a0f

Merge branch 'main' into del-convert-tokenizer-flag
2 parents 40f227d + 358f389 commit d9e3a0f

38 files changed (+1,297 −272 lines)

.github/workflows/test_openvino.yml

+1 −1

@@ -32,7 +32,7 @@ jobs:
        python -m pip install --upgrade pip
        # install PyTorch CPU version to avoid installing CUDA packages on GitHub runner without GPU
        pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-        pip install .[openvino,openvino-tokenizers,nncf,tests,diffusers]
+        pip install .[openvino,openvino-tokenizers,tests,diffusers] onnxruntime
    - name: Test with Pytest
      run: |
        pytest tests/openvino/ --ignore test_modeling_basic

README.md

+4 −4

@@ -10,7 +10,7 @@

Intel [Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.

-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.


## Installation

@@ -20,7 +20,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
| Accelerator | Installation |
|:------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------------------|
| [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"` |
-| [OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [OpenVINO](https://docs.openvino.ai) | `pip install --upgrade-strategy eager "optimum[openvino]"` |
| [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade-strategy eager "optimum[ipex]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -68,11 +68,11 @@ For more details on the supported compression techniques, please refer to the [d

## OpenVINO

-Below are the examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/latest/tmo_introduction.html) framework to accelerate inference.
+Below are examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/compressing-models-during-training.html) framework to accelerate inference.

#### Export:

-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI :
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI :

```plain
optimum-cli export openvino --model gpt2 ov_model
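
For context, a minimal sketch of how the exported `ov_model` directory is typically loaded back for inference; the prompt and the use of the `transformers` `pipeline` helper are illustrative, not part of this commit:

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForCausalLM

# Load the OpenVINO IR produced by `optimum-cli export openvino --model gpt2 ov_model`
model = OVModelForCausalLM.from_pretrained("ov_model")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Text generation now runs on OpenVINO Runtime instead of PyTorch
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("He never took the same path twice because")[0]["generated_text"])
```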

docs/source/index.mdx

+2 −2

@@ -21,7 +21,7 @@ limitations under the License.

[Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.

-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.

<div class="mt-10">
<div class="w-full flex flex-col space-x-4 md:grid md:grid-cols-2 md:gap-x-5">
@@ -34,4 +34,4 @@ limitations under the License.
<p class="text-gray-700">Learn how to run inference with OpenVINO Runtime and to apply quantization, pruning and knowledge distillation on your model to further speed up inference.</p>
</a>
</div>
-</div>
+</div>

docs/source/inference.mdx

+11 −7

@@ -13,7 +13,8 @@ Optimum Intel can be used to load optimized models from the [Hugging Face Hub](h

## Transformers models

-You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices).
+You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors
+([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices).
For that, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.

As shown in the table below, each task is associated with a class enabling to automatically load your model.

@@ -33,7 +34,7 @@ As shown in the table below, each task is associated with a class enabling to au

### Export

-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI :
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI :

```bash
optimum-cli export openvino --model gpt2 ov_model

@@ -110,7 +111,7 @@ By default the quantization scheme will be [assymmetric](https://github.com/open

For INT4 quantization you can also specify the following arguments :
* The `--group-size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization.
-* The `--ratio` CLI parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.
+* The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.

Smaller `group_size` and `ratio` of usually improve accuracy at the sacrifice of the model size and inference latency.

@@ -122,8 +123,11 @@ from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
```

-> **NOTE:** `load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+<Tip warning={true}>

+`load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+
+</Tip>

To apply quantization on both weights and activations, you can use the `OVQuantizer`, more information in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov#optimization).

@@ -179,7 +183,7 @@ model.reshape(1,128)
model.compile()
```

-To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/nightly/openvino_docs_install_guides_configurations_for_intel_gpu.html) about installing drivers for GPU inference).
+To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html) about installing drivers for GPU inference).

```python
# Static shapes speed up inference

@@ -468,15 +472,15 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0]
```


-## Latent Consistency Models
+### Latent Consistency Models


| Task                                 | Auto Class                            |
|--------------------------------------|---------------------------------------|
| `text-to-image`                      | `OVLatentConsistencyModelPipeline`    |


-### Text-to-Image
+#### Text-to-Image

Here is an example of how you can load a Latent Consistency Models (LCMs) from [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) and run inference using OpenVINO :
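
The example referenced above falls outside this hunk; a minimal sketch of what it typically looks like, assuming the usual diffusers-style arguments (the prompt, `num_inference_steps` and `guidance_scale` values are illustrative):

```python
from optimum.intel import OVLatentConsistencyModelPipeline

# Export the LCM checkpoint to OpenVINO IR on the fly and load it
pipeline = OVLatentConsistencyModelPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", export=True)

# LCMs need only a handful of denoising steps
prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt, num_inference_steps=4, guidance_scale=8.0).images[0]
image.save("ship.png")
```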

docs/source/installation.mdx

+2 −2

@@ -21,7 +21,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
| Accelerator | Installation |
|:--------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------|
| [Intel Neural Compressor (INC)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"`|
-| [Intel OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [Intel OpenVINO](https://docs.openvino.ai) | `pip install --upgrade-strategy eager "optimum[openvino]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -42,4 +42,4 @@ or to install from source including dependencies:
python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
```

-where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
+where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.

docs/source/optimization_ov.mdx

+33 −5

@@ -38,8 +38,6 @@ save_dir = "ptq_model"
def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

-# Load the default quantization configuration detailing the quantization we wish to apply
-quantization_config = OVConfig()
# Instantiate our OVQuantizer using the desired configuration
quantizer = OVQuantizer.from_pretrained(model)
# Create the calibration dataset used to perform static quantization

@@ -52,7 +50,6 @@ calibration_dataset = quantizer.get_calibration_dataset(
)
# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(
-    quantization_config=quantization_config,
    calibration_dataset=calibration_dataset,
    save_directory=save_dir,
)

@@ -72,7 +69,28 @@ from optimum.intel import OVModelForCausalLM
model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
```

-> **NOTE:** `load_in_8bit` is enabled by default for models larger than 1 billion parameters.
+## Hybrid quantization
+
+Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of activations is comparable to weights.
+The UNet model takes up most of the overall execution time of the pipeline. Thus, optimizing just one model brings substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial degradation of accuracy.
+Therefore, the proposal is to apply quantization in *hybrid mode* for the UNet model and weight-only quantization for the rest of the pipeline components. The hybrid mode involves the quantization of weights in MatMul and Embedding layers, and activations of other layers, facilitating accuracy preservation post-optimization while reducing the model size.
+The `quantization_config` is utilized to define optimization parameters for optimizing the Stable Diffusion pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. Otherwise, weight-only quantization to a specified data type (8 or 4 bits) is applied to the UNet model.
+
+```python
+from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig
+
+model = OVStableDiffusionPipeline.from_pretrained(
+    model_id,
+    export=True,
+    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
+)
+```
+
+<Tip warning={true}>
+
+`load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+
+</Tip>

For the 4-bit weight quantization you can use the `quantization_config` to specify the optimization parameters, for example:

@@ -81,7 +99,17 @@ from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
    model_id,
-    export=True,
+    quantization_config=OVWeightQuantizationConfig(bits=4),
+)
+```
+
+You can tune quantization parameters to achieve a better performance accuracy trade-off as follows:
+
+```python
+from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
+
+model = OVModelForCausalLM.from_pretrained(
+    model_id,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
)
```
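
For context, a minimal sketch of how such a weight-compressed model is typically saved and reloaded so the compression is not redone on every run; `gpt2` and the output directory name are illustrative, not part of this commit:

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "gpt2"  # hypothetical checkpoint used for illustration

# Apply 4-bit weight-only compression while exporting the model to OpenVINO IR
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)

# Persist the compressed IR and reload it later without re-quantizing
model.save_pretrained("gpt2-ov-int4")
model = OVModelForCausalLM.from_pretrained("gpt2-ov-int4")
```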

examples/openvino/image-classification/run_image_classification.py

+8 −10

@@ -151,12 +151,12 @@ class ModelArguments:
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    feature_extractor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."})
-    use_auth_token: bool = field(
-        default=False,
+    token: str = field(
+        default=None,
        metadata={
            "help": (
-                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
-                "with private models)."
+                "The token to use as HTTP bearer authorization for remote files. If not specified, will use the token "
+                "generated when running `huggingface-cli login` (stored in `~/.huggingface`)."
            )
        },
    )

@@ -239,8 +239,7 @@ def main():
            data_args.dataset_name,
            data_args.dataset_config_name,
            cache_dir=model_args.cache_dir,
-            task="image-classification",
-            use_auth_token=True if model_args.use_auth_token else None,
+            token=model_args.token,
        )
    else:
        data_files = {}

@@ -252,7 +251,6 @@ def main():
            "imagefolder",
            data_files=data_files,
            cache_dir=model_args.cache_dir,
-            task="image-classification",
        )

    # If we don't have a validation split, split off a percentage of train as validation.

@@ -287,15 +285,15 @@ def compute_metrics(p):
        finetuning_task="image-classification",
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
-        use_auth_token=True if model_args.use_auth_token else None,
+        token=model_args.token,
    )
    model = AutoModelForImageClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
-        use_auth_token=True if model_args.use_auth_token else None,
+        token=model_args.token,
        ignore_mismatched_sizes=model_args.ignore_mismatched_sizes,
    )

@@ -311,7 +309,7 @@ def compute_metrics(p):
        model_args.feature_extractor_name or model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
-        use_auth_token=True if model_args.use_auth_token else None,
+        token=model_args.token,
    )

    # Define torchvision transforms to be applied to each image.
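
For context, a brief sketch of the calling convention this change migrates to, passing an explicit `token` (or `None` to fall back to the cached `huggingface-cli login` credentials) instead of the boolean `use_auth_token` flag; the checkpoint name is illustrative:

```python
from transformers import AutoModelForImageClassification

# Before (deprecated): use_auth_token=True pulled the cached login token implicitly
# model = AutoModelForImageClassification.from_pretrained(model_id, use_auth_token=True)

# After: forward the token string directly, as the script now does with
# `token=model_args.token`; None falls back to the cached credentials
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",  # illustrative public checkpoint
    token=None,
)
```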
