
Commit 4bb1370

merge with main
2 parents 6667661 + e036ce6


46 files changed, +3203 -599 lines

.github/workflows/test_openvino.yml (+7 -1)
@@ -32,7 +32,13 @@ jobs:
           python -m pip install --upgrade pip
           # install PyTorch CPU version to avoid installing CUDA packages on GitHub runner without GPU
           pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-          pip install .[openvino,openvino-tokenizers,nncf,tests,diffusers]
+          pip install .[openvino,openvino-tokenizers,tests,diffusers] onnxruntime
       - name: Test with Pytest
         run: |
           pytest tests/openvino/ --ignore test_modeling_basic
+      - name: Test openvino-nightly
+        run: |
+          pip uninstall -y openvino
+          pip install openvino-nightly
+          python -c "from optimum.intel import OVModelForCausalLM; OVModelForCausalLM.from_pretrained('hf-internal-testing/tiny-random-gpt2', export=True, compile=False)"
+          optimum-cli export openvino -m hf-internal-testing/tiny-random-gpt2 gpt2-ov
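The new `Test openvino-nightly` step exercises both the Python export API and the CLI. For reference, a minimal standalone sketch of the same Python check (the model ID comes from the workflow; writing the IR to `gpt2-ov` roughly mirrors the CLI step and is an illustrative assumption):

```python
# Sketch of the check run by the "Test openvino-nightly" step:
# export a tiny model to OpenVINO IR without compiling it for a device.
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-gpt2",  # tiny test model used by the workflow
    export=True,    # convert the PyTorch checkpoint to OpenVINO IR on the fly
    compile=False,  # skip device compilation; loading alone validates the export
)
model.save_pretrained("gpt2-ov")  # write the IR locally, similar to the CLI step's output dir
```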

.github/workflows/test_openvino_notebooks.yml (+2)
@@ -49,5 +49,7 @@ jobs:
 
       - name: Test with Pytest
         run: |
+          sed -i 's/NUM_TRAIN_ITEMS = 600/NUM_TRAIN_ITEMS = 10/' notebooks/openvino/question_answering_quantization.ipynb
+          sed -i 's/# %pip install/%pip install/' notebooks/openvino/optimum_openvino_inference.ipynb
           python -m pytest --nbval-lax notebooks/openvino/optimum_openvino_inference.ipynb notebooks/openvino/question_answering_quantization.ipynb

README.md (+4 -4)
@@ -10,7 +10,7 @@
 
 Intel [Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.
 
-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
 
 
 ## Installation
@@ -20,7 +20,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
 | Accelerator | Installation |
 |:-----------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|
 | [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"` |
-| [OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [OpenVINO](https://docs.openvino.ai) | `pip install --upgrade-strategy eager "optimum[openvino]"` |
 | [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade-strategy eager "optimum[ipex]"` |
 
 The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
@@ -68,11 +68,11 @@ For more details on the supported compression techniques, please refer to the [d
 
 ## OpenVINO
 
-Below are the examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/latest/tmo_introduction.html) framework to accelerate inference.
+Below are examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/compressing-models-during-training.html) framework to accelerate inference.
 
 #### Export:
 
-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI :
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI :
 
 ```plain
 optimum-cli export openvino --model gpt2 ov_model
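To complement the export command in this hunk, a hedged sketch of loading the resulting `ov_model` directory and generating text with it (the prompt and generation settings are illustrative, not part of the commit):

```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

# Load the OpenVINO IR produced by `optimum-cli export openvino --model gpt2 ov_model`.
model = OVModelForCausalLM.from_pretrained("ov_model")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("OpenVINO is", return_tensors="pt")  # illustrative prompt
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```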

docs/source/index.mdx (+2 -2)
@@ -21,7 +21,7 @@ limitations under the License.
 
 [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.
 
-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
 
 <div class="mt-10">
   <div class="w-full flex flex-col space-x-4 md:grid md:grid-cols-2 md:gap-x-5">
@@ -34,4 +34,4 @@ limitations under the License.
     <p class="text-gray-700">Learn how to run inference with OpenVINO Runtime and to apply quantization, pruning and knowledge distillation on your model to further speed up inference.</p>
   </a>
 </div>
-</div>
+</div>

docs/source/inference.mdx (+16 -11)
@@ -13,7 +13,8 @@ Optimum Intel can be used to load optimized models from the [Hugging Face Hub](h
 
 ## Transformers models
 
-You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices).
+You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors
+([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices).
 For that, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.
 
 As shown in the table below, each task is associated with a class enabling to automatically load your model.
@@ -33,7 +34,7 @@ As shown in the table below, each task is associated with a class enabling to au
 
 ### Export
 
-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI :
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI :
 
 ```bash
 optimum-cli export openvino --model gpt2 ov_model
@@ -98,21 +99,22 @@ tokenizer.save_pretrained(save_directory)
 
 ### Weight-only quantization
 
-You can also apply 8-bit or 4-bit weight quantization when exporting your model with the CLI by setting the `weight-format` argument to respectively `int8` or `int4`:
+You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:
 
 ```bash
 optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
 ```
 
-This will result in the exported model linear and embedding layers to be quantized to INT8 or INT4, the activations will be kept in floating point precision. This type of optimization allows reducing the footprint and latency of LLMs.
+This type of optimization allows to reduce the memory footprint and inference latency.
 
-By default the quantization scheme will be [assymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`.
+
+By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`.
 
 For INT4 quantization you can also specify the following arguments :
 * The `--group-size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization.
-* The `--ratio` CLI parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.
+* The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.
 
-Smaller `group_size` and `ratio` of usually improve accuracy at the sacrifice of the model size and inference latency.
+Smaller `group_size` and `ratio` values usually improve accuracy at the sacrifice of the model size and inference latency.
 
 You can also apply 8-bit quantization on your model's weight when loading your model by setting the `load_in_8bit=True` argument when calling the `from_pretrained()` method.
 
@@ -122,8 +124,11 @@ from optimum.intel import OVModelForCausalLM
 model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
 ```
 
-> **NOTE:** `load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+<Tip warning={true}>
+
+`load_in_8bit` is enabled by default for the models larger than 1 billion parameters. You can disable it with `load_in_8bit=False`.
 
+</Tip>
 
 To apply quantization on both weights and activations, you can use the `OVQuantizer`, more information in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov#optimization).
 
@@ -179,7 +184,7 @@
 model.reshape(1,128)
 model.compile()
 ```
-To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/nightly/openvino_docs_install_guides_configurations_for_intel_gpu.html) about installing drivers for GPU inference).
+To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html) about installing drivers for GPU inference).
 
 ```python
 # Static shapes speed up inference
@@ -468,15 +473,15 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0]
 ```
 
 
-## Latent Consistency Models
+### Latent Consistency Models
 
 
 | Task                                 | Auto Class                           |
 |--------------------------------------|--------------------------------------|
 | `text-to-image`                      | `OVLatentConsistencyModelPipeline`   |
 
 
-### Text-to-Image
+#### Text-to-Image
 
 Here is an example of how you can load a Latent Consistency Models (LCMs) from [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) and run inference using OpenVINO :
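The weight-only quantization hunks above describe both export-time (`--weight-format`, `--sym`, `--group-size`, `--ratio`) and load-time (`load_in_8bit`) compression. A minimal sketch of the load-time behaviour, with `gpt2` used purely as an illustrative model ID:

```python
from optimum.intel import OVModelForCausalLM

# Explicitly request 8-bit weight compression while exporting at load time.
model_int8 = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=True)

# As the new <Tip> notes, models larger than 1 billion parameters get 8-bit weights
# by default; pass load_in_8bit=False to keep the original precision.
model_full = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=False)
```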
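The GPU paragraph updated in the `@@ -179,7 +184,7 @@` hunk sits next to the static-shape snippet quoted as context. A condensed sketch combining the two, assuming a sequence-classification model, an illustrative model ID, and a load-then-reshape-then-compile ordering (none of which are taken from the diff):

```python
from optimum.intel import OVModelForSequenceClassification

# Load without compiling, pick the target device, fix the input shape, then compile once.
model = OVModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True, compile=False
)
model.to("gpu")        # Intel iGPU/dGPU; FP16 by default, drivers per the linked OpenVINO docs
model.reshape(1, 128)  # static shapes (batch size 1, sequence length 128) speed up inference
model.compile()
```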

docs/source/installation.mdx (+2 -2)
@@ -21,7 +21,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
 | Accelerator | Installation |
 |:------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------|
 | [Intel Neural Compressor (INC)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"`|
-| [Intel OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [Intel OpenVINO](https://docs.openvino.ai ) | `pip install --upgrade-strategy eager "optimum[openvino]"` |
 
 The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
 
@@ -42,4 +42,4 @@ or to install from source including dependencies:
 python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
 ```
 
-where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
+where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
