
Commit de581ea

Merge branch 'main' into jit
2 parents 248f0d2 + 72b0630

27 files changed: +699 −114 lines

.github/workflows/test_openvino.yml

+1 −1

@@ -32,7 +32,7 @@ jobs:
        python -m pip install --upgrade pip
        # install PyTorch CPU version to avoid installing CUDA packages on GitHub runner without GPU
        pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-       pip install .[openvino,openvino-tokenizers,nncf,tests,diffusers]
+       pip install .[openvino,openvino-tokenizers,tests,diffusers] onnxruntime
    - name: Test with Pytest
      run: |
        pytest tests/openvino/ --ignore test_modeling_basic

README.md

+4 −4

@@ -10,7 +10,7 @@

Intel [Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.

-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.


## Installation

@@ -20,7 +20,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
| Accelerator | Installation |
|:-----------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|
| [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"` |
-| [OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [OpenVINO](https://docs.openvino.ai) | `pip install --upgrade-strategy eager "optimum[openvino]"` |
| [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade-strategy eager "optimum[ipex]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -68,11 +68,11 @@ For more details on the supported compression techniques, please refer to the [d

## OpenVINO

-Below are the examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/latest/tmo_introduction.html) framework to accelerate inference.
+Below are examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/compressing-models-during-training.html) framework to accelerate inference.

#### Export:

-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI :
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI :

```plain
optimum-cli export openvino --model gpt2 ov_model
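
For context on the export flow touched in this README hunk: once the command above has produced the `ov_model` IR directory, it can be loaded back through Optimum Intel's `OVModelForXxx` classes. A minimal sketch, not part of this commit (`gpt2` and `ov_model` are simply the names used in the README example):

```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

# Load the OpenVINO IR produced by `optimum-cli export openvino --model gpt2 ov_model`
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = OVModelForCausalLM.from_pretrained("ov_model")

inputs = tokenizer("OpenVINO is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```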

docs/source/index.mdx

+2 −2

@@ -21,7 +21,7 @@ limitations under the License.

[Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.

-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.

<div class="mt-10">
<div class="w-full flex flex-col space-x-4 md:grid md:grid-cols-2 md:gap-x-5">

@@ -34,4 +34,4 @@ limitations under the License.
<p class="text-gray-700">Learn how to run inference with OpenVINO Runtime and to apply quantization, pruning and knowledge distillation on your model to further speed up inference.</p>
</a>
</div>
-</div>
+</div>

docs/source/inference.mdx

+6 −5

@@ -13,7 +13,8 @@ Optimum Intel can be used to load optimized models from the [Hugging Face Hub](h

## Transformers models

-You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices).
+You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors
+([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices).
For that, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.

As shown in the table below, each task is associated with a class enabling to automatically load your model.

@@ -33,7 +34,7 @@ As shown in the table below, each task is associated with a class enabling to au

### Export

-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI :
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI :

```bash
optimum-cli export openvino --model gpt2 ov_model

@@ -182,7 +183,7 @@ model.reshape(1,128)
model.compile()
```

-To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/nightly/openvino_docs_install_guides_configurations_for_intel_gpu.html) about installing drivers for GPU inference).
+To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html) about installing drivers for GPU inference).

```python
# Static shapes speed up inference

@@ -471,15 +472,15 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0]
```


-## Latent Consistency Models
+### Latent Consistency Models


| Task | Auto Class |
|--------------------------------------|--------------------------------------|
| `text-to-image` | `OVLatentConsistencyModelPipeline` |


-### Text-to-Image
+#### Text-to-Image

Here is an example of how you can load a Latent Consistency Models (LCMs) from [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) and run inference using OpenVINO :
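The hunk above stops just before the LCM snippet itself. A hedged sketch of what loading `SimianLuo/LCM_Dreamshaper_v7` with the `OVLatentConsistencyModelPipeline` class from the table typically looks like (not the exact snippet from `inference.mdx`; the prompt, step count and guidance scale are illustrative):

```python
from optimum.intel import OVLatentConsistencyModelPipeline

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
pipeline = OVLatentConsistencyModelPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", export=True
)

prompt = "sailing ship in storm by Leonardo da Vinci"
# LCMs need only a handful of denoising steps
image = pipeline(prompt=prompt, num_inference_steps=4, guidance_scale=8.0).images[0]
image.save("ship.png")
```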

docs/source/installation.mdx

+2 −2

@@ -21,7 +21,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
| Accelerator | Installation |
|:-------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------|
| [Intel Neural Compressor (INC)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"`|
-| [Intel OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [Intel OpenVINO](https://docs.openvino.ai ) | `pip install --upgrade-strategy eager "optimum[openvino]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -42,4 +42,4 @@ or to install from source including dependencies:
python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
```

-where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
+where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.

docs/source/optimization_ov.mdx

+11 −1

@@ -82,7 +82,17 @@ from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
    model_id,
-    export=True,
+    quantization_config=OVWeightQuantizationConfig(bits=4),
+)
+```
+
+You can tune quantization parameters to achieve a better performance accuracy trade-off as follows:
+
+```python
+from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
+
+model = OVModelForCausalLM.from_pretrained(
+    model_id,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
)
```
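
Putting the two snippets from this hunk together: a model quantized on load this way can be saved and reloaded like any other Optimum Intel model, so the 4-bit compression only has to run once. A small sketch under that assumption (`model_id` and the output directory name are placeholders, not values from the diff):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "gpt2"  # placeholder; any supported causal LM id works here
model = OVModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
)

# Persist the quantized IR so later runs can skip quantization entirely
model.save_pretrained("ov_model_int4")
model = OVModelForCausalLM.from_pretrained("ov_model_int4")
```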

optimum/commands/export/openvino.py

+9 −10

@@ -157,13 +157,12 @@ def run(self):
            )
            self.args.weight_format = "int8"

-        weight_format = self.args.weight_format or "fp32"
-
-        ov_config = None
-        if weight_format in {"fp16", "fp32"}:
-            ov_config = OVConfig(dtype=weight_format)
+        if self.args.weight_format is None:
+            ov_config = None
+        elif self.args.weight_format in {"fp16", "fp32"}:
+            ov_config = OVConfig(dtype=self.args.weight_format)
        else:
-            is_int8 = weight_format == "int8"
+            is_int8 = self.args.weight_format == "int8"

            # For int4 quantization if not parameter is provided, then use the default config if exist
            if (

@@ -182,12 +181,12 @@ def run(self):
                "group_size": -1 if is_int8 else self.args.group_size,
            }

-            if weight_format in {"int4_sym_g128", "int4_asym_g128", "int4_sym_g64", "int4_asym_g64"}:
+            if self.args.weight_format in {"int4_sym_g128", "int4_asym_g128", "int4_sym_g64", "int4_asym_g64"}:
                logger.warning(
-                    f"--weight-format {weight_format} is deprecated, possible choices are fp32, fp16, int8, int4"
+                    f"--weight-format {self.args.weight_format} is deprecated, possible choices are fp32, fp16, int8, int4"
                )
-                quantization_config["sym"] = "asym" not in weight_format
-                quantization_config["group_size"] = 128 if "128" in weight_format else 64
+                quantization_config["sym"] = "asym" not in self.args.weight_format
+                quantization_config["group_size"] = 128 if "128" in self.args.weight_format else 64
            ov_config = OVConfig(quantization_config=quantization_config)

        # TODO : add input shapes
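
To summarize what the refactor above does with the `--weight-format` flag, here is an illustrative standalone sketch of the same branching (plain arguments instead of `self.args`, a dict instead of `OVConfig`, and only the keys visible in this hunk; the real command sets additional fields):

```python
def sketch_ov_config(weight_format, group_size=None):
    # Mirrors the branching in the run() method above:
    # no flag -> no config; fp16/fp32 -> dtype only; otherwise weight quantization.
    if weight_format is None:
        return None
    if weight_format in {"fp16", "fp32"}:
        return {"dtype": weight_format}
    is_int8 = weight_format == "int8"
    quantization_config = {
        # only the key visible in this hunk; the real command builds a fuller dict
        "group_size": -1 if is_int8 else group_size,
    }
    if weight_format in {"int4_sym_g128", "int4_asym_g128", "int4_sym_g64", "int4_asym_g64"}:
        # deprecated aliases are translated into sym / group_size
        quantization_config["sym"] = "asym" not in weight_format
        quantization_config["group_size"] = 128 if "128" in weight_format else 64
    return {"quantization_config": quantization_config}


print(sketch_ov_config("int4_asym_g64"))  # -> {'quantization_config': {'group_size': 64, 'sym': False}}
```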

optimum/exporters/ipex/__init__.py

Whitespace-only changes.
+91

@@ -0,0 +1,91 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from transformers.models.llama.modeling_llama import (
+    LlamaAttention,
+    LlamaDecoderLayer,
+    LlamaForCausalLM,
+    LlamaModel,
+    LlamaRMSNorm,
+)
+
+from optimum.intel.utils.import_utils import is_ipex_version
+
+from .modeling_utils import (
+    _IPEXLlamaDecoderLayerRef,
+    _llama_attn_forward,
+    _llama_layer_norm_forward,
+    _llama_model_forward,
+)
+
+
+_IPEX_EXPORTED_ARCH = ("LlamaForCausalLM",)
+_IPEX_EXPORTED_TASK = ("text-generation",)
+
+
+def convert_func(m, func_name, new_function):
+    bound_method = new_function.__get__(m, m.__class__)
+    setattr(m, func_name, bound_method)
+
+
+def convert_functions(m, target_m, new_function_name, new_function):
+    for _, sub_m in m.named_children():
+        if isinstance(sub_m, target_m):
+            convert_func(sub_m, new_function_name, new_function)
+        convert_functions(sub_m, target_m, new_function_name, new_function)
+
+
+def convert_class(m, target_m, new_class, config, distributed=False):
+    for name, sub_m in m.named_children():
+        if isinstance(sub_m, target_m):
+            new_m = new_class(sub_m, config, distributed)
+            setattr(m, name, new_m)
+        convert_class(sub_m, target_m, new_class, config, distributed)
+
+
+def patch_op(m, target_m, new_op_name, new_op):
+    for name, sub_m in m.named_children():
+        if isinstance(sub_m, target_m):
+            setattr(sub_m, new_op_name, new_op)
+        patch_op(sub_m, target_m, new_op_name, new_op)
+
+
+def _patch_llama_model(model):
+    if is_ipex_version("<", "2.5.0"):
+        raise ImportError("Only ipex version > 2.3.0 supports RotaryEmbedding and IndirectAccessKVCache")
+
+    from intel_extension_for_pytorch.llm.modules import IndirectAccessKVCache, RotaryEmbedding
+
+    ipex_rope = RotaryEmbedding(
+        model.config.max_position_embeddings,
+        model.config.hidden_size // model.config.num_attention_heads,
+        model.config.rope_theta,
+        model.config.architectures[0],
+    )
+    ipex_scale_dot_product = IndirectAccessKVCache(text_max_length=model.config.max_position_embeddings)
+    patch_op(model, LlamaAttention, "ipex_rope", ipex_rope)
+    patch_op(model, LlamaAttention, "ipex_scale_dot_product", ipex_scale_dot_product)
+
+    convert_functions(model, LlamaModel, "forward", _llama_model_forward)
+    convert_functions(model, LlamaAttention, "forward", _llama_attn_forward)
+    convert_functions(model, LlamaRMSNorm, "forward", _llama_layer_norm_forward)
+
+    convert_class(model, LlamaDecoderLayer, _IPEXLlamaDecoderLayerRef, model.config)
+    return model
+
+
+def _patch_model(model):
+    if isinstance(model, LlamaForCausalLM):
+        model = _patch_llama_model(model)
+    return model
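
The patching helpers in this new file are generic module-tree rewrites. A toy, self-contained illustration of the same `convert_functions` pattern on a plain PyTorch model (nothing below is from the commit; the doubled `forward` is purely for demonstration):

```python
import torch
from torch import nn


def convert_func(m, func_name, new_function):
    # Bind new_function as an instance method named func_name on module m
    bound_method = new_function.__get__(m, m.__class__)
    setattr(m, func_name, bound_method)


def convert_functions(m, target_m, new_function_name, new_function):
    # Recursively rebind the method on every submodule of the target class
    for _, sub_m in m.named_children():
        if isinstance(sub_m, target_m):
            convert_func(sub_m, new_function_name, new_function)
        convert_functions(sub_m, target_m, new_function_name, new_function)


def doubled_forward(self, x):
    # Demo replacement: run the original nn.Linear forward, then double the result
    return 2 * nn.Linear.forward(self, x)


model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
convert_functions(model, nn.Linear, "forward", doubled_forward)
print(model(torch.ones(1, 4)).shape)  # the patched Linear layers now run doubled_forward
```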
