
Commit de581ea

Merge branch 'main' into jit
2 parents 248f0d2 + 72b0630

27 files changed: +699 −114 lines

.github/workflows/test_openvino.yml

+1 −1

@@ -32,7 +32,7 @@ jobs:
        python -m pip install --upgrade pip
        # install PyTorch CPU version to avoid installing CUDA packages on GitHub runner without GPU
        pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-       pip install .[openvino,openvino-tokenizers,nncf,tests,diffusers]
+       pip install .[openvino,openvino-tokenizers,tests,diffusers] onnxruntime
    - name: Test with Pytest
      run: |
        pytest tests/openvino/ --ignore test_modeling_basic

README.md

+4 −4

@@ -10,7 +10,7 @@

Intel [Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.

-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.


## Installation

@@ -20,7 +20,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
| Accelerator | Installation |
|:-----------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|
| [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"` |
-| [OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [OpenVINO](https://docs.openvino.ai) | `pip install --upgrade-strategy eager "optimum[openvino]"` |
| [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade-strategy eager "optimum[ipex]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -68,11 +68,11 @@ For more details on the supported compression techniques, please refer to the [d

## OpenVINO

-Below are the examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/latest/tmo_introduction.html) framework to accelerate inference.
+Below are examples of how to use OpenVINO and its [NNCF](https://docs.openvino.ai/2024/openvino-workflow/model-optimization-guide/compressing-models-during-training.html) framework to accelerate inference.

#### Export:

-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI :
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI :

```plain
optimum-cli export openvino --model gpt2 ov_model
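
For context on the export flow touched in this README hunk: once the command above has produced the `ov_model` IR directory, it can be loaded back through Optimum Intel's `OVModelForXxx` classes. A minimal sketch, not part of this commit (`gpt2` and `ov_model` are simply the names used in the README example):

```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

# Load the OpenVINO IR produced by `optimum-cli export openvino --model gpt2 ov_model`
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = OVModelForCausalLM.from_pretrained("ov_model")

inputs = tokenizer("OpenVINO is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```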

docs/source/index.mdx

+2 −2

@@ -21,7 +21,7 @@ limitations under the License.

[Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.

-[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
+[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.

<div class="mt-10">
<div class="w-full flex flex-col space-x-4 md:grid md:grid-cols-2 md:gap-x-5">

@@ -34,4 +34,4 @@ limitations under the License.
<p class="text-gray-700">Learn how to run inference with OpenVINO Runtime and to apply quantization, pruning and knowledge distillation on your model to further speed up inference.</p>
</a>
</div>
-</div>
+</div>

docs/source/inference.mdx

+6 −5

@@ -13,7 +13,8 @@ Optimum Intel can be used to load optimized models from the [Hugging Face Hub](h

## Transformers models

-You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices).
+You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors
+([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices).
For that, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.

As shown in the table below, each task is associated with a class enabling to automatically load your model.

@@ -33,7 +34,7 @@ As shown in the table below, each task is associated with a class enabling to au

### Export

-It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2023.1/openvino_ir.html) IR format with the CLI :
+It is possible to export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI :

```bash
optimum-cli export openvino --model gpt2 ov_model

@@ -182,7 +183,7 @@ model.reshape(1,128)
model.compile()
```

-To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/nightly/openvino_docs_install_guides_configurations_for_intel_gpu.html) about installing drivers for GPU inference).
+To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html) about installing drivers for GPU inference).

```python
# Static shapes speed up inference

@@ -471,15 +472,15 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0]
```


-## Latent Consistency Models
+### Latent Consistency Models


| Task | Auto Class |
|--------------------------------------|--------------------------------------|
| `text-to-image` | `OVLatentConsistencyModelPipeline` |


-### Text-to-Image
+#### Text-to-Image

Here is an example of how you can load a Latent Consistency Models (LCMs) from [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) and run inference using OpenVINO :
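The hunk above stops just before the LCM snippet itself. A hedged sketch of what loading `SimianLuo/LCM_Dreamshaper_v7` with the `OVLatentConsistencyModelPipeline` class from the table typically looks like (not the exact snippet from `inference.mdx`; the prompt, step count and guidance scale are illustrative):

```python
from optimum.intel import OVLatentConsistencyModelPipeline

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
pipeline = OVLatentConsistencyModelPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", export=True
)

prompt = "sailing ship in storm by Leonardo da Vinci"
# LCMs need only a handful of denoising steps
image = pipeline(prompt=prompt, num_inference_steps=4, guidance_scale=8.0).images[0]
image.save("ship.png")
```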

docs/source/installation.mdx

+2 −2

@@ -21,7 +21,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
| Accelerator | Installation |
|:-------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------|
| [Intel Neural Compressor (INC)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"`|
-| [Intel OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [Intel OpenVINO](https://docs.openvino.ai ) | `pip install --upgrade-strategy eager "optimum[openvino]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -42,4 +42,4 @@ or to install from source including dependencies:
python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
```

-where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
+where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.

docs/source/optimization_ov.mdx

+11 −1

@@ -82,7 +82,17 @@ from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(
    model_id,
-    export=True,
+    quantization_config=OVWeightQuantizationConfig(bits=4),
+)
+```
+
+You can tune quantization parameters to achieve a better performance accuracy trade-off as follows:
+
+```python
+from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
+
+model = OVModelForCausalLM.from_pretrained(
+    model_id,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
)
```
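
Putting the two snippets from this hunk together: a model quantized on load this way can be saved and reloaded like any other Optimum Intel model, so the 4-bit compression only has to run once. A small sketch under that assumption (`model_id` and the output directory name are placeholders, not values from the diff):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "gpt2"  # placeholder; any supported causal LM id works here
model = OVModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
)

# Persist the quantized IR so later runs can skip quantization entirely
model.save_pretrained("ov_model_int4")
model = OVModelForCausalLM.from_pretrained("ov_model_int4")
```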

optimum/commands/export/openvino.py

+9 −10

@@ -157,13 +157,12 @@ def run(self):
            )
            self.args.weight_format = "int8"

-        weight_format = self.args.weight_format or "fp32"
-
-        ov_config = None
-        if weight_format in {"fp16", "fp32"}:
-            ov_config = OVConfig(dtype=weight_format)
+        if self.args.weight_format is None:
+            ov_config = None
+        elif self.args.weight_format in {"fp16", "fp32"}:
+            ov_config = OVConfig(dtype=self.args.weight_format)
        else:
-            is_int8 = weight_format == "int8"
+            is_int8 = self.args.weight_format == "int8"

            # For int4 quantization if not parameter is provided, then use the default config if exist
            if (

@@ -182,12 +181,12 @@ def run(self):
                "group_size": -1 if is_int8 else self.args.group_size,
            }

-            if weight_format in {"int4_sym_g128", "int4_asym_g128", "int4_sym_g64", "int4_asym_g64"}:
+            if self.args.weight_format in {"int4_sym_g128", "int4_asym_g128", "int4_sym_g64", "int4_asym_g64"}:
                logger.warning(
-                    f"--weight-format {weight_format} is deprecated, possible choices are fp32, fp16, int8, int4"
+                    f"--weight-format {self.args.weight_format} is deprecated, possible choices are fp32, fp16, int8, int4"
                )
-                quantization_config["sym"] = "asym" not in weight_format
-                quantization_config["group_size"] = 128 if "128" in weight_format else 64
+                quantization_config["sym"] = "asym" not in self.args.weight_format
+                quantization_config["group_size"] = 128 if "128" in self.args.weight_format else 64
            ov_config = OVConfig(quantization_config=quantization_config)

        # TODO : add input shapes
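
To summarize what the refactor above does with the `--weight-format` flag, here is an illustrative standalone sketch of the same branching (plain arguments instead of `self.args`, a dict instead of `OVConfig`, and only the keys visible in this hunk; the real command sets additional fields):

```python
def sketch_ov_config(weight_format, group_size=None):
    # Mirrors the branching in the run() method above:
    # no flag -> no config; fp16/fp32 -> dtype only; otherwise weight quantization.
    if weight_format is None:
        return None
    if weight_format in {"fp16", "fp32"}:
        return {"dtype": weight_format}
    is_int8 = weight_format == "int8"
    quantization_config = {
        # only the key visible in this hunk; the real command builds a fuller dict
        "group_size": -1 if is_int8 else group_size,
    }
    if weight_format in {"int4_sym_g128", "int4_asym_g128", "int4_sym_g64", "int4_asym_g64"}:
        # deprecated aliases are translated into sym / group_size
        quantization_config["sym"] = "asym" not in weight_format
        quantization_config["group_size"] = 128 if "128" in weight_format else 64
    return {"quantization_config": quantization_config}


print(sketch_ov_config("int4_asym_g64"))  # -> {'quantization_config': {'group_size': 64, 'sym': False}}
```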

optimum/exporters/ipex/__init__.py

Whitespace-only changes.
+91

@@ -0,0 +1,91 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from transformers.models.llama.modeling_llama import (
+    LlamaAttention,
+    LlamaDecoderLayer,
+    LlamaForCausalLM,
+    LlamaModel,
+    LlamaRMSNorm,
+)
+
+from optimum.intel.utils.import_utils import is_ipex_version
+
+from .modeling_utils import (
+    _IPEXLlamaDecoderLayerRef,
+    _llama_attn_forward,
+    _llama_layer_norm_forward,
+    _llama_model_forward,
+)
+
+
+_IPEX_EXPORTED_ARCH = ("LlamaForCausalLM",)
+_IPEX_EXPORTED_TASK = ("text-generation",)
+
+
+def convert_func(m, func_name, new_function):
+    bound_method = new_function.__get__(m, m.__class__)
+    setattr(m, func_name, bound_method)
+
+
+def convert_functions(m, target_m, new_function_name, new_function):
+    for _, sub_m in m.named_children():
+        if isinstance(sub_m, target_m):
+            convert_func(sub_m, new_function_name, new_function)
+        convert_functions(sub_m, target_m, new_function_name, new_function)
+
+
+def convert_class(m, target_m, new_class, config, distributed=False):
+    for name, sub_m in m.named_children():
+        if isinstance(sub_m, target_m):
+            new_m = new_class(sub_m, config, distributed)
+            setattr(m, name, new_m)
+        convert_class(sub_m, target_m, new_class, config, distributed)
+
+
+def patch_op(m, target_m, new_op_name, new_op):
+    for name, sub_m in m.named_children():
+        if isinstance(sub_m, target_m):
+            setattr(sub_m, new_op_name, new_op)
+        patch_op(sub_m, target_m, new_op_name, new_op)
+
+
+def _patch_llama_model(model):
+    if is_ipex_version("<", "2.5.0"):
+        raise ImportError("Only ipex version > 2.3.0 supports RotaryEmbedding and IndirectAccessKVCache")
+
+    from intel_extension_for_pytorch.llm.modules import IndirectAccessKVCache, RotaryEmbedding
+
+    ipex_rope = RotaryEmbedding(
+        model.config.max_position_embeddings,
+        model.config.hidden_size // model.config.num_attention_heads,
+        model.config.rope_theta,
+        model.config.architectures[0],
+    )
+    ipex_scale_dot_product = IndirectAccessKVCache(text_max_length=model.config.max_position_embeddings)
+    patch_op(model, LlamaAttention, "ipex_rope", ipex_rope)
+    patch_op(model, LlamaAttention, "ipex_scale_dot_product", ipex_scale_dot_product)
+
+    convert_functions(model, LlamaModel, "forward", _llama_model_forward)
+    convert_functions(model, LlamaAttention, "forward", _llama_attn_forward)
+    convert_functions(model, LlamaRMSNorm, "forward", _llama_layer_norm_forward)
+
+    convert_class(model, LlamaDecoderLayer, _IPEXLlamaDecoderLayerRef, model.config)
+    return model
+
+
+def _patch_model(model):
+    if isinstance(model, LlamaForCausalLM):
+        model = _patch_llama_model(model)
+    return model
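
The patching helpers in this new file are generic module-tree rewrites. A toy, self-contained illustration of the same `convert_functions` pattern on a plain PyTorch model (nothing below is from the commit; the doubled `forward` is purely for demonstration):

```python
import torch
from torch import nn


def convert_func(m, func_name, new_function):
    # Bind new_function as an instance method named func_name on module m
    bound_method = new_function.__get__(m, m.__class__)
    setattr(m, func_name, bound_method)


def convert_functions(m, target_m, new_function_name, new_function):
    # Recursively rebind the method on every submodule of the target class
    for _, sub_m in m.named_children():
        if isinstance(sub_m, target_m):
            convert_func(sub_m, new_function_name, new_function)
        convert_functions(sub_m, target_m, new_function_name, new_function)


def doubled_forward(self, x):
    # Demo replacement: run the original nn.Linear forward, then double the result
    return 2 * nn.Linear.forward(self, x)


model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))
convert_functions(model, nn.Linear, "forward", doubled_forward)
print(model(torch.ones(1, 4)).shape)  # the patched Linear layers now run doubled_forward
```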
