
Commit 59a1f81

merge from main branch
Signed-off-by: Cheng, Penghui <penghui.cheng@intel.com>

2 parents: de190fd + 8f7d016

53 files changed: +3276 −941 lines changed. Only part of this large commit's diff is shown below; the remaining files are hidden by default.

.github/workflows/test_openvino.yml

+2 −8

@@ -17,7 +17,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.8, 3.9]
+        python-version: [3.8, 3.11]
         os: [ubuntu-latest]
 
     runs-on: ${{ matrix.os }}
@@ -32,13 +32,7 @@ jobs:
           python -m pip install --upgrade pip
           # install PyTorch CPU version to avoid installing CUDA packages on GitHub runner without GPU
           pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
-          pip install .[openvino,nncf,tests,diffusers]
+          pip install .[openvino,openvino-tokenizers,nncf,tests,diffusers]
       - name: Test with Pytest
         run: |
           pytest tests/openvino/ --ignore test_modeling_basic
-      - name: Test openvino-nightly import
-        run: |
-          pip uninstall -y openvino
-          pip install openvino-nightly
-          python -c "from optimum.intel import OVModelForCausalLM; OVModelForCausalLM.from_pretrained('hf-internal-testing/tiny-random-gpt2', export=True, compile=False)"

README.md

+6 −3

@@ -6,6 +6,8 @@
 
 🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.
 
+[Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) is an open-source library which provides optimizations for both eager mode and graph mode, however, compared to eager mode, graph mode in PyTorch* normally yields better performance from optimization techniques, such as operation fusion.
+
 Intel [Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the usage of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies in order for users to easily generate quantized model. The users can easily apply static, dynamic and aware-training quantization approaches while giving an expected accuracy criteria. It also supports different weight pruning techniques enabling the creation of pruned model giving a predefined sparsity target.
 
 [OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
@@ -19,6 +21,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
 |:-----------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|
 | [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"` |
 | [OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
+| [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade-strategy eager "optimum[ipex]"` |
 
 The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.
 
@@ -37,7 +40,7 @@ or to install from source including dependencies:
 python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
 ```
 
-where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
+where `extras` can be one or more of `ipex`, `neural-compressor`, `openvino`, `nncf`.
 
 # Quick tour
 
@@ -75,10 +78,10 @@ It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2
 optimum-cli export openvino --model gpt2 ov_model
 ```
 
-If you add `--int8`, the model linear and embedding weights will be quantized to INT8, the activations will be kept in floating point precision.
+You can also apply 8-bit weight-only quantization when exporting your model : the model linear and embedding weights will be quantized to INT8, the activations will be kept in floating point precision.
 
 ```plain
-optimum-cli export openvino --model gpt2 --int8 ov_model
+optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
 ```
 
 To apply quantization on both weights and activations, you can find more information in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov).

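As a quick check of the new `--weight-format int8` export flow above, the `ov_model` directory produced by the CLI can be loaded back and used for generation. This is an editorial sketch, not part of the commit; it assumes the source model was `gpt2`, as in the README example.

```python
# Editorial sketch (not from this commit): load the INT8-exported IR written to
# `ov_model` by `optimum-cli export openvino --model gpt2 --weight-format int8 ov_model`
# and run a short generation with the original gpt2 tokenizer.
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model = OVModelForCausalLM.from_pretrained("ov_model")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("OpenVINO weight-only quantization", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```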
docs/source/inference.mdx

+27 −13

@@ -50,19 +50,19 @@ optimum-cli export openvino --model local_path --task text-generation-with-past
 Once the model is exported, you can load the OpenVINO model using :
 
 ```python
-from optimum.intel import AutoModelForCausalLM
+from optimum.intel import OVModelForCausalLM
 
-model_id = "helenai/gpt2-ov"
-model = AutoModelForCausalLM.from_pretrained(model_id)
+model_id = "ov_model"
+model = OVModelForCausalLM.from_pretrained(model_id)
 ```
 
 You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model.
 
 ```python
-from optimum.intel import AutoModelForCausalLM
+from optimum.intel import OVModelForCausalLM
 
 model_id = "gpt2"
-model = AutoModelForCausalLM.from_pretrained(model_id, export=True)
+model = OVModelForCausalLM.from_pretrained(model_id, export=True)
 model.save_pretrained("ov_model")
 ```
 
@@ -94,15 +94,15 @@ model.save_pretrained(save_directory)
 tokenizer.save_pretrained(save_directory)
 ```
 
-### Weight only quantization
+### Weight-only quantization
 
-You can also apply INT8 quantization on your models weights when exporting your model with the CLI:
+You can also apply 8-bit or 4-bit weight quantization when exporting your model with the CLI:
 
 ```bash
-optimum-cli export openvino --model gpt2 --int8 ov_model
+optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
 ```
 
-This will results in the exported model linear and embedding layers to be quantized to INT8, the activations will be kept in floating point precision.
+This will result in the exported model linear and embedding layers to be quantized to INT8 or INT4, the activations will be kept in floating point precision. This type of optimization allows reducing the footprint and latency of LLMs.
 
 This can also be done when loading your model by setting the `load_in_8bit` argument when calling the `from_pretrained()` method.
 
@@ -112,6 +112,21 @@ from optimum.intel import OVModelForCausalLM
 model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
 ```
 
+> **NOTE:** `load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+
+There are also alternative compression options for a different performance-accuracy trade-off:
+
+| Option | Description |
+|---------------------------------------------------------------------|-------------------|
+| `fp16` | Float16 weights |
+| `int8` | INT8 weights |
+| `int4_sym_g128`, `int4_asym_g128`, `int4_sym_g64`, `int4_asym_g64`* | INT4 weights |
+
+*`sym` and `asym` stand for symmetric and asymmetric quantization, `g128` and `g64` means the group size `128` and `64` respectively.
+
+`--ratio` CLI parameter controls the ratio between 4-bit and 8-bit quantized layers and can also change performance-accuracy trade-off for the optimized model. It is valid only for INT4 quantization options.
+
+
 To apply quantization on both weights and activations, you can use the `OVQuantizer`, more information in the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov#optimization).
 
 ### Static shape
@@ -186,11 +201,10 @@ It is possible to pass an `ov_config` parameter to `from_pretrained()` with cust
 model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config={"INFERENCE_PRECISION_HINT":"f32"})
 ```
 
-Optimum Intel leverages OpenVINO's model caching to speed up model compiling. By default a `model_cache` directory is created in the model's directory in the [Hugging Face Hub cache](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache). To override this, use the ov_config parameter and set `CACHE_DIR` to a different value. To disable model caching, set `CACHE_DIR` to an empty string.
-
+Optimum Intel leverages OpenVINO's model caching to speed up model compiling on GPU. By default a `model_cache` directory is created in the model's directory in the [Hugging Face Hub cache](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache). To override this, use the ov_config parameter and set `CACHE_DIR` to a different value. To disable model caching on GPU, set `CACHE_DIR` to an empty string.
 
 ```python
-model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config={"CACHE_DIR":""})
+model = OVModelForSequenceClassification.from_pretrained(model_id, device="GPU", ov_config={"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR":""})
 ```
 
 ### Sequence-to-sequence models
@@ -258,7 +272,7 @@ prompt = "sailing ship in storm by Rembrandt"
 images = pipeline(prompt).images
 ```
 
-To load your PyTorch model and convert it to OpenVINO on-the-fly, you can set `export=True`.
+To load your PyTorch model and convert it to OpenVINO on the fly, you can set `export=True`.
 
 ```python
 model_id = "runwayml/stable-diffusion-v1-5"

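To make the `sym`/`asym` and group-size terminology in the new compression-options table concrete, here is a small numeric illustration. It is an editorial sketch only, not code from the commit, and it simplifies what NNCF and OpenVINO actually do (real kernels keep packed INT4 values plus per-group scales rather than dequantizing up front).

```python
# Editorial sketch (not from this commit): what int4_sym_g128 / int4_asym_g128
# mean numerically. Weights are split into groups of 128 values and each group
# gets its own scale (plus a zero point in the asymmetric variant).
import numpy as np


def quantize_group_sym(w: np.ndarray) -> np.ndarray:
    """Symmetric INT4: one scale per group, integer levels in [-8, 7], no zero point."""
    scale = np.abs(w).max() / 7  # map the largest magnitude onto the INT4 range
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale  # dequantized values, for comparison with the originals


def quantize_group_asym(w: np.ndarray) -> np.ndarray:
    """Asymmetric INT4: per-group scale and zero point, integer levels in [0, 15]."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 15
    zero_point = round(-lo / scale)
    q = np.clip(np.round(w / scale) + zero_point, 0, 15)
    return (q - zero_point) * scale


rng = np.random.default_rng(0)
weights = rng.normal(size=512).astype(np.float32)
group_size = 128  # the g128 suffix; the *_g64 options use groups of 64

groups = weights.reshape(-1, group_size)
sym = np.concatenate([quantize_group_sym(g) for g in groups])
asym = np.concatenate([quantize_group_asym(g) for g in groups])
print("mean abs error, symmetric: ", np.abs(weights - sym).mean())
print("mean abs error, asymmetric:", np.abs(weights - asym).mean())
```

The `--ratio` option described above then only decides how many layers receive this INT4 treatment, with the remainder kept in INT8.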
docs/source/optimization_ov.mdx

+30 −0

@@ -62,6 +62,36 @@ tokenizer.save_pretrained(save_dir)
 
 The `quantize()` method applies post-training static quantization and export the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
 
+## Weight-only quantization
+
+You can optimize the performance of text-generation LLMs by quantizing weights to various precisions that provide different performance-accuracy trade-offs.
+
+```python
+from optimum.intel import OVModelForCausalLM
+
+model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
+```
+
+> **NOTE:** `load_in_8bit` is enabled by default for models larger than 1 billion parameters.
+
+For the 4-bit weight quantization we recommend using the NNCF API like below:
+```python
+from optimum.intel import OVModelForCausalLM
+import nncf
+
+model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=False)
+model.model = nncf.compress_weights(
+    model.model,
+    mode=nncf.CompressWeightsMode.INT4_SYM,
+    ratio=0.8,
+    group_size=128,
+)
+model.save_pretrained("compressed_model")
+```
+
+For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/CompressWeights.md).
+
+
 ## Training-time optimization
 
 Apart from optimizing a model after training like post-training quantization above, `optimum.openvino` also provides optimization methods during training, namely Quantization-Aware Training (QAT) and Joint Pruning, Quantization and Distillation (JPQD).

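As a follow-up to the new weight-only quantization section above, the directory written by `save_pretrained("compressed_model")` can be reloaded like any other exported model. This is an editorial sketch, not part of the commit; it assumes a tokenizer was saved alongside the model (otherwise load it from the original `model_id`).

```python
# Editorial sketch (not from this commit): reload the 4-bit-compressed model
# produced by the nncf.compress_weights() snippet above and run it through a
# transformers text-generation pipeline.
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForCausalLM

save_dir = "compressed_model"                        # directory written by save_pretrained()
model = OVModelForCausalLM.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)  # assumes the tokenizer was saved here too

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generator("Weight-only quantization lets large models", max_new_tokens=20))
```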
docs/source/reference_inc.mdx

+2 −2

@@ -43,8 +43,8 @@ specific language governing permissions and limitations under the License.
 
 ## INCModelForCausalLM
 
-[[autodoc]] neural_compressor.modeling_decoder.INCModelForCausalLM
+[[autodoc]] neural_compressor.modeling_base.INCModelForCausalLM
 
 ## INCModelForSeq2SeqLM
 
-[[autodoc]] neural_compressor.modeling_base.INCModelForSeq2SeqLM
+[[autodoc]] neural_compressor.modeling_base.INCModelForSeq2SeqLM

examples/openvino/stable-diffusion/requirements.txt

+1 −1

@@ -2,4 +2,4 @@ accelerate
 diffusers
 torch~=1.13
 nncf @ git+https://github.com/openvinotoolkit/nncf.git
-tomesd @ git+https://github.com/AlexKoff88/tomesd/tree/openvino
+tomesd @ git+https://github.com/AlexKoff88/tomesd.git@openvino

examples/openvino/stable-diffusion/train_text_to_image_qat.py

+9 −58

@@ -19,7 +19,6 @@
 import math
 import os
 import random
-import tempfile
 from copy import deepcopy
 from functools import partial
 from io import BytesIO
@@ -34,7 +33,7 @@
 import torch.utils.checkpoint
 from accelerate import Accelerator
 from accelerate.logging import get_logger
-from accelerate.utils import set_seed
+from accelerate.utils import ProjectConfiguration, set_seed
 from datasets import load_dataset
 from diffusers import DDIMScheduler, DDPMScheduler, DiffusionPipeline, LMSDiscreteScheduler, StableDiffusionPipeline
 from diffusers.optimization import get_scheduler
@@ -44,20 +43,12 @@
 from nncf.torch import create_compressed_model, register_default_init_args
 from nncf.torch.initialization import PTInitializingDataLoader
 from nncf.torch.layer_utils import CompressionParameter
-from openvino._offline_transformations import apply_moc_transformations, compress_quantize_weights_transformation
 from PIL import Image
 from requests.packages.urllib3.exceptions import InsecureRequestWarning
 from torchvision import transforms
 from tqdm import tqdm
 
-from optimum.exporters.onnx import export_models, get_stable_diffusion_models_for_export
-from optimum.intel import OVStableDiffusionPipeline
-from optimum.utils import (
-    DIFFUSION_MODEL_TEXT_ENCODER_SUBFOLDER,
-    DIFFUSION_MODEL_UNET_SUBFOLDER,
-    DIFFUSION_MODEL_VAE_DECODER_SUBFOLDER,
-    DIFFUSION_MODEL_VAE_ENCODER_SUBFOLDER,
-)
+from optimum.exporters.openvino import export_from_model
 
 
 requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
@@ -583,47 +574,6 @@ def get_noise_scheduler(args):
     return noise_scheduler
 
 
-def export_to_onnx(pipeline, save_dir):
-    unet = pipeline.unet
-    vae = pipeline.vae
-    text_encoder = pipeline.text_encoder
-
-    unet.eval().cpu()
-    vae.eval().cpu()
-    text_encoder.eval().cpu()
-
-    ONNX_WEIGHTS_NAME = "model.onnx"
-
-    output_names = [
-        os.path.join(DIFFUSION_MODEL_TEXT_ENCODER_SUBFOLDER, ONNX_WEIGHTS_NAME),
-        os.path.join(DIFFUSION_MODEL_UNET_SUBFOLDER, ONNX_WEIGHTS_NAME),
-        os.path.join(DIFFUSION_MODEL_VAE_ENCODER_SUBFOLDER, ONNX_WEIGHTS_NAME),
-        os.path.join(DIFFUSION_MODEL_VAE_DECODER_SUBFOLDER, ONNX_WEIGHTS_NAME),
-    ]
-
-    with torch.no_grad():
-        models_and_onnx_configs = get_stable_diffusion_models_for_export(pipeline)
-        pipeline.save_config(save_dir)
-        export_models(
-            models_and_onnx_configs=models_and_onnx_configs, output_dir=Path(save_dir), output_names=output_names
-        )
-
-
-def export_to_openvino(pipeline, onnx_dir, save_dir):
-    ov_pipe = OVStableDiffusionPipeline.from_pretrained(
-        model_id=onnx_dir,
-        from_onnx=True,
-        model_save_dir=save_dir,
-        tokenizer=pipeline.tokenizer,
-        scheduler=pipeline.scheduler,
-        feature_extractor=pipeline.feature_extractor,
-        compile=False,
-    )
-    apply_moc_transformations(ov_pipe.unet.model, cf=False)
-    compress_quantize_weights_transformation(ov_pipe.unet.model)
-    ov_pipe.save_pretrained(save_dir)
-
-
 class UnetInitDataset(torch.utils.data.Dataset):
     def __init__(self, data):
         super().__init__()
@@ -700,7 +650,7 @@ def get_nncf_config(pipeline, dataloader, args):
         "ignored_scopes": [
             "{re}.*__add___[0-2]",
             "{re}.*layer_norm_0",
-            "{re}.*Attention.*/bmm_0",
+            # "{re}.*Attention.*/bmm_0",
             "{re}.*__truediv__*",
             "{re}.*group_norm_0",
             "{re}.*mul___[0-2]",
@@ -771,11 +721,13 @@ def main():
 
     logging_dir = os.path.join(args.output_dir, args.logging_dir)
 
+    accelerator_project_config = ProjectConfiguration(project_dir=args.output_dir, logging_dir=logging_dir)
+
     accelerator = Accelerator(
         gradient_accumulation_steps=args.gradient_accumulation_steps,
         mixed_precision=args.mixed_precision,
         log_with=args.report_to,
-        logging_dir=logging_dir,
+        project_config=accelerator_project_config,
     )
 
     logging.basicConfig(
@@ -922,7 +874,7 @@ def tokenize_captions(examples, is_train=True):
 
     with accelerator.main_process_first():
         if args.max_train_samples is not None:
-            dataset["train"] = dataset["train"].shuffle(seed=42, buffer_size=args.max_train_samples)
+            dataset["train"] = dataset["train"].shuffle(seed=42).select(range(args.max_train_samples))
         # Set the training transforms
         train_dataset = dataset["train"]
 
@@ -1132,9 +1084,8 @@ def collate_fn(examples):
         feature_extractor=pipeline.feature_extractor,
     )
 
-    with tempfile.TemporaryDirectory() as tmpdirname:
-        export_to_onnx(export_pipeline, tmpdirname)
-        export_to_openvino(export_pipeline, tmpdirname, Path(args.output_dir) / "openvino")
+    save_directory = Path(args.output_dir) / "openvino"
+    export_from_model(export_pipeline, output=save_directory, task="stable-diffusion")
 
 
 if __name__ == "__main__":

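The training example above now exports the fine-tuned pipeline directly to OpenVINO IR via `export_from_model` instead of taking an ONNX detour. Below is a minimal standalone sketch of that call pattern; it is not part of the commit, and the model id and output path are placeholders.

```python
# Editorial sketch (not from this commit): export a diffusers pipeline straight
# to OpenVINO IR with the export_from_model helper used in the training script.
from pathlib import Path

from diffusers import StableDiffusionPipeline
from optimum.exporters.openvino import export_from_model

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
save_directory = Path("openvino_sd")  # placeholder output directory
export_from_model(pipeline, output=save_directory, task="stable-diffusion")
```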
optimum/commands/export/openvino.py

+19 −1

@@ -92,6 +92,22 @@ def parse_args_openvino(parser: "ArgumentParser"):
             "precision (by default 20%% in INT8). This helps to achieve better accuracy after weight compression."
         ),
     )
+    optional_group.add_argument(
+        "--disable-stateful",
+        action="store_true",
+        help=(
+            "Disable stateful converted models, stateless models will be generated instead. Stateful models are produced by default when this key is not used. "
+            "In stateful models all kv-cache inputs and outputs are hidden in the model and are not exposed as model inputs and outputs. "
+            "If --disable-stateful option is used, it may result in sub-optimal inference performance. "
+            "Use it when you intentionally want to use a stateless model, for example, to be compatible with existing "
+            "OpenVINO native inference code that expects kv-cache inputs and outputs in the model."
+        ),
+    )
+    optional_group.add_argument(
+        "--convert-tokenizer",
+        action="store_true",
+        help="Add converted tokenizer and detokenizer with OpenVINO Tokenizers",
+    )
 
 
 class OVExportCommand(BaseOptimumCLICommand):
@@ -138,6 +154,8 @@ def run(self):
             trust_remote_code=self.args.trust_remote_code,
             pad_token_id=self.args.pad_token_id,
             compression_option=self.args.weight_format,
-            compression_ratio=self.args.ratio
+            compression_ratio=self.args.ratio,
+            stateful=not self.args.disable_stateful,
+            convert_tokenizer=self.args.convert_tokenizer,
             # **input_shapes,
         )
+2 −1

@@ -1,5 +1,6 @@
 from .__main__ import main_export
-from .convert import export, export_models, export_pytorch_via_onnx
+from .convert import export, export_from_model, export_models, export_pytorch_via_onnx
+from .stateful import ensure_stateful_is_available, patch_stateful
 
 
 __all__ = ["main_export", "export", "export_models"]

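The new `--disable-stateful` flag added above controls whether the exported decoder keeps its KV cache inside the model. One quick way to see the difference is to inspect the exported IR's inputs. This is an editorial sketch, not part of the commit; `openvino_model.xml` is the file name Optimum Intel writes on export, and the exact input names depend on the model.

```python
# Editorial sketch (not from this commit): list the inputs of an exported model.
# A stateful export hides the kv-cache, so no past_key_values.* inputs appear;
# an export made with --disable-stateful exposes them explicitly.
from openvino.runtime import Core

core = Core()
model = core.read_model("ov_model/openvino_model.xml")  # path from the CLI examples above
for model_input in model.inputs:
    print(model_input.get_any_name(), model_input.get_partial_shape())
```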