From 8239db78af4487b4efd2c621bf8e2aac19007afe Mon Sep 17 00:00:00 2001
From: Ella Charlaix
Date: Tue, 12 Mar 2024 15:18:26 +0100
Subject: [PATCH 1/7] Improve documentation

---
 docs/source/inference.mdx       |  17 +++-
 docs/source/optimization_ov.mdx | 141 +++++++++++++++++++-------------
 2 files changed, 99 insertions(+), 59 deletions(-)

diff --git a/docs/source/inference.mdx b/docs/source/inference.mdx
index 65480c1d2f..092c881420 100644
--- a/docs/source/inference.mdx
+++ b/docs/source/inference.mdx
@@ -99,13 +99,22 @@ tokenizer.save_pretrained(save_directory)

 ### Weight-only quantization

-You can also apply 8-bit or 4-bit weight quantization when exporting your model with the CLI by setting the `weight-format` argument to respectively `int8` or `int4`:
+You can also apply fp16, 8-bit or 4-bit weight quantization on the linear and embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:

 ```bash
 optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
 ```

-This will result in the exported model linear and embedding layers to be quantized to INT8 or INT4, the activations will be kept in floating point precision. This type of optimization allows reducing the footprint and latency of LLMs.
+This type of optimization allows to reduce the memory footprint and inference latency.
+
+
+| `--weight-format` |
+|-------------------|
+| `fp32` |
+| `fp16` |
+| `int8` |
+| `int4` |
+

 By default the quantization scheme will be [assymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`.

@@ -113,7 +122,7 @@ For INT4 quantization you can also specify the following arguments :
 * The `--group-size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization.
 * The `--ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.

-Smaller `group_size` and `ratio` of usually improve accuracy at the sacrifice of the model size and inference latency.
+Smaller `group_size` and `ratio` values usually improve accuracy, at the cost of a larger model size and higher inference latency.

 You can also apply 8-bit quantization on your model's weight when loading your model by setting the `load_in_8bit=True` argument when calling the `from_pretrained()` method.

@@ -125,7 +134,7 @@ model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)

-`load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
+`load_in_8bit` is enabled by default for models larger than 1 billion parameters. You can disable it with `load_in_8bit=False`.

diff --git a/docs/source/optimization_ov.mdx b/docs/source/optimization_ov.mdx
index 70c98f14f7..e21f45f6aa 100644
--- a/docs/source/optimization_ov.mdx
+++ b/docs/source/optimization_ov.mdx
@@ -19,15 +19,72 @@ limitations under the License.

 🤗 Optimum Intel provides an `openvino` package that enables you to apply a variety of model compression methods such as quantization, pruning, on many models hosted on the 🤗 hub using the [NNCF](https://docs.openvino.ai/2022.1/docs_nncf_introduction.html) framework.
-## Post-training optimization
+## Post-training

-Post-training static quantization introduces an additional calibration step where data is fed through the network in order to compute the activations quantization parameters.
-Here is how to apply static quantization on a fine-tuned DistilBERT:
+Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and/or the activations with lower-precision data types like 8-bit or 4-bit.
+
+### Weight-only quantization
+
+Quantization can be applied on the model's linear and embedding layers, enabling the loading of large models on memory-limited devices. For example, when applying 8-bit quantization, the resulting model will be x4 smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach x8, but is closer to x6 in practice.
+
+
+#### 8-bit
+
+For 8-bit weight quantization, you can set `load_in_8bit=True` to load your model's weights in 8-bit:

 ```python
-from functools import partial
-from transformers import AutoTokenizer
-from optimum.intel import OVConfig, OVQuantizer, OVModelForSequenceClassification,
+from optimum.intel import OVModelForCausalLM
+
+model_id = "helenai/gpt2-ov"
+model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
+
+# Saves the int8 model that will be x4 smaller than its fp32 counterpart
+saving_directory = "gpt2_int8"  # path where the quantized model will be saved
+model.save_pretrained(saving_directory)
+```
+
+
+
+`load_in_8bit` is enabled by default for models larger than 1 billion parameters. You can disable it with `load_in_8bit=False`.
+
+
+
+You can also provide a `quantization_config` instead, in order to specify additional optimization parameters.
+
+#### 4-bit
+
+For 4-bit weight quantization, you need a `quantization_config` to define the optimization parameters, for example:
+
+```python
+from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
+
+quantization_config = OVWeightQuantizationConfig(bits=4)
+model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
+```
+
+You can tune quantization parameters to achieve a better performance accuracy trade-off as follows:
+
+```python
+quantization_config = OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb")
+```
+
+By default the quantization scheme will be [assymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `sym=True`.
+
+For 4-bit quantization you can also specify the following arguments in the quantization configuration:
+* The `group_size` parameter defines the group size to use for quantization; setting it to `-1` results in per-column quantization.
+* The `ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, it means that 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.
+
+Smaller `group_size` and `ratio` values usually improve accuracy, at the cost of a larger model size and higher inference latency.
+
+### Static quantization
+
+When applying post-training static quantization, both the weights and the activations are quantized.
+To quantize the activations, an additional calibration step is needed, which consists in feeding a `calibration_dataset` to the network in order to estimate the quantization parameters of the activations.
+
+Here is how to apply static quantization on a fine-tuned DistilBERT given your own `calibration_dataset`:
+
+```python
+from transformers import AutoTokenizer
+from optimum.intel import OVQuantizer, OVModelForSequenceClassification

 model_id = "distilbert-base-uncased-finetuned-sst-2-english"
 model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
@@ -35,11 +92,22 @@ tokenizer = AutoTokenizer.from_pretrained(model_id)
 # The directory where the quantized model will be saved
 save_dir = "ptq_model"

+quantizer = OVQuantizer.from_pretrained(model)
+
+# Apply static quantization and export the resulting quantized model to OpenVINO IR format
+quantizer.quantize(calibration_dataset=calibration_dataset, save_directory=save_dir)
+# Save the tokenizer
+tokenizer.save_pretrained(save_dir)
+```
+
+The calibration dataset can also be generated easily using your OVQuantizer:
+
+```python
+from functools import partial
+
 def preprocess_function(examples, tokenizer):
     return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

-# Instantiate our OVQuantizer using the desired configuration
-quantizer = OVQuantizer.from_pretrained(model)
 # Create the calibration dataset used to perform static quantization
 calibration_dataset = quantizer.get_calibration_dataset(
     "glue",
@@ -48,33 +116,23 @@ calibration_dataset = quantizer.get_calibration_dataset(
     num_samples=300,
     dataset_split="train",
 )
-# Apply static quantization and export the resulting quantized model to OpenVINO IR format
-quantizer.quantize(
-    calibration_dataset=calibration_dataset,
-    save_directory=save_dir,
-)
-# Save the tokenizer
-tokenizer.save_pretrained(save_dir)
 ```

-The `quantize()` method applies post-training static quantization and export the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
-## Weight-only quantization
+The `quantize()` method applies post-training static quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.

-You can optimize the performance of text-generation LLMs by quantizing weights to various precisions that provide different performance-accuracy trade-offs.
-```python
-from optimum.intel import OVModelForCausalLM
+### Hybrid quantization

-model = OVModelForCausalLM.from_pretrained(model_id, load_in_8bit=True)
-```
+Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion (SD) models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of activations is comparable to weights.
+The U-Net component takes up most of the overall execution time of the pipeline. Thus, optimizing just this one component can bring substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial accuracy degradation.
+Therefore, the proposal is to apply quantization in *hybrid mode* for the U-Net model and weight-only quantization for the rest of the pipeline components:
+* U-Net: quantization applied to both the weights and activations
+* The text encoder, VAE encoder / decoder: quantization applied to the weights
+
+The hybrid mode involves the quantization of weights in MatMul and Embedding layers, and activations of other layers, facilitating accuracy preservation post-optimization while reducing the model size.
-Therefore, the proposal is to apply quantization in *hybrid mode* for the UNet model and weight-only quantization for the rest of the pipeline components. The hybrid mode involves the quantization of weights in MatMul and Embedding layers, and activations of other layers, facilitating accuracy preservation post-optimization while reducing the model size.
-The `quantization_config` is utilized to define optimization parameters for optimizing the Stable Diffusion pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. Otherwise, weight-only quantization to a specified data type (8 tr 4 bits) is applied to UNet model.
+The `quantization_config` is used to define the optimization parameters for the SD pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. If the dataset is not defined, weight-only quantization will be applied on all components.

 ```python
 from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig
@@ -86,38 +144,11 @@ model = OVStableDiffusionPipeline.from_pretrained(
     model_id,
     quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
 )
 ```
-
-
-`load_in_8bit` is enabled by default for the models larger than 1 billion parameters.
-
-
-
-For the 4-bit weight quantization you can use the `quantization_config` to specify the optimization parameters, for example:
-
-```python
-from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
-
-model = OVModelForCausalLM.from_pretrained(
-    model_id,
-    quantization_config=OVWeightQuantizationConfig(bits=4),
-)
-```
-
-You can tune quantization parameters to achieve a better performance accuracy trade-off as follows:
-
-```python
-from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
-
-model = OVModelForCausalLM.from_pretrained(
-    model_id,
-    quantization_config=OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb"),
-)
-```

 For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/CompressWeights.md).
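+
+The hybrid-quantized pipeline can also be saved with `save_pretrained()` and reloaded later, so that quantization does not need to be re-applied on the next load. A minimal example (the directory name and prompt are purely illustrative):
+
+```python
+from optimum.intel import OVStableDiffusionPipeline
+
+# Save the quantized pipeline created above (illustrative directory name)
+model.save_pretrained("stable-diffusion-hybrid-int8")
+
+# Reload the already-quantized pipeline and run inference as usual
+pipeline = OVStableDiffusionPipeline.from_pretrained("stable-diffusion-hybrid-int8")
+image = pipeline("sailing ship in storm by Rembrandt").images[0]
+```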
-## Training-time optimization
+## Training-time

 Apart from optimizing a model after training like post-training quantization above, `optimum.openvino` also provides optimization methods during training, namely Quantization-Aware Training (QAT) and Joint Pruning, Quantization and Distillation (JPQD).

From 52b11db1808b55b4c396889f547325f012a82ff9 Mon Sep 17 00:00:00 2001
From: Ella Charlaix
Date: Tue, 12 Mar 2024 15:21:42 +0100
Subject: [PATCH 2/7] small fix

---
 docs/source/optimization_ov.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/optimization_ov.mdx b/docs/source/optimization_ov.mdx
index e21f45f6aa..4c4db6f6ff 100644
--- a/docs/source/optimization_ov.mdx
+++ b/docs/source/optimization_ov.mdx
@@ -100,7 +100,7 @@ quantizer.quantize(calibration_dataset=calibration_dataset, save_directory=save_
 tokenizer.save_pretrained(save_dir)
 ```

-The calibration dataset can also be generated easily using your OVQuantizer:
+The calibration dataset can also be created easily using your `OVQuantizer`:

 ```python
 from functools import partial

From 83dbe2fbd4f1a73a3bb380c1bff35da2e7d31579 Mon Sep 17 00:00:00 2001
From: Ella Charlaix
Date: Tue, 12 Mar 2024 15:41:27 +0100
Subject: [PATCH 3/7] remove

---
 docs/source/inference.mdx | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/docs/source/inference.mdx b/docs/source/inference.mdx
index 092c881420..12890d6b13 100644
--- a/docs/source/inference.mdx
+++ b/docs/source/inference.mdx
@@ -108,14 +108,6 @@ optimum-cli export openvino --model gpt2 --weight-format int8 ov_model

 This type of optimization allows to reduce the memory footprint and inference latency.

-| `--weight-format` |
-|-------------------|
-| `fp32` |
-| `fp16` |
-| `int8` |
-| `int4` |
-
-
 By default the quantization scheme will be [assymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`.

 For INT4 quantization you can also specify the following arguments :

From 027c3704b7bcf449051bf50ecf33a86433138a7e Mon Sep 17 00:00:00 2001
From: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>
Date: Wed, 13 Mar 2024 12:29:46 +0100
Subject: [PATCH 4/7] Update docs/source/optimization_ov.mdx

Co-authored-by: Alexander Kozlov
---
 docs/source/optimization_ov.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/optimization_ov.mdx b/docs/source/optimization_ov.mdx
index 4c4db6f6ff..77dd208a25 100644
--- a/docs/source/optimization_ov.mdx
+++ b/docs/source/optimization_ov.mdx
@@ -67,7 +67,7 @@ You can tune quantization parameters to achieve a better performance accuracy tr
 quantization_config = OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, dataset="ptb")
 ```

-By default the quantization scheme will be [assymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `sym=True`.
+By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `sym=True`.

 For 4-bit quantization you can also specify the following arguments in the quantization configuration:
 * The `group_size` parameter defines the group size to use for quantization; setting it to `-1` results in per-column quantization.

From afc23d0e0e2034954e1da58a8305763021e39d52 Mon Sep 17 00:00:00 2001
From: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>
Date: Wed, 13 Mar 2024 12:33:30 +0100
Subject: [PATCH 5/7] Update docs/source/inference.mdx

Co-authored-by: Helena Kloosterman
---
 docs/source/inference.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/inference.mdx b/docs/source/inference.mdx
index 12890d6b13..ac987df1d4 100644
--- a/docs/source/inference.mdx
+++ b/docs/source/inference.mdx
@@ -99,7 +99,7 @@ tokenizer.save_pretrained(save_directory)

 ### Weight-only quantization

-You can also apply fp16, 8-bit or 4-bit weight quantization on the linear and embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:
+You can also apply fp16, 8-bit or 4-bit weight compression on the linear and embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:

 ```bash
 optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
 ```

From e54dcd22acfe51b256584f220df0ed048e6e8fef Mon Sep 17 00:00:00 2001
From: Ella Charlaix
Date: Wed, 13 Mar 2024 12:34:08 +0100
Subject: [PATCH 6/7] fix typo

---
 docs/source/inference.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/inference.mdx b/docs/source/inference.mdx
index ac987df1d4..8f72bd6bbb 100644
--- a/docs/source/inference.mdx
+++ b/docs/source/inference.mdx
@@ -108,7 +108,7 @@ optimum-cli export openvino --model gpt2 --weight-format int8 ov_model

 This type of optimization allows to reduce the memory footprint and inference latency.

-By default the quantization scheme will be [assymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`.
+By default the quantization scheme will be [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#asymmetric-quantization), to make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md#symmetric-quantization) you can add `--sym`.

 For INT4 quantization you can also specify the following arguments :
 * The `--group-size` parameter will define the group size to use for quantization, `-1` it will results in per-column quantization.
From be866d49e75cd83771d0bc990760d05ef48e7b92 Mon Sep 17 00:00:00 2001
From: Ella Charlaix
Date: Wed, 13 Mar 2024 12:37:39 +0100
Subject: [PATCH 7/7] update paragraph

---
 docs/source/inference.mdx       | 2 +-
 docs/source/optimization_ov.mdx | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/inference.mdx b/docs/source/inference.mdx
index 8f72bd6bbb..e0b60baa2e 100644
--- a/docs/source/inference.mdx
+++ b/docs/source/inference.mdx
@@ -99,7 +99,7 @@ tokenizer.save_pretrained(save_directory)

 ### Weight-only quantization

-You can also apply fp16, 8-bit or 4-bit weight compression on the linear and embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:
+You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when exporting your model with the CLI by setting `--weight-format` to respectively `fp16`, `int8` or `int4`:

 ```bash
 optimum-cli export openvino --model gpt2 --weight-format int8 ov_model
 ```

diff --git a/docs/source/optimization_ov.mdx b/docs/source/optimization_ov.mdx
index 77dd208a25..1e78c36805 100644
--- a/docs/source/optimization_ov.mdx
+++ b/docs/source/optimization_ov.mdx
@@ -25,7 +25,7 @@ Quantization is a technique to reduce the computational and memory costs of runn

 ### Weight-only quantization

-Quantization can be applied on the model's linear and embedding layers, enabling the loading of large models on memory-limited devices. For example, when applying 8-bit quantization, the resulting model will be x4 smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach x8, but is closer to x6 in practice.
+Quantization can be applied on the model's Linear, Convolutional and Embedding layers, enabling the loading of large models on memory-limited devices. For example, when applying 8-bit quantization, the resulting model will be x4 smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach x8, but is closer to x6 in practice.


 #### 8-bit