
Commit e9c5b01

Merge branch 'master' into xin3he-patch-2
2 parents: 0978e15 + 3eb5529

14 files changed: +346, -379 lines

README.md

+6 -8
@@ -5,7 +5,7 @@ Intel® Neural Compressor
 <h3> An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, and ONNX Runtime)</h3>

 [![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/neural-compressor)
-[![version](https://img.shields.io/badge/release-3.1.1-green)](https://github.com/intel/neural-compressor/releases)
+[![version](https://img.shields.io/badge/release-3.3-green)](https://github.com/intel/neural-compressor/releases)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
 [![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
 [![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)
@@ -54,9 +54,9 @@ pip install neural-compressor[tf]
 After successfully installing these packages, try your first quantization program. **The following example code demonstrates FP8 quantization**, which is supported by the Intel Gaudi2 AI Accelerator.
 To try it on Intel Gaudi2, a Docker image with the Gaudi Software Stack is recommended; refer to the following script for environment setup. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).

-Run a container with an interactive shell,
+Run a container with an interactive shell, [more info](https://docs.habana.ai/en/latest/Installation_Guide/Additional_Installation/Docker_Installation.html#docker-installation)
 ```
-docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest
+docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.20.0/ubuntu24.04/habanalabs/pytorch-installer-2.6.0:latest
 ```
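The Python example introduced by "Run the example," below lies outside this hunk and is not shown. For orientation only, here is a minimal, hedged sketch of the FP8 post-training quantization flow with the `neural_compressor.torch` API; the toy model and dummy calibration data are illustrative placeholders rather than the README's actual code, and running it assumes a Gaudi/HPU software stack.

```python
# Hedged sketch (not the README's exact example): FP8 PTQ via neural_compressor.torch.
# Assumes the Intel Gaudi software stack (habana_frameworks) is installed.
import torch
from neural_compressor.torch.quantization import FP8Config, prepare, convert

# Toy stand-in for a real model; on Gaudi it would be moved to the "hpu" device.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())

qconfig = FP8Config(fp8_config="E4M3")  # E4M3 is the usual FP8 format choice
model = prepare(model, qconfig)         # insert measurement/observer hooks
with torch.no_grad():
    model(torch.randn(4, 16))           # dummy calibration pass
model = convert(model)                  # replace measured modules with FP8 ones
```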
 Run the example,
 ```python
@@ -173,12 +173,10 @@ model = load(

 ## Selected Publications/Events

+* arXiv: [Faster Inference of LLMs using FP8 on the Intel Gaudi](https://arxiv.org/abs/2503.09975) (Mar 2025)
+* PyTorch landscape: [PyTorch general optimizations](https://landscape.pytorch.org/) (Mar 2025)
+* Blog on SqueezeBits: [[Intel Gaudi] #4. FP8 Quantization](https://blog.squeezebits.com/intel-gaudi-4-fp8-quantization--40269) (Jan 2025)
 * EMNLP'2024: [Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs](https://arxiv.org/abs/2309.05516) (Sep 2024)
-* Blog on Medium: [Quantization on Intel Gaudi Series AI Accelerators](https://medium.com/intel-analytics-software/intel-neural-compressor-v3-0-a-quantization-tool-across-intel-hardware-9856adee6f11) (Aug 2024)
-* Blog by Intel: [Neural Compressor: Boosting AI Model Efficiency](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Neural-Compressor-Boosting-AI-Model-Efficiency/post/1604740) (June 2024)
-* Blog by Intel: [Optimization of Intel AI Solutions for Alibaba Cloud’s Qwen2 Large Language Models](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-ai-solutions-accelerate-alibaba-qwen2-llms.html) (June 2024)
-* Blog by Intel: [Accelerate Meta* Llama 3 with Intel AI Solutions](https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-meta-llama3-with-intel-ai-solutions.html) (Apr 2024)
-* EMNLP'2023 (Under Review): [TEQ: Trainable Equivalent Transformation for Quantization of LLMs](https://openreview.net/forum?id=iaI8xEINAf&referrer=%5BAuthor%20Console%5D) (Sep 2023)
 * arXiv: [Efficient Post-training Quantization with FP8 Formats](https://arxiv.org/abs/2309.14592) (Sep 2023)
 * arXiv: [Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs](https://arxiv.org/abs/2309.05516) (Sep 2023)

docs/source/3x/PT_FP8Quant.md

+50 -26
@@ -129,47 +129,71 @@ mistralai/Mistral-Nemo-Instruct-2407

 ### Running with FP8
 Refer to [here](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation#running-with-fp8).
-Change "--model_name_or_path" to be your model like
-"meta-llama/Llama-3.1-8B-Instruct",
-"Qwen/Qwen2.5-7B-Instruct", or
-"mistralai/Mixtral-8x7B-Instruct-v0.1" and so on.
-"--use_kv_cache" is to enable FP8 KV cache.
+Change "--model_name_or_path" to one of the models in the list above. "--use_kv_cache" controls whether the FP8 KV cache is enabled.

 ### Profiling
-Add "--profiling_warmup_steps 5 --profiling_steps 2 --profiling_record_shapes" as args in the end of commandline of run_generation.py.
+Append "--profiling_warmup_steps 5 --profiling_steps 2 --profiling_record_shapes" to the `run_generation.py` command line.
 Refer to [torch.profiler.ProfilerActivity.HPU](https://github.com/huggingface/optimum-habana/blob/c9e1c23620618e2f260c92c46dfeb163545ec5ba/optimum/habana/utils.py#L305).

 ### FP8 Accuracy
 "lm_eval.tasks", "lm_eval.evaluator", and "lm_eval" are installed from requirements_lm_eval.txt above. The evaluation tasks can be configured; the default set is ["hellaswag", "lambada_openai", "piqa", "winogrande"] ([more info](https://github.com/EleutherAI/lm-evaluation-harness/)).
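As a hedged illustration of how that default task set could be driven through the harness's Python API (assuming lm_eval >= 0.4; the checkpoint name below is only an illustrative choice, and the Gaudi example scripts normally handle this for you):

```python
# Sketch only: scoring the default task set with lm-evaluation-harness.
# Assumes lm_eval >= 0.4 is installed (e.g. via requirements_lm_eval.txt).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face model backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["hellaswag", "lambada_openai", "piqa", "winogrande"],
)
print(results["results"])  # per-task metrics, as summarized in the tables below
```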

-| `Llama-2-7b-hf`| fp8 & fp8 KVCache| bf16 w/ bf16 KVCache|
+| `Llama-3.1-8B-Instruct`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
 |---------------|---------|--------|
-| hellaswag | 0.5691097390957977 | 0.5704043019318861 |
-| lambada_openai| 0.7360760721909567 | 0.7372404424607025 |
-| piqa | 0.7850924918389554 | 0.7818280739934712 |
-| winogrande | 0.6929755327545383 | 0.6929755327545383 |
+| lambada_openai| 0.7299 | 0.7359 |
+| hellaswag | 0.5892 | 0.5911 |
+| piqa | 0.7965 | 0.7998 |
+| winogrande | 0.7474 | 0.7372 |
+| mmlu | 0.6599 | 0.6829 |

-| `Qwen2.5-7B-Instruct`| fp8 & fp8 KVCache| bf16 w/ bf16 KVCache|
+| `Phi-3-mini-4k-instruct`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
 |---------------|---------|--------|
-| hellaswag | 0.2539334793865764 | 0.2539334793865764 |
-| lambada_openai| 0.0 | 0.0 |
-| piqa | 0.5391730141458106 | 0.5391730141458106 |
-| winogrande | 0.4956590370955012 | 0.4956590370955012 |
+| lambada_openai| 0.6420 | 0.6552 |
+| hellaswag | 0.5866 | 0.5902 |
+| piqa | 0.8041 | 0.8014 |
+| winogrande | 0.7324 | 0.7348 |
+| mmlu | 0.7035 | 0.7055 |

-| `Llama-3.1-8B-Instruct`| fp8 & fp8 KVCache| bf16 w/ bf16 KVCache|
+| `Mistral-7B-Instruct-v0.2`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
 |---------------|---------|--------|
-| hellaswag | 0.5934076877116112 | 0.5975901214897431 |
-| lambada_openai| 0.7230739375121289 | 0.7255967397632447 |
-| piqa | 0.7932535364526659 | 0.8030467899891186 |
-| winogrande | 0.7434885556432518 | 0.7371744277821626 |
+| lambada_openai| 0.7126 | 0.7165 |
+| hellaswag | 0.6556 | 0.6609 |
+| piqa | 0.8014 | 0.8025 |
+| winogrande | 0.7253 | 0.7388 |
+| mmlu | 0.5833 | 0.5919 |

+| `Mistral-Nemo-Instruct-2407`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
+|---------------|---------|--------|
+| lambada_openai| 0.7568 | 0.7596 |
+| hellaswag | 0.6273 | 0.6325 |
+| piqa | 0.8150 | 0.8085 |
+| winogrande | 0.7419 | 0.7482 |
+| mmlu | 0.6684 | 0.6840 |
+
+| `bigscience/bloom-7b1`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
+|---------------|---------|--------|
+| lambada_openai| 0.5599 | 0.5731 |
+| hellaswag | 0.4632 | 0.4639 |
+| piqa | 0.7301 | 0.7242 |
+| winogrande | 0.6314 | 0.6393 |
+| mmlu | 0.2563 | 0.2572 |
+
+| `Mixtral-8x7B-Instruct-v0.1`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
+|---------------|---------|--------|
+| lambada_openai| 0.7805 | 0.7778 |
+| hellaswag | 0.6733 | 0.6764 |
+| piqa | 0.8324 | 0.8351 |
+| winogrande | 0.7680 | 0.7672 |
+| mmlu | 0.7031 | 0.7026 |

-| `Mixtral-8x7B-Instruct-v0.1`| fp8 & fp8 KVCache| bf16 w/ bf16 KVCache|
+| `EleutherAI/gpt-j-6b`| fp8 w/ fp8 KVCache| bf16 w/ bf16 KVCache|
 |---------------|---------|--------|
-| hellaswag | 0.25323640709022105 | 0.25323640709022105 |
-| lambada_openai| 0.0 | 0.0 |
-| piqa | 0.528835690968444 | 0.528835690968444 |
-| winogrande | 0.4956590370955012 | 0.4956590370955012 |
+| lambada_openai| 0.6769 | 0.6781 |
+| hellaswag | 0.4928 | 0.4958 |
+| piqa | 0.7557 | 0.7541 |
+| winogrande | 0.6409 | 0.6425 |
+| mmlu | 0.2524 | 0.2606 |
+> Notes: For the gpt-j model, if `--use_kv_cache` is set to enable KV cache quantization, `--reuse_cache` should also be set.

 ## VLLM example
 ### Overview

docs/source/faq.md

+16
@@ -32,3 +32,19 @@ torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed
 [AutoGPTQ/AutoGPTQ#196](https://github.com/AutoGPTQ/AutoGPTQ/issues/196).
 Try increasing `percdamp` (percent of the average Hessian diagonal to use for dampening),
 or increasing `nsamples` (the number of calibration samples).
+#### Issue 7:
+If you run GPTQ quantization with the transformers-like API on an XPU device, you may encounter the following error:
+```shell
+[ERROR][modeling_auto.py:128] index 133 is out of bounds for dimension 0 with size 128
+[ERROR][modeling_auto.py:129] Saved low bit model loading failed, please check your model.
+HINT:
+XPU device does not support `g_idx` for GPTQ quantization now. Please stay tuned.
+You can set desc_act=False.
+```
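A hedged sketch of that workaround with the transformers-like API; it assumes the API exposes a `GPTQConfig` accepting `desc_act` (mirroring the Hugging Face option), and the model name is only an illustrative placeholder:

```python
# Sketch of the hinted workaround: disable desc_act (g_idx) so the GPTQ model
# can be loaded on XPU. The GPTQConfig fields mirror the HF-style options and
# should be checked against your installed neural_compressor version.
from neural_compressor.transformers import AutoModelForCausalLM, GPTQConfig

quant_config = GPTQConfig(bits=4, group_size=128, desc_act=False)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # illustrative model
    quantization_config=quant_config,
)
```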
+#### Issue 8:
+UnicodeEncodeError: 'charmap' codec can't encode character '\u2191' in position 195: character maps to <undefined>
+**Solution:**
+```
+set PYTHONIOENCODING=UTF-8 # for windows
+export PYTHONIOENCODING=UTF-8 # for linux
+```

docs/source/publication_list.md

+4 -2
@@ -1,6 +1,8 @@
-Full Publications/Events (87)
+Full Publications/Events (89)
 ==========
-## 2025 (1)
+## 2025 (3)
+* arXiv: [Faster Inference of LLMs using FP8 on the Intel Gaudi](https://arxiv.org/abs/2503.09975) (Mar 2025)
+* PyTorch landscape: [PyTorch general optimizations](https://landscape.pytorch.org/) (Mar 2025)
 * Blog on SqueezeBits: [[Intel Gaudi] #4. FP8 Quantization](https://blog.squeezebits.com/intel-gaudi-4-fp8-quantization--40269) (Jan 2025)

 ## 2024 (7)

examples/.config/model_params_pytorch_3x.json

+8
@@ -143,6 +143,14 @@
     "main_script": "run_clm_no_trainer.py",
     "batch_size": 1
   },
+  "phi3_vlm_128k_autoround_int4":{
+    "model_src_dir": "multimodal-modeling/quantization/auto_round",
+    "dataset_location": "",
+    "input_model": "",
+    "main_script": "mllm.py",
+    "batch_size": 8,
+    "iters": 50
+  },
   "gpt_j_ipex":{
     "model_src_dir": "nlp/huggingface_models/language-modeling/quantization/static_quant/ipex",
     "dataset_location": "",
